Hi Andrzej,
thank you for taking the time to comment, I highly value your comments.
* I guess that for each case where Nutch seems inappropriate I
could give you a counter-example of Nutch being used commercially
with much success. I guess it depends on a particular application
and
Hi,
I just finished reading all the source code of the Nutch GUI, and
personally I don't like putting a lot of code snippets into JSP files,
since it takes a lot of time when refactoring. So how about adopting
Velocity/FreeMarker with servlets?
In general I agree it is the view layer and should
working with Hadoop he/she should feel free to update the patch and post it in the Hadoop JIRA.
Stefan
On 18.01.2007, at 15:39, Doug Cutting wrote:
Stefan Groschupf wrote:
We run the GUI in several production environments with patched
Hadoop code - since this is from our point of view
don't know if it is the right time to do this job.
On 1/19/07, Stefan Groschupf [EMAIL PROTECTED] wrote:
Hi,
I just finished reading all the source code of the Nutch GUI, and
personally I don't like putting a lot of code snippets into JSP
files,
since it takes a lot of time when refactoring. So how about
Hi,
great to hear people are still working on these things. It shows once more
that getting something in early would have saved some effort. :)
Just some random comments.
We run the GUI in several production environments with patched Hadoop
code - since this is from our point of view the clean approach.
Did you ever browse this: http://wiki.media-style.com/display/nutchDocu/Home
Nothing big, but it will give you some ideas, also about plugins.
On 25.11.2006, at 06:32, Armel T. Nene wrote:
I agree with you that documentation is vital, not just for extending
the current version but also for
I guess a non-official Hadoop jar is out of the question (as it moves on
so rapidly). What are the modifications required; couldn't we start
without them?
Well, then we would have an admin GUI that does not work for local
installations but only for distributed installations.
See:
Hi,
try having no regular expression filter and check if this helps.
Let me know if this solves the problem.
You may want to do a thread dump and send the log to the list to
check where exactly the fetcher freezes.
Stefan
Am 03.11.2006 um 15:53 schrieb Aisha:
Hi,
I don't know why but
There is an Eclipse JavaCC plugin.
It compiles your grammar and you can easily write test code.
However, it has its own issues, so you may just want to generate the
Java files with the Nutch Ant script and then write unit tests against
these files.
HTH
Stefan
On 10.09.2006, at 00:49, heack
Another alternative would be to construct a new workflow that just
adds the Patch Available status and still permits issues to be re-
opened.
+1
Hi Doug,
I'm pretty sure that your problem is related to the deduping of your
index.
In general the hash of the content of a page is used as the key for the
dedup tool.
We ran into the forwarding problem in another case as well:
https://issues.apache.org/jira/browse/NUTCH-353
So maybe we
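The dedup key described above — a hash of the page content — can be sketched in plain Java. This is an illustrative stand-in, not Nutch's actual dedup code; the class and method names here are made up:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: key pages by an MD5 digest of their content, so two URLs
// serving identical content collapse to the same dedup key and only
// one of them survives the dedup pass. Illustrative only.
public class ContentDedup {

    // Hex-encoded MD5 digest of the page content, used as the dedup key.
    static String contentHash(String content) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);  // MD5 is always available
        }
    }

    public static void main(String[] args) {
        String a = contentHash("<html>same body</html>");
        String b = contentHash("<html>same body</html>");
        String c = contentHash("<html>other body</html>");
        System.out.println(a.equals(b));  // identical content, same key
        System.out.println(a.equals(c));  // different content, different key
    }
}
```

Keying by content hash is exactly why a server-side redirect is troublesome: page A and page B can end up with identical content, so one of the two URLs gets discarded.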
Hi,
+ You may have problems with some imports in the parse-mp3 and parse-rtf plugins. Because of incompatibility with the Apache licence they were left out of the sources. You can find them here:
+
+ http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/
+
+
Hi Michi,
what is your motivation for that?
Stefan
Am 25.08.2006 um 06:52 schrieb Michael Wechner:
Hi
I think it would be very useful if the NutchBean would check whether the
crawl dir exists and at least issue a warning
in case it doesn't:
Index:
Hi Renaud,
I think you meant editing http://wiki.apache.org/nutch/RunNutchInEclipse,
not http://wiki.apache.org/nutch/RenaudRichardet, right?
Right! Sorry for the misunderstanding. I have no idea about your
personal page, so it would be a bad move to edit it. :-)
Thanks again
Hi Renaud,
I updated your page with some more details, I hope that is ok for you.
Thanks for creating it.
Stefan
Am 23.08.2006 um 11:51 schrieb Apache Wiki:
Dear Wiki user,
You have subscribed to a wiki page or wiki category on Nutch Wiki
for change notification.
The following page has
One must also remember that proper junit testing can be used to
verify functionality.
There's a lot of code currently that is not guarded by unit tests, and
I hereby invite everybody to participate in this endless effort and
make the Nutch unit tests better ;)
I completely agree!!!
Nutch has
[
http://issues.apache.org/jira/browse/NUTCH-354?page=comments#action_12429496 ]
Stefan Groschupf commented on NUTCH-354:
Since this issue is already closed I cannot attach the patch file, so I attach
it as text within this comment.
Hi,
Maybe some people will find this posting interesting.
Web spam is one of the biggest issues for Nutch whole-web crawls,
from my POV.
Greetings,
Stefan
During AIRWeb'06 we announced the availability of the collection.
We are currently planning a Web Spam challenge based on the
[
http://issues.apache.org/jira/browse/NUTCH-356?page=comments#action_12429534 ]
Stefan Groschupf commented on NUTCH-356:
Hi Enrico,
there will be as many PluginRepositories as Configuration objects.
So in case you create many
Versions: 0.8
Reporter: Stefan Groschupf
Priority: Blocker
Fix For: 0.8.1, 0.9.0
MapWritable recycles entries from its internal linked list for performance
reasons. The nextEntry of an entry is not reset when a recyclable entry is
found. This can cause wrong
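A minimal, self-contained sketch of this kind of recycling bug (illustrative plain Java, not the actual MapWritable code): if a node taken from the free pool keeps its old next pointer, a rebuilt list can leak stale entries into iteration; resetting the pointer on reuse fixes it.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// A list that recycles nodes through a free pool. If obtain() forgot to
// reset `next`, the tail of a rebuilt list could still point at stale
// nodes from the previous list and iteration would walk into them.
public class RecycleSketch {
    static class Entry { String key; Entry next; }

    private final Deque<Entry> free = new ArrayDeque<>();
    Entry head;

    Entry obtain(String key) {
        Entry e = free.isEmpty() ? new Entry() : free.pollFirst();
        e.key = key;
        e.next = null;  // the fix: reset nextEntry when recycling
        return e;
    }

    // Recycle all current entries, then rebuild the list by appending at
    // the tail (similar in spirit to re-reading serialized entries).
    void rebuild(String... keys) {
        for (Entry e = head; e != null; ) {
            Entry n = e.next;
            free.addLast(e);  // return node to the pool for recycling
            e = n;
        }
        head = null;
        Entry tail = null;
        for (String k : keys) {
            Entry e = obtain(k);
            if (tail == null) head = e; else tail.next = e;
            tail = e;  // note: the tail's next is never touched here
        }
    }

    int size() {
        int n = 0;
        for (Entry e = head; e != null; e = e.next) n++;
        return n;
    }

    public static void main(String[] args) {
        RecycleSketch m = new RecycleSketch();
        m.rebuild("a", "b", "c");
        m.rebuild("x");  // reuses the recycled "a" node as the new tail
        System.out.println(m.size());  // 1 with the reset; 3 without it
    }
}
```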
[ http://issues.apache.org/jira/browse/NUTCH-354?page=all ]
Stefan Groschupf updated NUTCH-354:
---
Attachment: resetNextEntryInMapWritableV1.patch
Resets the nextEntry of a recycled entry.
MapWritable, nextEntry is not reset when Entries
[
http://issues.apache.org/jira/browse/NUTCH-343?page=comments#action_12428920 ]
Stefan Groschupf commented on NUTCH-343:
Thanks for the contribution, and also for including a test with your patch. :-)
Just a small comment from taking a first look
[
http://issues.apache.org/jira/browse/NUTCH-342?page=comments#action_12428922 ]
Stefan Groschupf commented on NUTCH-342:
We should clean up logging in Nutch in general ASAP!
The way things are configured today is anything but
[
http://issues.apache.org/jira/browse/NUTCH-347?page=comments#action_12428915 ]
Stefan Groschupf commented on NUTCH-347:
Please submit this patch!
Thanks!
Build: plugins' Jars not found
[ http://issues.apache.org/jira/browse/NUTCH-341?page=all ]
Stefan Groschupf updated NUTCH-341:
---
Attachment: doNotDeleteTmpIndexMergeDirV1.patch
+1.
I agree it makes no sense at all to be required to create a tmp folder
manually and nutch deletes
[ http://issues.apache.org/jira/browse/NUTCH-337?page=all ]
Stefan Groschupf updated NUTCH-337:
---
Attachment: respectFetcherParsePropertyV1.patch
Hi Jeremy, thanks for catching this. Attached is a fix. Should be easy for a
contributor to commit
[ http://issues.apache.org/jira/browse/NUTCH-337?page=all ]
Stefan Groschupf updated NUTCH-337:
---
Priority: Major (was: Trivial)
Fetcher ignores the fetcher.parse value configured in config file
[ http://issues.apache.org/jira/browse/NUTCH-336?page=all ]
Stefan Groschupf updated NUTCH-336:
---
Priority: Critical (was: Minor)
I think that is a fundamental problem, since I observe there are many pages, e.g.
presentation slides, that have exactly
[ http://issues.apache.org/jira/browse/NUTCH-350?page=all ]
Stefan Groschupf updated NUTCH-350:
---
Attachment: protocolRetryV5.patch
This patch will dramatically increase the number of successfully fetched pages
of an intranet crawl over time
[
http://issues.apache.org/jira/browse/NUTCH-322?page=comments#action_12428858 ]
Stefan Groschupf commented on NUTCH-322:
I think this is a serious problem. Page A does a server-side redirect to Page B.
Page A is never written to the output
: 0.8.1, 0.9.0
Reporter: Stefan Groschupf
Priority: Blocker
Fix For: 0.8.1
Attachments: doNotRefecthForwarderPagesV1.patch
Pages that do a server-side forward are not written with a status change back
into the crawlDb. Also the nextFetchTime
[ http://issues.apache.org/jira/browse/NUTCH-322?page=all ]
Stefan Groschupf resolved NUTCH-322.
Resolution: Duplicate
duplicate of NUTCH-353
Fetcher discards ProtocolStatus, doesn't store redirected pages
[ http://issues.apache.org/jira/browse/NUTCH-353?page=all ]
Stefan Groschupf updated NUTCH-353:
---
Attachment: doNotRefecthForwarderPagesV1.patch
Since we discussed that Nutch needs to be more polite, we should fix this ASAP.
Pages that do a server-side
[
http://issues.apache.org/jira/browse/NUTCH-346?page=comments#action_12428917 ]
Stefan Groschupf commented on NUTCH-346:
+1
I agree, can you please create a patch file and attach it to this bug.
Thanks
Improve readability of logs
[
http://issues.apache.org/jira/browse/NUTCH-345?page=comments#action_12428918 ]
Stefan Groschupf commented on NUTCH-345:
Shouldn't the DeflateUtils also be part of the protocol-http plugin?
Also since it is a larger contribution
[
http://issues.apache.org/jira/browse/NUTCH-349?page=comments#action_12428537 ]
Stefan Groschupf commented on NUTCH-349:
My vote goes to #2.
Having a tool that needs to be started manually would be better than complicating
the already
[ http://issues.apache.org/jira/browse/NUTCH-348?page=all ]
Stefan Groschupf updated NUTCH-348:
---
Attachment: sortPatchV1.patch
What do people think about this kind of solution?
Generator is building fetch list using *lowest* scoring URLs
[
http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12428542 ]
Stefan Groschupf commented on NUTCH-233:
Hi Otis,
yes, for a serious whole-web crawl I need to change this regex first.
It only hangs with some random URLs
Reporter: Stefan Groschupf
Priority: Blocker
Fix For: 0.8-dev
When a page has no outlinks but several links to itself, e.g. it has a set of
anchors, the score of the page is distributed to its outlinks. But all these
outlinks point back to the page. This causes
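The feedback loop described here can be sketched with a toy OPIC-style distribution round (plain Java, illustrative only; not Nutch's OPICScoringFilter):

```java
import java.util.HashMap;
import java.util.Map;

// One scoring round splits a page's score among its outlinks. If the
// page's only "outlinks" are anchors on the page itself, the whole
// contribution flows straight back once anchors are normalized away,
// so the page's score never drains like other pages'.
public class SelfLinkSketch {

    // Strip the "#anchor" part so links to anchors resolve to the page.
    static String normalize(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    // One distribution round: each page sends score/outlinks per target.
    static Map<String, Double> distribute(Map<String, Double> score,
                                          Map<String, String[]> outlinks) {
        Map<String, Double> next = new HashMap<>();
        for (Map.Entry<String, Double> e : score.entrySet()) {
            String[] targets = outlinks.getOrDefault(e.getKey(), new String[0]);
            if (targets.length == 0) continue;
            double share = e.getValue() / targets.length;
            for (String t : targets) {
                next.merge(normalize(t), share, Double::sum);
            }
        }
        return next;
    }

    public static void main(String[] args) {
        Map<String, Double> score = new HashMap<>();
        score.put("http://a/", 1.0);

        Map<String, String[]> outlinks = new HashMap<>();
        // The page's only "outlinks" are anchors on the page itself.
        outlinks.put("http://a/",
                new String[] {"http://a/#top", "http://a/#mid"});

        Map<String, Double> next = distribute(score, outlinks);
        System.out.println(next.get("http://a/"));  // 1.0: it all came back
    }
}
```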
[
http://issues.apache.org/jira/browse/NUTCH-318?page=comments#action_12423539 ]
Stefan Groschupf commented on NUTCH-318:
Yes, this happens only in a distributed environment. Please also see my last
mail on the hadoop dev list. I think
[
http://issues.apache.org/jira/browse/NUTCH-318?page=comments#action_12423433 ]
Stefan Groschupf commented on NUTCH-318:
Shouldn't that be fixed in .8, since as of today this tool just produces no output?!
log4j not proper configured
[
http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12423438 ]
Stefan Groschupf commented on NUTCH-233:
I think this should be fixed in .8 too, since everybody that does a real whole-
web crawl with over 100 million pages
Hi developers,
we have commands like readdb and readlinkdb, but segread. Wouldn't it be
more consistent to name the command readseg instead of segread?
... just a thought.
Stefan
I like it!
Am 24.07.2006 um 16:10 schrieb Andrzej Bialecki:
Stefan Neufeind wrote:
Andrzej Bialecki wrote:
Stefan Groschupf wrote:
Hi developers,
we have commands like readdb and readlinkdb, but segread. Wouldn't
it be more consistent to name the command readseg instead of segread?
... just
Hi Sami,
I cannot confirm this problem:
jacy:~/nutch-trunk-tmp joa$ svn update .
At revision 424865.
[...]
test:
BUILD SUCCESSFUL
Total time: 2 minutes 6 seconds
So it works for me.
Stefan
Am 23.07.2006 um 13:27 schrieb Sami Siren:
SVN trunk gave me a failure on
testcase
Hi,
I remember there was a search result comparison tool within Nutch.
Is that still alive? How do I use it / find it? I was not able to find
it by browsing the trunk sources.
Is there any such tool people can suggest, to compare search results
with Yahoo or Google results to play with
Reporter: Stefan Groschupf
Priority: Minor
Fix For: 0.8-dev
processTopNJob runs two jobs and both have no job name set.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http
Key: NUTCH-325
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Minor
Fix For: 0.8-dev
In the URLFilters constructor we use an array whose length equals the number of
filters defined in the urlfilter.order property
[ http://issues.apache.org/jira/browse/NUTCH-325?page=all ]
Stefan Groschupf updated NUTCH-325:
---
Attachment: UrlFiltersNPE.patch
A patch that uses an ArrayList instead of an array and only puts entries into the
list when the entry is not null. This means
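The idea of the patch can be sketched as follows (illustrative names, not the actual URLFilters code): collect only the filters that actually resolved, so later iteration never hits a null slot.

```java
import java.util.ArrayList;
import java.util.List;

// Instead of a fixed-size array that can hold null slots (and later
// cause a NullPointerException when a filter named in urlfilter.order
// is not actually activated), keep only the filters that resolved.
public class FilterListSketch {

    interface UrlFilter { String filter(String url); }

    // A missing (null) filter simply isn't added to the list.
    static List<UrlFilter> resolve(UrlFilter... maybeFilters) {
        List<UrlFilter> filters = new ArrayList<>();
        for (UrlFilter f : maybeFilters) {
            if (f != null) filters.add(f);
        }
        return filters;
    }

    public static void main(String[] args) {
        UrlFilter keepHttp = url -> url.startsWith("http") ? url : null;
        // One configured filter failed to load and resolved to null:
        List<UrlFilter> filters = resolve(keepHttp, null);

        String url = "http://example.org/";
        for (UrlFilter f : filters) {   // safe: no null entries to hit
            if (url != null) url = f.filter(url);
        }
        System.out.println(url);
    }
}
```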
Hi Developers,
another thing in the discussion about being more polite.
I suggest that we log a message in case a requested URL was blocked
by a robots.txt.
Optimal would be if we only logged this message in case the currently used
agent name is specifically blocked and it is not a general blocking of all
Hi developers,
in the nutch-default.xml property plugin.includes we say: "In any case
you need at least include the nutch-extensionpoints plugin."
But we do not include it by default.
value: protocol-http|urlfilter-regex|parse-(text|html|js)|index-
I may - but since you know the details of the plugin subsystem,
tell me what _should_ be there? I.e. should we really include it in
the plugin.includes list, or not?
This is a philosophical question.
I personally prefer restrictive definitions, since application behavior
is better
Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Critical
Fix For: 0.8-dev
Using CrawlDatum.set(aOtherCrawlDatum) copies the data from one CrawlDatum to
another.
But only a reference to the MapWritable is passed. This means both
[ http://issues.apache.org/jira/browse/NUTCH-323?page=all ]
Stefan Groschupf updated NUTCH-323:
---
Attachment: MapWritableCopyConstructor.patch
The attached patch adds a copy constructor to MapWritable and uses it in the
CrawlDatum.set method. However
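The shared-reference problem and the copy-constructor fix can be illustrated with a plain-Java stand-in (a HashMap takes the place of MapWritable here; the names are made up):

```java
import java.util.HashMap;
import java.util.Map;

// If set() copies only the reference to the metadata map, two datums
// end up sharing one map, and a change through one is visible through
// the other. The copy constructor gives each datum its own map.
public class DatumSketch {
    Map<String, String> metaData = new HashMap<>();

    // Buggy variant: both datums now share the same map instance.
    void setShared(DatumSketch other) {
        this.metaData = other.metaData;
    }

    // Fixed variant: copy the entries into a fresh map (the role the
    // MapWritable copy constructor plays in the attached patch).
    void setCopied(DatumSketch other) {
        this.metaData = new HashMap<>(other.metaData);
    }

    public static void main(String[] args) {
        DatumSketch a = new DatumSketch();
        a.metaData.put("ip", "10.0.0.1");

        DatumSketch shared = new DatumSketch();
        shared.setShared(a);
        shared.metaData.put("ip", "10.0.0.2");
        System.out.println(a.metaData.get("ip"));  // "10.0.0.2": a mutated

        DatumSketch copied = new DatumSketch();
        a.metaData.put("ip", "10.0.0.1");
        copied.setCopied(a);
        copied.metaData.put("ip", "10.0.0.3");
        System.out.println(a.metaData.get("ip"));  // still "10.0.0.1"
    }
}
```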
Components: fetcher
Reporter: Stefan Groschupf
Priority: Critical
Configuration properties db.score.link.external and db.score.link.internal are
ignored.
In the case of e.g. message-board webpages, or pages that have large navigation
menus on each page, having a lower impact
[ http://issues.apache.org/jira/browse/NUTCH-324?page=all ]
Stefan Groschupf updated NUTCH-324:
---
Attachment: InternalAndExternalLinkScoreFactor.patch
Multiplies the score of a page during distributeScoreToOutlink with
db.score.link.internal
[ http://issues.apache.org/jira/browse/NUTCH-319?page=all ]
Stefan Groschupf resolved NUTCH-319.
Resolution: Won't Fix
Sorry, that is bogus, since it is written to the logging stream.
OPICScoringFilter should use logging API instead
Hi,
shouldn't db.max.inlinks be in the nutch-default.xml configuration?
Stefan
Andrzej,
... in LinkDb line 114, in the configure method, and it is used in
lines 168 and 176.
Stefan
Am 18.07.2006 um 16:02 schrieb Andrzej Bialecki:
Stefan Groschupf wrote:
Hi,
shouldn't db.max.inlinks be in the nutch-default.xml configuration?
Where is this used?
--
Best regards
Hi,
OPICScoringFilter line 91:
content.getMetadata().set(Fetcher.SCORE_KEY, + datum.getScore());
and at lines 96 and 102 we set and get the fetch score as Strings. :-o
Wouldn't it be better to have the Metadata support floats as well,
instead of serializing and parsing strings?
In general, wouldn't it
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Assigned To: Andrzej Bialecki
Priority: Trivial
Fix For: 0.8-dev
OPICScoringFilter line 107 should be a logging call instead of
e.printStackTrace(LogUtil.getWarnStream(LOG)), shouldn't it?
changes in
version 0.8. The problem is the log message does not say what file
is not
found. So, it's hard to debug. Any idea?
Thanks,
AJ
On 7/9/06, Stefan Groschupf [EMAIL PROTECTED] wrote:
Try to put the conf folder on your classpath in Eclipse and set the
environment variables
: Stefan Groschupf
Priority: Critical
Fix For: 0.8-dev
In the latest .8 sources the readdb command doesn't dump any information
anymore.
This is related to the misconfigured log4j.properties file.
changing:
log4j.rootLogger=INFO,DRFA
to:
log4j.rootLogger=INFO,DRFA,stdout
dumps
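The fix amounts to adding a console appender to the root logger. A minimal log4j.properties along those lines might look like this (the appender name DRFA follows the snippet above; the file path and layout patterns are assumptions, not the shipped configuration):

```properties
# Root logger: keep the daily rolling file appender (DRFA) and also
# write to stdout so command-line tools like readdb show their output.
log4j.rootLogger=INFO,DRFA,stdout

# stdout appender
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} - %m%n

# daily rolling file appender (path is an assumption)
log4j.appender.DRFA=org.apache.log4j.DailyRollingFileAppender
log4j.appender.DRFA.File=logs/hadoop.log
log4j.appender.DRFA.DatePattern=.yyyy-MM-dd
log4j.appender.DRFA.layout=org.apache.log4j.PatternLayout
log4j.appender.DRFA.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} - %m%n
```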
Try to put the conf folder on your classpath in Eclipse and set the
environment variables that are set in bin/nutch.
Btw, please do not crosspost.
Thanks.
Stefan
Am 09.07.2006 um 21:47 schrieb AJ Chen:
I checked out the 0.8 code from trunk and tried to set it up in
eclipse.
When
Hi,
this question is difficult to answer, and maybe there are more experts on
the Nutch user list than on the developer list.
In Nutch 0.8 you can use the new scoring API to change the scoring of
a page for being scheduled for crawling, based on its scores.
Have a look at the OPIC score
Hi Jérôme,
I have the same problem in a distributed environment! :-(
So I think I can confirm this is a bug.
We should fix that.
Stefan
On 06.07.2006, at 08:54, Jérôme Charron wrote:
Hi,
I encountered some problems with Nutch trunk version.
In fact it seems to be related to changes related to
We tried your suggested fix in the Injector: mergeJob.setInputPath(tempDir)
(instead of mergeJob.addInputPath(tempDir)),
and this worked without any problem.
Thanks for catching that, this saved us a lot of time.
Stefan
On 07.07.2006, at 16:08, Jérôme Charron wrote:
I have the same problem on
+1, but I really would love to see NUTCH-293 as part of Nutch .8,
since this is all about being more polite.
Thanks.
Stefan
On 05.07.2006, at 03:46, Doug Cutting wrote:
+1
Piotr Kosiorowski wrote:
+1.
P.
Andrzej Bialecki wrote:
Sami Siren wrote:
How would folks feel about releasing 0.8
Hi,
as far as I can see, Nutch's HTML parser only supports the meta tag
noindex (meta name=ROBOTS content=NOINDEX,NOFOLLOW), but there
is an unofficial HTML noindex tag:
http://www.webmasterworld.com/forum10003/2703.htm
Maybe this would be another thing to make Nutch more polite.
Also please
wrongly configured log4j.properties
-
Key: NUTCH-307
URL: http://issues.apache.org/jira/browse/NUTCH-307
Project: Nutch
Type: Bug
Reporter: Stefan Groschupf
Priority: Blocker
Fix For: 0.8-dev
In nutch/conf is only one
Hi Feng,
MapWritable is a kind of hashmap.
You can put in any key/value pair, but the keys and values need to be
Writables:
http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/io/Writable.html
You can use UTF8 as string key and value, or ByteWritable as key and
UTF8 as value.
Etc.
Hi Lourival,
this means all pages older than 30 days are potential candidates for
a fetch list that is created by the segment generation process.
Stefan
Am 12.06.2006 um 16:33 schrieb Lourival Júnior:
Hi all!
I have a question about nutch-default.xml configuration file. There
is a
[ http://issues.apache.org/jira/browse/NUTCH-289?page=all ]
Stefan Groschupf updated NUTCH-289:
---
Attachment: ipInCrawlDatumDraftV5.patch
Release Candidate 1 of this patch.
This patch contains:
+ add IP Address to CrawlDatum Version 5 (as byte[4
[ http://issues.apache.org/jira/browse/NUTCH-289?page=all ]
Stefan Groschupf updated NUTCH-289:
---
Attachment: ipInCrawlDatumDraftV4.patch
Attached is a patch that always uses only 4 bytes for the IP, meaning we
ignore IPv6. This saves us 4 bytes
[
http://issues.apache.org/jira/browse/NUTCH-293?page=comments#action_12415171 ]
Stefan Groschupf commented on NUTCH-293:
Any comments? There was already a posting on the nutch-agent mailing list
where someone had banned Nutch, since Nutch does
As far as I understand, Hadoop uses Commons Logging. Should we switch to
using Commons Logging as well?
Am 06.06.2006 um 11:02 schrieb Jérôme Charron:
URL: http://svn.apache.org/viewvc?rev=411943&view=rev
Log:
Updating to Hadoop release 0.3.1. Hadoop now uses Jakarta Commons
Logging, configured
Hi,
is there a known problem with Hadoop .3.1 and Nutch classloading or
job file usage?
I wrote a custom tool and want to start it via:
bin/nutch myclass crawldb 1000
But found only following exception in the task reporter messages:
java.lang.RuntimeException: java.lang.RuntimeException:
[
http://issues.apache.org/jira/browse/NUTCH-258?page=comments#action_12414763 ]
Stefan Groschupf commented on NUTCH-258:
Scott,
I agree with you. However, we need a clean patch to solve the problem; we
cannot just comment things out of the code
[ http://issues.apache.org/jira/browse/NUTCH-289?page=all ]
Stefan Groschupf updated NUTCH-289:
---
Attachment: ipInCrawlDatumDraftV1.patch
To keep the discussion alive, attached is a _first draft_ for storing the IP in
the crawlDatum, for public discussion
I have a proposal for a simple solution: set a flag in the current
Configuration instance, and check for this flag. The Configuration
instance provides a task-specific context persisting throughout the
lifetime of a task - but limited only to that task. Voila - problem
solved. We get
[ http://issues.apache.org/jira/browse/NUTCH-298?page=all ]
Stefan Groschupf updated NUTCH-298:
---
Summary: if a 404 for a robots.txt is returned a NPE is thrown (was: if a
404 for a robots.txt is returned no page is fetched at all from the host
Hi,
an interesting tool:
http://tool.motoricerca.info/spam-detector/
Stefan
___
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
The idea to have
something like this as a Nutch module (dropping pages or ranking them
very low) might come up :-)
That will be a very long road.
I collected some thoughts and a list of web-spam-related papers on my
blog:
http://www.find23.net/Web-Site/blog/521BA1CD-14C4-4E84-A072-
sandbox svn folder
--
Key: NUTCH-297
URL: http://issues.apache.org/jira/browse/NUTCH-297
Project: Nutch
Type: Sub-task
Reporter: Stefan Groschupf
Assigned to: Doug Cutting
Priority: Trivial
Having an svn sandbox repository would allow
: Stefan Groschupf
Fix For: 0.8-dev
What happens:
If no RobotRuleSet is in the cache for a host, we try to fetch the
robots.txt.
In case the http response code is not 200 or 403 but, for example, 404, we do
robotRules = EMPTY_RULES; (line: 402).
EMPTY_RULES is a RobotRuleSet created
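The failure mode and the small null-guard style of fix can be sketched in plain Java (illustrative stand-in, not the actual RobotRulesParser code):

```java
import java.util.ArrayList;
import java.util.List;

// A rule set whose entries list was never initialized NPEs when asked
// isAllowed(). The two-line style of fix is a null guard: an empty
// rule set (e.g. after a 404 for robots.txt) allows everything.
public class RobotRulesSketch {
    // null when constructed as EMPTY_RULES with no parsed entries
    private List<String> disallowPrefixes;

    static RobotRulesSketch emptyRules() {
        return new RobotRulesSketch();  // e.g. robots.txt returned 404
    }

    static RobotRulesSketch withDisallow(String prefix) {
        RobotRulesSketch r = new RobotRulesSketch();
        r.disallowPrefixes = new ArrayList<>();
        r.disallowPrefixes.add(prefix);
        return r;
    }

    boolean isAllowed(String path) {
        if (disallowPrefixes == null) {
            return true;  // the fix: empty rule set allows everything
        }
        for (String prefix : disallowPrefixes) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // A host with no robots.txt (404) must not block fetching:
        System.out.println(RobotRulesSketch.emptyRules().isAllowed("/page"));
        System.out.println(RobotRulesSketch.withDisallow("/private")
                .isAllowed("/private/x"));
    }
}
```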
[ http://issues.apache.org/jira/browse/NUTCH-298?page=all ]
Stefan Groschupf updated NUTCH-298:
---
Attachment: fixNpeRobotRuleSet.patch
Fixes the NPE in RobotRuleSet that happens in case an empty RuleSet is used
if a 404 for a robots.txt is returned no page
Hi,
just posted a fix for an NPE in case an empty RobotRuleSet is used.
The patch only contains a two-line fix, since I learned that this is the
best way to get things committed sooner. :)
However, I really don't like the RobotRuleSet implementation, since
entries are copied between an ArrayList and a
[
http://issues.apache.org/jira/browse/NUTCH-282?page=comments#action_12414435 ]
Stefan Groschupf commented on NUTCH-282:
Is that related to the host grouping we discussed? Can we close this
bug in that case?
Showing too few results on a page (Paging
[
http://issues.apache.org/jira/browse/NUTCH-286?page=comments#action_12414439 ]
Stefan Groschupf commented on NUTCH-286:
This is difficult to realize, since the http error code is read from the response
in the fetcher and set into the protocol
[
http://issues.apache.org/jira/browse/NUTCH-292?page=comments#action_12414443 ]
Stefan Groschupf commented on NUTCH-292:
+1, Can someone create a clean patch file?
OpenSearchServlet: OutOfMemoryError: Java heap space
[
http://issues.apache.org/jira/browse/NUTCH-291?page=comments#action_12414445 ]
Stefan Groschupf commented on NUTCH-291:
lastModified will only be indexed if you switch on the index-more plugin.
If you think you should change the way lastModified
[
http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414448 ]
Stefan Groschupf commented on NUTCH-290:
If a parser throws an exception:
Fetcher, 261:
try {
parse = this.parseUtil.parse(content);
parseStatus
[ http://issues.apache.org/jira/browse/NUTCH-287?page=all ]
Stefan Groschupf closed NUTCH-287:
--
Resolution: Won't Fix
http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg04696.html
Exception when searching with sort
[ http://issues.apache.org/jira/browse/NUTCH-284?page=all ]
Stefan Groschupf closed NUTCH-284:
--
Resolution: Won't Fix
Yes, I was missing index-basic.
NullPointerException during index
-
Key: NUTCH-284
[
http://issues.apache.org/jira/browse/NUTCH-284?page=comments#action_12414453 ]
Stefan Groschupf commented on NUTCH-284:
Please try to discuss such things first on the user mailing list before opening
an issue.
Maintaining the issue tracking is very time
[
http://issues.apache.org/jira/browse/NUTCH-281?page=comments#action_12414454 ]
Stefan Groschupf commented on NUTCH-281:
Can you submit a patch file?
cached.jsp: base-href needs to be outside comments
[
http://issues.apache.org/jira/browse/NUTCH-275?page=comments#action_12414456 ]
Stefan Groschupf commented on NUTCH-275:
Should we switch off mime.type.magic by default?
Some people were reporting the same problems.
Fetcher not parsing XHTML
[
http://issues.apache.org/jira/browse/NUTCH-274?page=comments#action_12414457 ]
Stefan Groschupf commented on NUTCH-274:
Should we fix this in Hadoop's TextInputFormat, to ignore empty lines, or in the
Injector?
Empty row in/at end of URL-list
[ http://issues.apache.org/jira/browse/NUTCH-274?page=all ]
Stefan Groschupf updated NUTCH-274:
---
Attachment: ignoreEmpthyLineDuringInjectV1.patch
Ignore empty lines during injecting.
Thanks for spotting this, Stefan!
Empty row in/at end of URL-list
[
http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414469 ]
Stefan Groschupf commented on NUTCH-290:
As far as I understand the code, the next parser is only used if the previous
parser returns with an unsuccessful parsing status