[ https://issues.apache.org/jira/browse/NUTCH-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13450461#comment-13450461 ]
Ferdy Galema commented on NUTCH-872:
------------------------------------
Christian, I ran a test crawl with the Nutch 2.x branch and it seems to work
correctly:
bin/nutch inject ~/urls/
bin/nutch generate
bin/nutch fetch -Dfetcher.parse=true -Dfetcher.store.content=false theBatchId
When I now check HBase, the content family is empty for the fetched/parsed
URLs, and they are parsed correctly.
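For reference, the two -D options used in the fetch command could presumably also be set persistently instead of being passed on the command line. A minimal sketch, assuming the usual conf/nutch-site.xml override file and that the fragment sits inside its <configuration> element (the values simply mirror the test run above, not a recommendation):

```xml
<!-- Sketch only: mirrors -Dfetcher.parse=true -Dfetcher.store.content=false -->
<property>
  <name>fetcher.parse</name>
  <value>true</value>
</property>
<property>
  <name>fetcher.store.content</name>
  <value>false</value>
</property>
```

With these in place, the fetch step would be run as plain bin/nutch fetch theBatchId; options given with -D on the command line should still take precedence for a single run.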
If your problem persists, please explain in detail how you run the
crawl. (Also, it is better to post this to the mailing list next time.)
> Change the default fetcher.parse to FALSE
> -----------------------------------------
>
> Key: NUTCH-872
> URL: https://issues.apache.org/jira/browse/NUTCH-872
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.2, 1.3, nutchgora
> Reporter: Andrzej Bialecki
>
> I propose to change this property to false. The reason is that it's a safer
> default - parsing issues don't lead to a loss of the downloaded content. For
> larger crawls this is the recommended way to run Fetcher. Users that run
> smaller crawls can still override it.