[ 
https://issues.apache.org/jira/browse/NUTCH-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13450461#comment-13450461
 ] 

Ferdy Galema commented on NUTCH-872:
------------------------------------

Christian, I ran a test crawl with the Nutch 2.x branch and it does seem to 
work correctly:

bin/nutch inject ~/urls/
bin/nutch generate
bin/nutch fetch -Dfetcher.parse=true -Dfetcher.store.content=false theBatchId

I then checked HBase: the content family is empty for the fetched/parsed 
URLs, and the pages were parsed correctly.
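
As far as I know, the same two settings can also be made persistent in 
conf/nutch-site.xml instead of being passed as -D flags on every run. A 
minimal sketch, assuming the usual Hadoop-style override file that takes 
precedence over nutch-default.xml (the values simply mirror the command 
above):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Parse pages during the fetch step (same as -Dfetcher.parse=true). -->
  <property>
    <name>fetcher.parse</name>
    <value>true</value>
  </property>
  <!-- Do not store raw page content (same as -Dfetcher.store.content=false). -->
  <property>
    <name>fetcher.store.content</name>
    <value>false</value>
  </property>
</configuration>
```

A -D flag on the command line should still override these values for a 
single run.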

If your problem persists, please explain in detail how you run the crawl. 
(Also, it is better to post questions like this to the mailing list next 
time.)
                
> Change the default fetcher.parse to FALSE
> -----------------------------------------
>
>                 Key: NUTCH-872
>                 URL: https://issues.apache.org/jira/browse/NUTCH-872
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.2, 1.3, nutchgora
>            Reporter: Andrzej Bialecki 
>
> I propose to change this property to false. The reason is that it's a safer 
> default - parsing issues don't lead to a loss of the downloaded content. For 
> larger crawls this is the recommended way to run Fetcher. Users that run 
> smaller crawls can still override it.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
