[ 
https://issues.apache.org/jira/browse/NUTCH-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15178568#comment-15178568
 ] 

Lewis John McGibbney commented on NUTCH-2222:
---------------------------------------------

Nice [~abenjell], 
In Nutch 2.x we use 
[MemStore|https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/storage/StorageUtils.java#L90-L95]
 as the Gora datastore implementation for tests.
If you could use this implementation it would be great. This greatly speeds up 
tests and also means we don't introduce more dependencies in Nutch.
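For illustration, a minimal sketch of selecting MemStore through configuration. It assumes the {{storage.data.store.class}} property and the class name {{org.apache.gora.memory.store.MemStore}}; please check both against the 2.x branch before relying on them.

{code:xml}
<!-- Assumed property name and class; verify against nutch-default.xml on 2.x -->
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.memory.store.MemStore</value>
</property>
{code}

With a setting like this in the test configuration, the store lives in memory, so the test should not need an external MongoDB or HBase instance.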
Thank you for working on a solution. I propose to push a release candidate once 
we fix this bug.

> re-fetch deletes all  metadata except _csh_ and _rs_
> ----------------------------------------------------
>
>                 Key: NUTCH-2222
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2222
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 2.3.1
>         Environment: Centos 6, mongodb 2.6 and mongodb 3.0 and 
> hbase-0.98.8-hadoop2
>            Reporter: Adnane B.
>            Assignee: Lewis John McGibbney
>             Fix For: 2.3.2
>
>
> This problem happens the second time I crawl a page.
> {code}
> bin/nutch inject urls/
> bin/nutch generate -topN 1000
> bin/nutch fetch  -all
> bin/nutch parse -force   -all
> bin/nutch updatedb  -all
> {code}
> Second time (re-fetch):
> {code}
> bin/nutch generate -topN 1000   # batchid changes for all existing pages
> bin/nutch fetch  -all           # *** metadata is deleted for all pages already crawled ***
> bin/nutch parse -force   -all
> bin/nutch updatedb  -all
> {code}
> I can reproduce it with mongodb 2.6, mongodb 3.0, and hbase-0.98.8-hadoop2.
> It happens only if the page has not changed.
> To reproduce it easily, add the following to nutch-site.xml:
> {code}
> <property>
>   <name>db.fetch.interval.default</name>
>   <value>60</value>
>   <description>The default number of seconds between re-fetches of a page (1 minute)</description>
> </property>
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
