Hi Jessica and Brooks,

On Fri, Jun 19, 2015 at 10:06 AM, <user-digest-h...@nutch.apache.org> wrote:

[snip]


>
>         Notice the 'prevFetchTime' field has been updated to show the next
> date when this URL should be crawled (30 days from now - July 19).  I
> assume this is exactly what SHOULD happen.
>

Correct.


>
>         Note, the tstamp is a month from now.


ack


> I'm not sure if nutch relies on the data in elasticsearch to know when it
> should reindex (though I don't see why it would - that decision would be
> made based on when it needs to refetch and whether or not anything has
> changed, right?).
>

No, Nutch does not rely on data in Elasticsearch. Crawling and indexing
are separate, independent tasks.


>
>         I would think that even IF Nutch needs to have the future date in
> Elasticsearch, it should send in the actual fetch time (i.e. the
> 'prevFetchTime' field).
>

Correct. There seems to be a bug here, which you have both identified.


>
>         I've been looking through some of the source code and the problem
> does NOT appear to be in the Elasticsearch Indexer plugin as it simply
> iterates through all of the key/value pairs and inserts them.
>

Same here, and yes, I can confirm this is true.

The bug is in BasicIndexingFilter:

https://github.com/apache/nutch/blob/2.x/src/plugin/index-basic/src/java/org/apache/nutch/indexer/basic/BasicIndexingFilter.java#L127-L130

We need to check whether prevFetchTime is null: if so, use the
fetchTime; otherwise use the prevFetchTime.
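
To illustrate, here is a minimal sketch of that check. It assumes
prevFetchTime is a nullable Long and fetchTime is a long, both in epoch
milliseconds. The class and method names are hypothetical and this is not
the actual BasicIndexingFilter code, just the shape of the logic:

```java
// Hypothetical helper, not part of Nutch: it only illustrates the proposed check.
final class TstampSketch {

    private TstampSketch() {
        // static helper only, no instances
    }

    /**
     * Picks the timestamp that should be sent to the index.
     *
     * @param prevFetchTime time of the actual fetch in epoch ms (assumed nullable)
     * @param fetchTime     next scheduled fetch time in epoch ms
     * @return prevFetchTime when it is set, otherwise fetchTime
     */
    static long chooseTimestamp(Long prevFetchTime, long fetchTime) {
        if (prevFetchTime == null) {
            // No previous fetch time recorded: fall back to fetchTime.
            return fetchTime;
        }
        // Use the actual fetch time rather than the future, scheduled one.
        return prevFetchTime.longValue();
    }
}
```

With this, a page fetched on June 19 and scheduled for refetch on July 19
would be indexed with the June 19 value.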

Can one of you please open an issue and submit a patch for this? If not,
I can create the issue and submit one myself. It is a trivial fix, but one
we need to make.
Good catch, folks.
By the way, this affects trunk as well:
https://github.com/apache/nutch/blob/trunk/src/plugin/index-basic/src/java/org/apache/nutch/indexer/basic/BasicIndexingFilter.java#L136
Lewis
