Doug Cutting wrote:
Jérôme Charron wrote:

Then the question is: where does the TheDateToCheck value come from?
1. From the previously indexed document (I know that this information is stored): this certainly consumes more processing time than the second solution. My knowledge of Nutch internals is not enough to know how to quickly retrieve this information from the document's URL... can someone help us on this point?


The most efficient place to store this would be in the pagedb. What's stored there currently is the nextFetch date and the fetchInterval. This could be changed to lastModified and fetchInterval, with nextFetch calculated as lastModified+fetchInterval. Both can be updated in UpdateDatabaseTool.java. If the lastModified has not changed, then the fetchInterval can be increased accordingly.
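
A minimal sketch of the scheme described above, for illustration only. The class FetchSchedule, its method names, the growth factor, and the min/max bounds are all hypothetical and are not part of the Nutch pagedb or UpdateDatabaseTool. Shrinking the interval when the page has changed is also an assumption; the text above only says the interval can be increased when lastModified is unchanged.

```java
// Hypothetical illustration; not the actual Nutch Page or pagedb classes.
class FetchSchedule {
    static final long MS_PER_DAY = 24L * 60 * 60 * 1000;

    long lastModified;    // milliseconds since the epoch
    float fetchInterval;  // days (assumed unit; float allows fractional days)

    FetchSchedule(long lastModified, float fetchInterval) {
        this.lastModified = lastModified;
        this.fetchInterval = fetchInterval;
    }

    // nextFetch is derived rather than stored: lastModified + fetchInterval.
    long nextFetch() {
        // Widen to double before multiplying to avoid float rounding.
        return lastModified + (long) ((double) fetchInterval * MS_PER_DAY);
    }

    // Called after a fetch with the modification date that was observed.
    // growth, minDays and maxDays are assumed tuning parameters.
    void update(long observedLastModified, float growth,
                float minDays, float maxDays) {
        if (observedLastModified == lastModified) {
            // Unchanged page: fetch less often, up to the upper bound.
            fetchInterval = Math.min(fetchInterval * growth, maxDays);
        } else {
            // Changed page: record the new date and fetch more often
            // (assumption: the original text does not specify this branch).
            lastModified = observedLastModified;
            fetchInterval = Math.max(fetchInterval / growth, minDays);
        }
    }
}
```

With these assumed parameters, a page that stays unchanged across two updates with growth 2.0 and a 90-day cap goes from a 30-day interval to 60 and then to 90 days.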

If you remember, we discussed this some time ago. We came to the conclusion that in order to properly support a varying fetchInterval, its type needs to be changed to a float (or a range-reduced float like in Lucene), and a lot of changes need to be made in the API between the fetcher and the plugins.


This would be a good time to start, since there are upcoming incompatible changes in the API and disk formats... Within a couple of days I'll try to refresh the patchset I had then and put it up for review.

--
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


