Re: Not renewing CrawlDatum on Inject

Robert Young Tue, 10 Jul 2007 01:19:48 -0700

Would you say it's worth writing it up as a patch and adding it to JIRA?


On 7/9/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

Robert Young wrote:
> I have been trying to get to grips with
> org.apache.nutch.crawl.Injector to help with a requirement I have for
> the project I'm working on and I'm a little confused about one place.
> On lines 120 - 121 any existing CrawlDatum is used instead of the
> newly injected one. This doesn't seem to make sense from my point of
> view; I'm guessing it's just a matter of not being able to see the
> issue from the other side. My scenario is this: when I
> inject a URL it is because I want it re-indexed, perhaps because
> it has changed. I don't care if that URL is already in the crawldb; I
> want it re-indexed. As far as I can see, if this weren't the case I
> wouldn't be trying to inject it.
>
> What am I missing here? Why is the existing CrawlDatum used instead of
> the newly injected one?

That's indeed a place in Nutch that I have planned to change for a long
time... This behavior is not obvious and, what's worse, it's undocumented.

It would be relatively simple to extend this behavior so that only
selected parts of data would be updated or replaced when a seed list
contains the same URL as the one already in CrawlDb.

For now, just add the code that you need in Injector.InjectReducer.
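
To make the idea concrete, here is a rough, self-contained sketch of the kind
of merge logic one might put in the reducer. It does not use the real Nutch or
Hadoop classes: Datum is a simplified stand-in for CrawlDatum, and the
"overwrite" flag, the merge() method and the choice of which fields to replace
are all assumptions for illustration, not existing Nutch options.

```java
import java.util.Arrays;
import java.util.List;

public class InjectMergeSketch {

    /** Simplified stand-in for CrawlDatum; only the fields this sketch needs. */
    static final class Datum {
        final boolean injected; // true if the entry comes from the seed list
        final long fetchTime;   // time at which the URL is next due for fetching
        final float score;

        Datum(boolean injected, long fetchTime, float score) {
            this.injected = injected;
            this.fetchTime = fetchTime;
            this.score = score;
        }
    }

    /**
     * Picks the value to store for one URL, given every entry seen for it.
     * With overwrite == false the existing entry wins, which is the behavior
     * described in this thread. With overwrite == true (hypothetical) only the
     * fetch time is taken from the injected entry and the old score is kept.
     * Returns null if the list is empty.
     */
    static Datum merge(List<Datum> values, boolean overwrite) {
        Datum existing = null;
        Datum fresh = null;
        for (Datum d : values) {
            if (d.injected) {
                fresh = d;
            } else {
                existing = d;
            }
        }
        if (existing == null) {
            // URL not in the db yet: store the injected entry as a normal one.
            return fresh == null ? null : new Datum(false, fresh.fetchTime, fresh.score);
        }
        if (fresh == null || !overwrite) {
            return existing;
        }
        // Selective update: make the URL due again, keep what was learned before.
        return new Datum(false, fresh.fetchTime, existing.score);
    }

    public static void main(String[] args) {
        Datum old = new Datum(false, 5000L, 2.0f);
        Datum seed = new Datum(true, 0L, 1.0f);
        Datum merged = merge(Arrays.asList(old, seed), true);
        System.out.println(merged.fetchTime + " " + merged.score);
    }
}
```

Setting the injected entry's fetch time to "now" (0 in the example) is what
would make the URL eligible for re-fetching and so re-indexing; keeping the
old score is just one possible choice of which parts to preserve.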


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
