Robert Young wrote:
I have been trying to get to grips with
org.apache.nutch.crawl.Injector to help with a requirement I have for
the project I'm working on and I'm a little confused about one place.
On lines 120 - 121 any existing CrawlDatum is used instead of the
newly injected one. This doesn't seem to make sense from my point of
view, though I'm guessing I just can't see the issue from the other
side. My scenario is this: when I inject a URL it is because I want it
re-indexed, maybe because it has changed. I don't care whether that
URL is already in the crawldb; I want it re-indexed. As far as I can
see, if this weren't the case I wouldn't be trying to inject it.
What am I missing here? Why is the existing CrawlDatum used instead of
the newly injected one?
That's indeed a place in Nutch that I have planned to change for a long
time ... This behavior is not obvious and, what's worse, it's undocumented.
It would be relatively simple to extend this behavior so that only
selected parts of data would be updated or replaced when a seed list
contains the same URL as the one already in CrawlDb.
For now, just add the code that you need in Injector.InjectReducer.
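
To make the idea concrete, here is a rough, self-contained sketch of the
kind of decision the reducer would have to make. It does not use the Nutch
or Hadoop classes: Datum is a simplified stand-in for CrawlDatum, and the
status values, the Policy enum and the merge() method are all hypothetical
names invented for this illustration. Treat it as a sketch under those
assumptions, not as the actual Injector code.

```java
import java.util.Arrays;
import java.util.List;

final class InjectMergeSketch {
    // Illustrative status codes only; NOT the real CrawlDatum constants.
    static final int STATUS_DB_UNFETCHED = 1;
    static final int STATUS_DB_FETCHED = 2;
    static final int STATUS_INJECTED = 3;

    /** Hypothetical policies for a URL that is injected and already in the db. */
    enum Policy { KEEP_EXISTING, REPLACE, RESET_FETCH_TIME }

    /** Simplified stand-in for CrawlDatum (assumption, not the Nutch class). */
    static final class Datum {
        final int status;
        final long fetchTime;
        final float score;

        Datum(int status, long fetchTime, float score) {
            this.status = status;
            this.fetchTime = fetchTime;
            this.score = score;
        }
    }

    /** Picks the entry to write back for one URL, given all values seen for it. */
    static Datum merge(List<Datum> values, Policy policy) {
        Datum old = null;
        Datum injected = null;
        for (Datum d : values) {
            if (d.status == STATUS_INJECTED) {
                // A freshly injected entry enters the db as "unfetched".
                injected = new Datum(STATUS_DB_UNFETCHED, d.fetchTime, d.score);
            } else {
                old = d;
            }
        }
        if (old == null) return injected;   // URL is new to the db
        if (injected == null) return old;   // URL was not in the seed list
        switch (policy) {
            case REPLACE:
                // The injected entry wins outright.
                return injected;
            case RESET_FETCH_TIME:
                // Update only a selected part: keep status and score,
                // take the fetch time from the injected entry.
                return new Datum(old.status, injected.fetchTime, old.score);
            default:
                // KEEP_EXISTING: the behavior described in the question.
                return old;
        }
    }

    public static void main(String[] args) {
        Datum existing = new Datum(STATUS_DB_FETCHED, 5000L, 2.5f);
        Datum seed = new Datum(STATUS_INJECTED, 0L, 1.0f);
        for (Policy p : Policy.values()) {
            Datum r = merge(Arrays.asList(existing, seed), p);
            System.out.println(p + " -> status=" + r.status
                + " fetchTime=" + r.fetchTime + " score=" + r.score);
        }
    }
}
```

In a real change the policy would probably come from the job configuration,
and the selective case would copy whichever fields you care about.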
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com