Chetan Sahasrabudhe wrote:
I do some processing on my master index.
"dedup" does not guarantee whether it will delete the newly inserted record 
or the old record.

I am looking for something that can be done through the Nutch API, and at 
merge time.

In case there is no ready-to-use way to do a selective merge, is there a 
way to iterate over all URLs in the index?
Something that might return an enumeration or array of all URLs.

I could iterate through the index A URL list and then decide whether to 
insert each URL or not.

Index A will be small enough to iterate through each URL, so I don't have a performance issue with that.

You can extend the SegmentMergeTool, specifically the code in the run() method around line 293 (the while loop).
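
To make the idea concrete, here is a minimal sketch of the kind of per-record decision such a merge loop could make. It uses plain Java collections only and does not call any Nutch API: the Doc class, the merge() helper and the "replaceExisting" rule are all hypothetical names made up for illustration, and the real SegmentMergeTool loop will look different.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiPredicate;

final class SelectiveMerge {

    /** Hypothetical stand-in for one indexed record; not a Nutch class. */
    static final class Doc {
        final String url;
        final long fetchTime;

        Doc(String url, long fetchTime) {
            this.url = url;
            this.fetchTime = fetchTime;
        }
    }

    /**
     * Merges "incoming" (the small index A) into "master", keyed by URL.
     * When a URL is already present, replaceExisting decides whether the
     * incoming record wins; otherwise the master record is kept.
     */
    static Map<String, Doc> merge(Iterable<Doc> master,
                                  Iterable<Doc> incoming,
                                  BiPredicate<Doc, Doc> replaceExisting) {
        Map<String, Doc> merged = new LinkedHashMap<>();
        for (Doc d : master) {
            merged.put(d.url, d);
        }
        // This loop plays the role of the while loop in the merge tool:
        // each incoming record is inspected and either inserted or skipped.
        for (Doc d : incoming) {
            Doc existing = merged.get(d.url);
            if (existing == null || replaceExisting.test(existing, d)) {
                merged.put(d.url, d);
            }
        }
        return merged;
    }
}
```

With a rule such as (old, fresh) -> fresh.fetchTime > old.fetchTime, the newer record always wins, which is the guarantee that "dedup" alone does not give. Any other rule can be plugged in the same way.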


--
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


