Hi Yonik: Does that mean that when two documents in two different segments have the same MD5 content hash, IndexMerger.java will keep both of them?
Looking at the code of IndexSegment.java, I see that it handles MD5 deduplication by keeping the document with the higher document ID. So when refetching happens, the old segment should be discarded entirely, and a strategy is needed so that each segment corresponds to a fetchlist with the same time interval. Is that how Nutch handles the refetching case?

Michael Ji

--- Yonik Seeley <[EMAIL PROTECTED]> wrote:

> There is no concept in Lucene of document identity linked to any fields of a document.
> You need to handle removal of duplicates yourself.
>
> -Yonik
> Now hiring -- http://tinyurl.com/7m67g
>
> On 10/14/05, Michael Ji <[EMAIL PROTECTED]> wrote:
> >
> > hi,
> >
> > When Nutch's IndexMerger.java is called, the indexes from multiple segment directories are merged to one target directory.
> >
> > I wonder how lucene deals with the case when identical documents exist in two segments. Is the older document ( lower time stamp ) deleted?
> >
> > thanks,
> >
> > Michael Ji,
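
To make the question concrete, here is a minimal sketch in plain Java of the "keep the document with the higher document ID" rule I described for MD5 duplicates. It is not the actual Nutch or Lucene code: the class name Md5DedupSketch, the method docsToDelete, and the list-of-hashes input are all hypothetical. It also assumes that, in the merged index, a more recently fetched copy ends up with a higher document ID than an older one, which is exactly the part I am unsure about.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not taken from Nutch or Lucene.
public class Md5DedupSketch {

    /**
     * md5ByDocId.get(i) is the MD5 content hash of the document with ID i.
     * Returns the IDs to delete: for every hash that occurs more than once,
     * all documents except the one with the highest ID.
     */
    public static List<Integer> docsToDelete(List<String> md5ByDocId) {
        // Maps each hash to the highest document ID seen so far for it.
        Map<String, Integer> lastSeen = new HashMap<>();
        List<Integer> toDelete = new ArrayList<>();
        for (int docId = 0; docId < md5ByDocId.size(); docId++) {
            // put() returns the previous ID stored for this hash, if any.
            Integer older = lastSeen.put(md5ByDocId.get(docId), docId);
            if (older != null) {
                toDelete.add(older); // the lower ID loses
            }
        }
        return toDelete;
    }

    public static void main(String[] args) {
        // Documents 0 and 2 share a hash, as do documents 1 and 4.
        List<String> hashes = Arrays.asList("aaa", "bbb", "aaa", "ccc", "bbb");
        System.out.println(docsToDelete(hashes)); // prints [0, 1]
    }
}
```

If document IDs do not follow fetch order after a merge, this rule could keep the stale copy, which is why the relation between segments and fetchlists matters.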