Re: Document Duplication for Multiple Segment Merge

Michael Ji Fri, 14 Oct 2005 10:26:39 -0700

hi Yonik:

Does that mean that when two documents in two different
segments have the same MD5 content hash, IndexMerger.java
will keep both of them?


When I look at the code of IndexSegment.java, it
handles MD5 deduplication by keeping the document with
the higher document ID.
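
A minimal sketch of the rule described above (keep the copy with the
higher document ID for each MD5 hash), written in plain Java rather
than against the Nutch or Lucene APIs. The class name Md5Dedup, the
Doc record, and the idsToDelete method are hypothetical names chosen
for illustration; this is not the actual Nutch code, only one way the
rule could be expressed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not taken from Nutch or Lucene.
final class Md5Dedup {

    /** Hypothetical record: a document ID paired with the MD5 hash of its content. */
    static final class Doc {
        final int docId;
        final String md5;

        Doc(int docId, String md5) {
            this.docId = docId;
            this.md5 = md5;
        }
    }

    /**
     * Returns the IDs of documents that should be deleted: every document
     * whose MD5 hash is shared with another document that has a higher ID.
     */
    static List<Integer> idsToDelete(List<Doc> docs) {
        // First pass: remember the highest document ID seen for each hash.
        Map<String, Integer> highestIdByHash = new HashMap<>();
        for (Doc d : docs) {
            highestIdByHash.merge(d.md5, d.docId, Integer::max);
        }

        // Second pass: anything that is not the highest ID for its hash is a duplicate.
        List<Integer> toDelete = new ArrayList<>();
        for (Doc d : docs) {
            if (highestIdByHash.get(d.md5).intValue() != d.docId) {
                toDelete.add(d.docId);
            }
        }
        return toDelete;
    }
}
```

Note that this only picks the "newest" copy if document IDs grow in
fetch order across the merged segments, which is an assumption here,
not something the thread confirms.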

So, when refetching happens, the old segment should be
discarded entirely. And a strategy must be devised so
that each segment corresponds to a fetchlist with the
same fetch interval. Is that how Nutch handles the
refetching case?


Michael Ji,

--- Yonik Seeley <[EMAIL PROTECTED]> wrote:

> There is no concept in Lucene of document identity linked
> to any fields of a document.
> You need to handle removal of duplicates yourself.
> 
> -Yonik
> 
> 
> On 10/14/05, Michael Ji <[EMAIL PROTECTED]> wrote:
> >
> > hi,
> >
> > When Nutch's IndexMerger.java is called, the indexes
> > from multiple segment directories are merged into one
> > target directory.
> >
> > I wonder how Lucene deals with the case when identical
> > documents exist in two segments. Is the older
> > document (lower timestamp) deleted?
> >
> > thanks,
> >
> > Michael Ji,



                