I see what you mean by 2) now. What Mark said should work for you when
it's done.

Cheers,
Ning

On 10/18/07, John Wang <[EMAIL PROTECTED]> wrote:
> Hi Ning:
>     That is essentially what the FieldCache does. Doing this for each docid
> in the result set will be slow if the result set is large. But loading it
> into memory when opening the index can also be slow if the index is large
> and updated often.
>
> Thanks
>
> -John
>
> On 10/18/07, Ning Li <[EMAIL PROTECTED]> wrote:
> >
> > Make all documents have a term, say "ID:UID", and for each document,
> > store its UID in the term's payload. You can read off this posting
> > list to create your array. Will this work for you, John?
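
To make that concrete, the read side could look roughly like the sketch
below. The term "ID:UID", the 4-byte big-endian payload, and sizing the
array to maxDoc() are illustrative choices on my part; at index time each
document would add a single "ID:UID" token carrying its UID via
Token.setPayload(new Payload(uidBytes)).

    // Builds a docid -> UID array by walking the "ID:UID" posting list and
    // decoding each document's payload. Uses org.apache.lucene.index
    // IndexReader, Term and TermPositions.
    static int[] loadUids(IndexReader reader) throws IOException {
      int[] uids = new int[reader.maxDoc()];
      TermPositions tp = reader.termPositions(new Term("ID", "UID"));
      try {
        byte[] buf = new byte[4];
        while (tp.next()) {
          tp.nextPosition();                 // payloads hang off positions
          byte[] b = tp.getPayload(buf, 0);
          uids[tp.doc()] = ((b[0] & 0xFF) << 24) | ((b[1] & 0xFF) << 16)
                         | ((b[2] & 0xFF) << 8)  |  (b[3] & 0xFF);
        }
      } finally {
        tp.close();
      }
      return uids;
    }

Documents that lack the term simply keep 0 in the array, so you may want a
sentinel value if 0 is a legal UID.
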
> >
> > Cheers,
> > Ning
> >
> >
> > On 10/18/07, Erik Hatcher <[EMAIL PROTECTED]> wrote:
> > > Forwarding this to java-dev per request.  Seems like the best place
> > > to discuss this topic.
> > >
> > >         Erik
> > >
> > >
> > > Begin forwarded message:
> > >
> > > > From: "John Wang" <[EMAIL PROTECTED]>
> > > > Date: October 17, 2007 5:43:29 PM EDT
> > > > To: [EMAIL PROTECTED]
> > > > Subject: lucene indexing and merge process
> > > >
> > > > Hi Erik:
> > > >
> > > >     We are revamping our search system here at LinkedIn, and we are
> > > > using Lucene.
> > > >
> > > >     One issue we ran across is that we store a UID in Lucene which
> > > > we map to the DB storage. So given a docid, to look up its UID, we
> > > > have the following solutions:
> > > >
> > > > 1) Index it as a stored field and get it from reader.document() (very
> > > > slow if recall is large)
> > > > 2) Load/warm up the FieldCache (for a large corpus, loading up the
> > > > IndexReader can be slow)
> > > > 3) Construct it using the FieldCache and persist it on disk
> > > > every time the index changes (not suitable for real-time indexing,
> > > > since this process degrades as the number of documents gets large)
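
For concreteness, 1) and 2) boil down to roughly the following; the field
name "uid" and the shape of the helpers are just for illustration:

    // Option 1: stored field -- a document load per hit, hence slow when
    // recall is large.
    static int uidViaStoredField(IndexReader reader, int docid) throws IOException {
      return Integer.parseInt(reader.document(docid).get("uid"));
    }

    // Option 2: FieldCache -- the first call per IndexReader un-inverts the
    // field (the slow warm-up); subsequent calls are plain array lookups.
    static int uidViaFieldCache(IndexReader reader, int docid) throws IOException {
      int[] uids = FieldCache.DEFAULT.getInts(reader, "uid");  // cached per reader
      return uids[docid];
    }
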
> > > >
> > > >     None of the above solutions turn out to be adequate for our
> > > > requirements.
> > > >
> > > >      What we ended up doing is modifying the Lucene code, changing the
> > > > SegmentReader, DocumentWriter, and FieldWriter classes to take
> > > > advantage of the Lucene segment/merge process. E.g.:
> > > >
> > > >      For each segment, we store a .udt file, which is an int[]
> > > > array (by changing the FieldWriter class).
> > > >
> > > >      And SegmentReader will load the .udt file into an array.
> > > >
> > > >      And merging happens seamlessly.
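
If I'm reading this right, the per-segment load amounts to something like
the sketch below; the ".udt" extension is from your description, while the
variable names and the use of Directory/IndexInput are my guesses at how it
hooks into SegmentReader:

    // Reads one int per document straight off the segment's .udt file; the
    // resulting array gives O(1) docid -> UID lookup within that segment.
    static int[] loadUdt(Directory dir, String segment, int maxDoc) throws IOException {
      IndexInput in = dir.openInput(segment + ".udt");
      try {
        int[] uids = new int[maxDoc];
        for (int i = 0; i < maxDoc; i++) {
          uids[i] = in.readInt();
        }
        return uids;
      } finally {
        in.close();
      }
    }
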
> > > >
> > > >      Because of the tight encapsulation around these classes (e.g.
> > > > private and final methods), it is very difficult to extend Lucene
> > > > without branching into our own version. Is there a way we can
> > > > open up these classes and make them extensible? We'd be happy to
> > > > contribute what we have done.
> > > >
> > > >      I guess, to tackle the problem from a different angle: is there
> > > > a way to incorporate the FieldCache into the segments (it is strictly
> > > > in memory now) and build disk versions while indexing?
> > > >
> > > >
> > > >      Hope I am making sense.
> > > >
> > > >     I did not send this out to the mailing list because I wasn't
> > > > sure if this is a dev question or a user question; feel free to
> > > > either forward it to the right mailing list or let me know and I
> > > > can forward it.
> > > >
> > > >
> > > > Thanks
> > > >
> > > > -John
> > > >
> > >
> > >
> > >
> > >
> >
> >
> >
>

