It seems like there are (at least) two angles here for getting better performance from FieldCache:
  1) Be incremental: with reopen() we should only have to update a
     subset of the array in the FieldCache, according to the changed
     segments. This is what Hoss is working on and Mark was referring
     to, and I think it's very important!

  2) Parsing is slow (?): I'm guessing one of the reasons that John
     added the _X.udt file was because it's much faster to load an
     array of already-parsed ints than to ask FieldCache to populate
     itself.

Even if we do #1, I think #2 could be a big win (in addition)? John, do
you have any numbers on how much faster it is to load the array of ints
from the _X.udt file vs. having FieldCache populate itself?

Also, on the original question of "can we open up SegmentReader,
FieldsWriter, etc.", I think that's a good idea? At least we can make
things protected instead of private/final?

Mike

"Ning Li" <[EMAIL PROTECTED]> wrote:

> I see what you mean by 2) now. What Mark said should work for you when
> it's done.
>
> Cheers,
> Ning
>
> On 10/18/07, John Wang <[EMAIL PROTECTED]> wrote:
> > Hi Ning:
> >
> > That is essentially what field cache does. Doing this for each docid in
> > the result set will be slow if the result set is large. But loading it in
> > memory when opening the index can also be slow if the index is large and
> > is updated often.
> >
> > Thanks
> >
> > -John
> >
> > On 10/18/07, Ning Li <[EMAIL PROTECTED]> wrote:
> > >
> > > Make all documents have a term, say "ID:UID", and for each document,
> > > store its UID in the term's payload. You can read off this posting
> > > list to create your array. Will this work for you, John?
> > >
> > > Cheers,
> > > Ning
> > >
> > >
> > > On 10/18/07, Erik Hatcher <[EMAIL PROTECTED]> wrote:
> > > > Forwarding this to java-dev per request. Seems like the best place
> > > > to discuss this topic.
> > > >
> > > > Erik
> > > >
> > > >
> > > > Begin forwarded message:
> > > >
> > > > > From: "John Wang" <[EMAIL PROTECTED]>
> > > > > Date: October 17, 2007 5:43:29 PM EDT
> > > > > To: [EMAIL PROTECTED]
> > > > > Subject: lucene indexing and merge process
> > > > >
> > > > > Hi Erik:
> > > > >
> > > > > We are revamping our search system here at LinkedIn. And we are
> > > > > using Lucene.
> > > > >
> > > > > One issue we ran across is that we store a UID in Lucene which
> > > > > we map to the DB storage. So given a docid, to look up its UID, we
> > > > > have the following solutions:
> > > > >
> > > > > 1) Index it as a Stored field and get it from reader.document (very
> > > > > slow if recall is large)
> > > > > 2) Load/Warmup the FieldCache (for large corpus, loading up the
> > > > > indexreader can be slow)
> > > > > 3) Construct it using the FieldCache and persist it on disk
> > > > > every time the index changes. (not suitable for real time indexing,
> > > > > e.g. this process will degrade as # of documents gets large)
> > > > >
> > > > > None of the above solutions turns out to be adequate for our
> > > > > requirements.
> > > > >
> > > > > What we ended up doing is modifying the Lucene code by changing the
> > > > > SegmentReader, DocumentWriter, and FieldWriter classes, taking
> > > > > advantage of the Lucene segment/merge process. E.g.:
> > > > >
> > > > > For each segment, we store a .udt file, which is an int[]
> > > > > array (by changing the FieldWriter class).
> > > > >
> > > > > And SegmentReader will load the .udt file into an array.
> > > > >
> > > > > And merge happens seamlessly.
> > > > >
> > > > > Because of the tight encapsulation around these classes, e.g.
> > > > > private and final methods, it is very difficult to extend Lucene
> > > > > while avoiding branching into our own version. Is there a way we can
> > > > > open up and make these classes extensible? We'd be happy to
> > > > > contribute what we have done.
> > > > >
> > > > > I guess to tackle the problem from a different angle: is there
> > > > > a way to incorporate FieldCache into the segments (it is strictly
> > > > > in memory now), and build disk versions while indexing?
> > > > >
> > > > >
> > > > > Hope I am making sense.
> > > > >
> > > > > I did not send this out to the mailing list because I wasn't
> > > > > sure if this is a dev question or a user question. Feel free to
> > > > > either forward it to the right mailing list or let me know and I
> > > > > can forward it.
> > > > >
> > > > >
> > > > > Thanks
> > > > >
> > > > > -John
> > > > >
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > > For additional commands, e-mail: [EMAIL PROTECTED]
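
To make the "load an array of already-parsed ints" idea from this thread concrete, here is a minimal sketch in plain Java. Everything in it is an assumption for illustration: the class name UidArrayFile, the length-prefixed big-endian layout, and the temp file are hypothetical, and this is neither John's actual .udt format nor any Lucene API. It only shows that reading the array back is a straight sequence of readInt() calls, with no term enumeration or string parsing.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Lucene: stores one int per docid in a flat binary file.
final class UidArrayFile {
    private UidArrayFile() {}

    /** Writes a 4-byte count followed by 4 bytes per value (big-endian, per DataOutputStream). */
    static void write(Path file, int[] uids) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeInt(uids.length);
            for (int uid : uids) {
                out.writeInt(uid);
            }
        }
    }

    /** Reads the array back; the array index plays the role of the docid within a segment. */
    static int[] read(Path file) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            int[] uids = new int[in.readInt()];
            for (int docId = 0; docId < uids.length; docId++) {
                uids[docId] = in.readInt();
            }
            return uids;
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed file name; a real per-segment file would live next to the segment's other files.
        Path file = Files.createTempFile("segment", ".udt");
        try {
            write(file, new int[] {1001, 1002, 1003});
            int[] loaded = read(file);
            System.out.println(loaded.length + " UIDs loaded, docid 1 -> " + loaded[1]);
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Whether this is actually faster than letting FieldCache populate itself is exactly the measurement asked for above; the sketch makes no claim about the numbers.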