Hi Michael: Thanks for the info. I haven't played with payloads. Can you give me an example or point me to how it is used to solve this problem?

Thanks

-John

On 10/19/07, Michael Busch <[EMAIL PROTECTED]> wrote:
>
> John Wang wrote:
> >
> > I can try to get some numbers for loading an int[] array vs
> > FieldCache.getInts().
>
> I've had a similar performance problem when I used the FieldCache. The
> loading performance is apparently so slow because each value is stored
> as a term in the dictionary. To load the cache it is necessary to
> iterate over all terms for the field in the dictionary, and for each
> term its posting list is opened to check which documents have that value.
>
> If you store unique docIds, then there are no two documents that share
> the same value. That means that each value gets its own entry in the
> dictionary, and to load each value it is necessary to perform two random
> I/O seeks (one for the term lookup + one to open the posting list).
>
> In my app it took several minutes to fill the cache like that for a big
> index.
>
> To speed things up I did essentially what Ning suggested. Now I store
> the values as payloads in the posting list of an artificial term. To
> fill my cache it's only necessary to perform a couple of I/O seeks for
> opening the posting list of the specific term, then it is just a
> sequential scan to load all values. With this approach the time for
> filling the cache went down from minutes to seconds!
>
> Now this approach is already much better than the current field cache
> implementation, but it still can be improved. In fact, we already have a
> mechanism for doing that: the norms. Norms are stored with a fixed size,
> which means both random access and sequential scan are optimal. Norms
> are also cached in memory, and filling that cache is much faster
> compared to the current FieldCache approach.
>
> I was therefore thinking about adding per-document payloads to Lucene
> (we can also call it document-metadata). The API could look like this:
>
>   Document d = new Document();
>   byte[] uidValue = ...
>   d.addMetadata("uid", uidValue);
>
> And on the retrieval side all values could either be loaded into the
> field cache, or, if the index is too big, a new API can be used:
>
>   IndexReader reader = IndexReader.open(...);
>   DocumentMetadataIterator it = reader.metadataIterator("uid");
>
> where DocumentMetadataIterator is an interface similar to TermDocs:
>
>   interface DocumentMetadataIterator {
>     void seek(String name);
>     boolean next();
>     boolean skipTo(int doc);
>
>     int doc();
>     byte[] getMetadata();
>   }
>
> The next question would be how to store the per-doc payloads (PDP). If
> all values have the same length (as the unique docIds), then we should
> store them as efficiently as possible, like the norms. However, we still
> want to offer the flexibility of having variable-length values. For this
> case we could use a new data structure similar to our posting list.
>
>   PDPList               --> FixedLengthPDPList | <VariableLengthPDPList, SkipList>
>   FixedLengthPDPList    --> <Payload>^SegSize
>   VariableLengthPDPList --> <DocDelta, PayloadLength?, Payload>
>   Payload               --> Byte^PayloadLength
>   PayloadLength         --> VInt
>   SkipList              --> see frq.file
>
> Because we don't have global field semantics, Lucene should automatically
> pick the "right" data structure. This could work like this: When the
> DocumentsWriter writes a segment it checks whether all values of a PDP
> have the same length. If yes, it stores them as a FixedLengthPDPList; if
> not, then as a VariableLengthPDPList.
> When the SegmentMerger merges two or more segments it checks if all
> segments have a FixedLengthPDPList with the same length for a PDP. If
> not, it writes a VariableLengthPDPList to the new segment.
>
> I think this would be a nice new feature for Lucene. We could then have
> user-defined and Lucene-specific PDPs. For example, norms would be in
> the latter category (this way we would get rid of the special code for
> norms, as they could be handled as PDPs).
> It would also be easy to add new features in the future, like splitting
> the norms into two values: a norm and a boost value.
>
> OK, lots of thoughts, I'm sure I'll get lots of comments too ... ;)
>
> - Michael
>
> >
> > Thanks
> >
> > -John
> >
> > On 10/19/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
> >>
> >> It seems like there are (at least) two angles here for getting better
> >> performance from FieldCache:
> >>
> >>   1) Be incremental: with reopen() we should only have to update a
> >>      subset of the array in the FieldCache, according to the changed
> >>      segments. This is what Hoss is working on and Mark was referring
> >>      to and I think it's very important!
> >>
> >>   2) Parsing is slow (?): I'm guessing one of the reasons that John
> >>      added the _X.udt file was because it's much faster to load an
> >>      array of already-parsed ints than to ask FieldCache to populate
> >>      itself.
> >>
> >> Even if we do #1, I think #2 could be a big win (in addition)? John,
> >> do you have any numbers on how much faster it is to load the array of
> >> ints from the _X.udt file vs having FieldCache populate itself?
> >>
> >> Also on the original question of "can we open up SegmentReader,
> >> FieldsWriter, etc.", I think that's a good idea? At least we can make
> >> things protected instead of private/final?
> >>
> >> Mike
> >>
> >> "Ning Li" <[EMAIL PROTECTED]> wrote:
> >>> I see what you mean by 2) now. What Mark said should work for you when
> >>> it's done.
> >>>
> >>> Cheers,
> >>> Ning
> >>>
> >>> On 10/18/07, John Wang <[EMAIL PROTECTED]> wrote:
> >>>> Hi Ning:
> >>>>     That is essentially what field cache does. Doing this for each
> >>>> docid in the result set will be slow if the result set is large. But
> >>>> loading it in memory when opening the index can also be slow if the
> >>>> index is large and updates often.
> >>>>
> >>>> Thanks
> >>>>
> >>>> -John
> >>>>
> >>>> On 10/18/07, Ning Li <[EMAIL PROTECTED]> wrote:
> >>>>> Make all documents have a term, say "ID:UID", and for each document,
> >>>>> store its UID in the term's payload. You can read off this posting
> >>>>> list to create your array. Will this work for you, John?
> >>>>>
> >>>>> Cheers,
> >>>>> Ning
> >>>>>
> >>>>>
> >>>>> On 10/18/07, Erik Hatcher <[EMAIL PROTECTED]> wrote:
> >>>>>> Forwarding this to java-dev per request. Seems like the best place
> >>>>>> to discuss this topic.
> >>>>>>
> >>>>>>     Erik
> >>>>>>
> >>>>>>
> >>>>>> Begin forwarded message:
> >>>>>>
> >>>>>>> From: "John Wang" <[EMAIL PROTECTED]>
> >>>>>>> Date: October 17, 2007 5:43:29 PM EDT
> >>>>>>> To: [EMAIL PROTECTED]
> >>>>>>> Subject: lucene indexing and merge process
> >>>>>>>
> >>>>>>> Hi Erik:
> >>>>>>>
> >>>>>>>     We are revamping our search system here at LinkedIn, and we
> >>>>>>> are using Lucene.
> >>>>>>>
> >>>>>>>     One issue we ran across is that we store a UID in Lucene which
> >>>>>>> we map to the DB storage. So given a docid, to look up its UID, we
> >>>>>>> have the following solutions:
> >>>>>>>
> >>>>>>> 1) Index it as a Stored field and get it from reader.document (very
> >>>>>>> slow if recall is large)
> >>>>>>> 2) Load/warm up the FieldCache (for a large corpus, loading up the
> >>>>>>> IndexReader can be slow)
> >>>>>>> 3) Construct it using the FieldCache and persist it on disk
> >>>>>>> every time the index changes (not suitable for real-time indexing,
> >>>>>>> e.g. this process will degrade as the # of documents gets large)
> >>>>>>>
> >>>>>>> None of the above solutions turned out to be adequate for our
> >>>>>>> requirements.
> >>>>>>>
> >>>>>>>     What we ended up doing is modifying the Lucene code by changing
> >>>>>>> the SegmentReader, DocumentWriter, and FieldWriter classes to take
> >>>>>>> advantage of the Lucene segment/merge process.
> >>>>>>> E.g.:
> >>>>>>>
> >>>>>>>     For each segment, we store a .udt file, which is an int[]
> >>>>>>> array (by changing the FieldWriter class).
> >>>>>>>
> >>>>>>>     And SegmentReader will load the .udt file into an array.
> >>>>>>>
> >>>>>>>     And merge happens seamlessly.
> >>>>>>>
> >>>>>>>     Because of the tight encapsulation around these classes, e.g.
> >>>>>>> private and final methods, it is very difficult to extend Lucene
> >>>>>>> while avoiding branching into our own version. Is there a way we
> >>>>>>> can open up and make these classes extensible? We'd be happy to
> >>>>>>> contribute what we have done.
> >>>>>>>
> >>>>>>>     I guess to tackle the problem from a different angle: is there
> >>>>>>> a way to incorporate FieldCache into the segments (it is strictly
> >>>>>>> in memory now), and build disk versions while indexing?
> >>>>>>>
> >>>>>>>
> >>>>>>>     Hope I am making sense.
> >>>>>>>
> >>>>>>>     I did not send this out to the mailing list because I wasn't
> >>>>>>> sure if this is a dev question or a user question; feel free to
> >>>>>>> either forward it to the right mailing list or let me know and I
> >>>>>>> can forward it.
> >>>>>>>
> >>>>>>>
> >>>>>>> Thanks
> >>>>>>>
> >>>>>>> -John
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>> ---------------------------------------------------------------------
> >>>>>> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >>>>>> For additional commands, e-mail: [EMAIL PROTECTED]
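
To make the per-document payload (PDP) layouts proposed in this thread a bit more concrete, here is a minimal, self-contained Java sketch of the two list formats and the "pick the right structure" rule. This is not Lucene code: the class PdpSketch and every method name in it are made up for illustration. It also simplifies the proposal in two ways, both assumptions of this sketch: the SkipList is left out, and PayloadLength is always written, although the grammar marks it optional.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

/** Hypothetical sketch of the proposed PDP layouts; names are illustrative, not Lucene API. */
final class PdpSketch {

    /** Writer-side check: true if every document has a value and all values share one length. */
    static boolean isFixedLength(byte[][] payloads) {
        if (payloads.length == 0 || payloads[0] == null) return false;
        int len = payloads[0].length;
        for (byte[] p : payloads) {
            if (p == null || p.length != len) return false;
        }
        return true;
    }

    /** FixedLengthPDPList --> <Payload>^SegSize : the values are simply concatenated. */
    static byte[] writeFixed(byte[][] payloads) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] p : payloads) out.write(p, 0, p.length);
        return out.toByteArray();
    }

    /** Random access into a fixed-length list: the offset is docId * valueLength. */
    static byte[] readFixed(byte[] data, int valueLength, int docId) {
        int start = docId * valueLength;
        return Arrays.copyOfRange(data, start, start + valueLength);
    }

    /** VariableLengthPDPList --> <DocDelta, PayloadLength, Payload>; docs without a value are skipped. */
    static byte[] writeVariable(byte[][] payloads) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int lastDoc = 0;
        for (int doc = 0; doc < payloads.length; doc++) {
            byte[] p = payloads[doc];
            if (p == null) continue;
            writeVInt(out, doc - lastDoc); // DocDelta
            writeVInt(out, p.length);      // PayloadLength (always written in this sketch)
            out.write(p, 0, p.length);     // Payload
            lastDoc = doc;
        }
        return out.toByteArray();
    }

    /** Sequential scan of a variable-length list; returns null if docId has no value. */
    static byte[] readVariable(byte[] data, int docId) {
        int pos = 0;
        int doc = 0;
        while (pos < data.length) {
            int[] delta = readVInt(data, pos);
            doc += delta[0];
            int[] length = readVInt(data, delta[1]);
            pos = length[1];
            if (doc == docId) return Arrays.copyOfRange(data, pos, pos + length[0]);
            if (doc > docId) return null;
            pos += length[0];
        }
        return null;
    }

    /** Merge rule: stay fixed-length only if every segment is fixed (not -1) with the same length. */
    static boolean mergedIsFixed(int[] segmentValueLengths) {
        if (segmentValueLengths.length == 0) return false;
        int first = segmentValueLengths[0];
        for (int len : segmentValueLengths) {
            if (len < 0 || len != first) return false;
        }
        return true;
    }

    /** VInt: 7 bits per byte, low-order bits first, high bit set when more bytes follow. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Returns {decoded value, position just after the VInt}. */
    static int[] readVInt(byte[] data, int pos) {
        int value = 0;
        int shift = 0;
        byte b;
        do {
            b = data[pos++];
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return new int[] { value, pos };
    }

    public static void main(String[] args) {
        byte[][] uids = { {0, 0, 0, 7}, {0, 0, 0, 9} };
        byte[] data = isFixedLength(uids) ? writeFixed(uids) : writeVariable(uids);
        System.out.println(Arrays.toString(readFixed(data, 4, 1)));
    }
}
```

The fixed-length form needs no per-document header, so a lookup is a single offset computation, which is the property that makes norms cheap to load. The variable-length form pays for its flexibility with a sequential scan, which is presumably where the SkipList in the proposal would come in.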