Hi Michael: I took your program and benchmarked it against my setup; here are some numbers comparing it to the other options:
Setup: 2M docs with only the id, indexed in various ways for each method;
randomly selected 50,000 of the docids and did a lookup for each.

Comparing 3 methods:

1) load int[] from the FieldCache; new Field(fldname, uid, Store.NO,
   Index.TOKENIZED); Lucene 2.2
2) using payloads (your impl); Lucene 2.2
3) using our custom-built Lucene, described in my earlier email; no fields
   added, just doc.setUserID(uid), since we are adding our own segment file
   (custom Lucene 2.0)

Since all three methods load the docids into an int[], the lookup time is the
same for all three; what differs are the load times:

1) 16.5 seconds, 43 MB
2) 590 milliseconds, 32.5 MB
3) 186 milliseconds, 26 MB

I think the payload method is good enough that we don't need to diverge from
the Lucene code base. However, I feel that being able to customize the
indexing process and store our own file is still more efficient, in both load
time and index size.

Thanks

-John

On 10/20/07, Michael Busch <[EMAIL PROTECTED]> wrote:
>
> John Wang wrote:
> > Hi Michael:
> > Thanks for the info.
> >
> > I haven't played with payloads. Can you give me an example or point me
> > to how it is used to solve this problem?
> >
>
> Hi John,
>
> I (quickly) put together a class that is able to store UIDs as payloads.
> I believe the type of your UIDs is Integer?
>
> To add the UID to a document use this method:
>
> /**
>  * Adds a unique docId (UID) to a document as a payload.
>  */
> public void addUID(Document doc, int uid);
>
> You can either load the UID from disk or use a cache:
>
> /** Returns the UID for the passed-in docId.
>  *  If you use this method in a HitCollector and your Query
>  *  contains OR-terms, then try to set
>  *  BooleanQuery.setAllowDocsOutOfOrder(false) to improve performance.
>  */
> public int getUID(IndexReader reader, int docId) throws IOException;
>
> /** Fills the passed-in array with UID values.
>  *  If the given array is null or too small, then a new array
>  *  is created with cache.length = reader.maxDoc()
>  */
> public int[] getCachedIds(IndexReader reader, int[] cache, int offset)
>     throws IOException;
>
> I put a little test program in main and it seems to work fine. However,
> it's not thoroughly tested yet...
>
> You might want to try it without using the cache first. The performance
> might be good enough for your needs. If not, try using the cache; it
> should fill up much faster compared to the FieldCache.
>
> Another comment: if you're using Lucene 2.2, you need to replace the
> lines "tp.seek(UID_TERM);" (see comments in the code below). Or use the
> latest trunk version, which has a fix for this bug.
>
> Let me know please if this improves your performance! Have fun...
> - Michael
>
> And here is the code:
>
> import java.io.IOException;
>
> import org.apache.lucene.analysis.Token;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.document.Field.Index;
> import org.apache.lucene.document.Field.Store;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.index.Payload;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.index.TermPositions;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.RAMDirectory;
>
> public class PerDocPayloadReaderWriter {
>   public static final Term UID_TERM = new Term("_ID", "_UID");
>   private SinglePayloadTokenStream singlePayloadTokenStream =
>       new SinglePayloadTokenStream();
>   private TermPositions tp;
>   private byte[] payloadBuffer = new byte[4];
>
>   public static void main(String args[]) throws IOException {
>     PerDocPayloadReaderWriter pdp = new PerDocPayloadReaderWriter();
>     Directory dir = new RAMDirectory();
>     IndexWriter writer = new
>         IndexWriter(dir, new StandardAnalyzer());
>     for (int i = 0; i < 10; i++) {
>       Document d = new Document();
>       // create dummy doc
>       d.add(new Field("test", "This is a test.", Store.NO,
>           Index.TOKENIZED));
>       pdp.addUID(d, 100 + i);
>
>       writer.addDocument(d);
>     }
>     writer.close();
>
>     IndexReader reader = IndexReader.open(dir);
>     int[] uids = pdp.getCachedIds(reader, null, 0);
>
>     System.out.println("Caching:");
>     System.out.println("ID --> UID");
>
>     for (int i = 0; i < uids.length; i++) {
>       System.out.println(i + " --> " + uids[i]);
>     }
>
>     System.out.println("\nDirect access:");
>     System.out.println("ID --> UID");
>
>     for (int i = 0; i < uids.length; i++) {
>       System.out.println(i + " --> " + pdp.getUID(reader, i));
>     }
>     reader.close();
>   }
>
>   /** Fills the passed-in array with UID values. If the given array is
>    *  null or too small, then a new array is created with
>    *  cache.length = reader.maxDoc()
>    */
>   public int[] getCachedIds(IndexReader reader, int[] cache, int offset)
>       throws IOException {
>     int maxDoc = reader.maxDoc();
>     // allocate a new array if the passed-in one is null or too small
>     if (cache == null || cache.length - offset < maxDoc) {
>       cache = new int[maxDoc];
>       offset = 0;
>     }
>
>     if (tp == null) {
>       tp = reader.termPositions(UID_TERM);
>     } else {
>       // if using Lucene 2.2 replace the following line with
>       // tp = reader.termPositions(UID_TERM);
>       tp.seek(UID_TERM);
>     }
>
>     while (tp.next()) {
>       assert tp.doc() < maxDoc;
>       if (!reader.isDeleted(tp.doc())) {
>         tp.nextPosition();
>         tp.getPayload(payloadBuffer, 0);
>         cache[tp.doc() + offset] = bytesToInt(payloadBuffer);
>       }
>     }
>
>     return cache;
>   }
>
>   /** Returns the UID for the passed-in docId.
>    *  If you use this method in a HitCollector and your Query contains
>    *  OR-terms, then try to set
>    *  BooleanQuery.setAllowDocsOutOfOrder(false) to improve performance.
>    */
>   public int getUID(IndexReader reader, int docId) throws IOException {
>     if (tp == null) {
>       tp = reader.termPositions(UID_TERM);
>     } else if (tp.doc() > docId) {
>       // if using Lucene 2.2 replace the following line with
>       // tp = reader.termPositions(UID_TERM);
>       tp.seek(UID_TERM);
>     }
>
>     tp.skipTo(docId);
>     tp.nextPosition();
>     tp.getPayload(payloadBuffer, 0);
>
>     return bytesToInt(payloadBuffer);
>   }
>
>   /**
>    * Adds a unique docId (UID) to a document as a payload.
>    */
>   public void addUID(Document doc, int uid) {
>     singlePayloadTokenStream.setUID(uid);
>     doc.add(new Field(UID_TERM.field(), singlePayloadTokenStream));
>   }
>
>   private int bytesToInt(byte[] bytes) {
>     return ((bytes[3] & 0xFF) << 24) | ((bytes[2] & 0xFF) << 16)
>         | ((bytes[1] & 0xFF) << 8) | (bytes[0] & 0xFF);
>   }
>
>   private static class SinglePayloadTokenStream extends TokenStream {
>     private Token token = new Token(UID_TERM.text(), 0, 0);
>     private byte[] buffer = new byte[4];
>     private boolean returnToken = false;
>
>     void setUID(int uid) {
>       buffer[0] = (byte) (uid);
>       buffer[1] = (byte) (uid >> 8);
>       buffer[2] = (byte) (uid >> 16);
>       buffer[3] = (byte) (uid >> 24);
>       token.setPayload(new Payload(buffer));
>       returnToken = true;
>     }
>
>     public Token next() throws IOException {
>       if (returnToken) {
>         returnToken = false;
>         return token;
>       } else {
>         return null;
>       }
>     }
>   }
> }
>
>
> > Thanks
> >
> > -John
> >
> > On 10/19/07, Michael Busch <[EMAIL PROTECTED]> wrote:
> >> John Wang wrote:
> >>> I can try to get some numbers for loading an int[] array vs.
> >>> FieldCache.getInts().
> >> I've had a similar performance problem when I used the FieldCache. The
> >> loading performance is apparently so slow because each value is stored
> >> as a term in the dictionary. For loading the cache it is necessary to
> >> iterate over all terms for the field in the dictionary. And for each
> >> term, its posting list is opened to check which documents have that
> >> value.
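The byte packing used by setUID and bytesToInt in the listing above can be exercised in isolation; a minimal standalone round trip of that little-endian encoding (class and method names here are illustrative, not part of Michael's class):

```java
// Round-trips the little-endian int encoding used for the UID payload:
// setUID writes the low byte first, bytesToInt masks with 0xFF to undo
// the sign extension that happens when a byte is widened to an int.
public class UidBytes {
    static byte[] intToBytes(int uid) {
        return new byte[] {
            (byte) uid,          // low byte first (little-endian)
            (byte) (uid >> 8),
            (byte) (uid >> 16),
            (byte) (uid >> 24)
        };
    }

    static int bytesToInt(byte[] b) {
        return ((b[3] & 0xFF) << 24) | ((b[2] & 0xFF) << 16)
             | ((b[1] & 0xFF) << 8) | (b[0] & 0xFF);
    }

    public static void main(String[] args) {
        int[] samples = { 0, 100, 65535, -1,
                          Integer.MAX_VALUE, Integer.MIN_VALUE };
        for (int uid : samples) {
            int back = bytesToInt(intToBytes(uid));
            System.out.println(uid + " -> " + back);
        }
    }
}
```

The 0xFF masks matter: dropping them corrupts any UID whose intermediate bytes have the high bit set, because Java sign-extends bytes during promotion.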
> >>
> >> If you store unique docIds, then there are no two documents that share
> >> the same value. That means that each value gets its own entry in the
> >> dictionary, and to load each value it is necessary to perform two
> >> random I/O seeks (one for the term lookup + one to open the posting
> >> list).
> >>
> >> In my app it took several minutes to fill the cache like that for a
> >> big index.
> >>
> >> To speed things up I did essentially what Ning suggested. Now I store
> >> the values as payloads in the posting list of an artificial term. To
> >> fill my cache it's only necessary to perform a couple of I/O seeks to
> >> open the posting list of that specific term; then it is just a
> >> sequential scan to load all values. With this approach the time for
> >> filling the cache went down from minutes to seconds!
> >>
> >> Now this approach is already much better than the current FieldCache
> >> implementation, but it can still be improved. In fact, we already have
> >> a mechanism for doing that: the norms. Norms are stored with a fixed
> >> size, which means both random access and sequential scan are optimal.
> >> Norms are also cached in memory, and filling that cache is much faster
> >> compared to the current FieldCache approach.
> >>
> >> I was therefore thinking about adding per-document payloads to Lucene
> >> (we could also call it document metadata). The API could look like
> >> this:
> >>
> >> Document d = new Document();
> >> byte[] uidValue = ...
> >> d.addMetadata("uid", uidValue);
> >>
> >> And on the retrieval side all values could either be loaded into the
> >> field cache, or, if the index is too big, a new API could be used:
> >>
> >> IndexReader reader = IndexReader.open(...);
> >> DocumentMetadataIterator it = reader.metadataIterator("uid");
> >>
> >> where DocumentMetadataIterator is an interface similar to TermDocs:
> >>
> >> interface DocumentMetadataIterator {
> >>   void seek(String name);
> >>   boolean next();
> >>   boolean skipTo(int doc);
> >>
> >>   int doc();
> >>   byte[] getMetadata();
> >> }
> >>
> >> The next question is how to store the per-doc payloads (PDPs). If all
> >> values have the same length (as the unique docIds do), then we should
> >> store them as efficiently as possible, like the norms. However, we
> >> still want to offer the flexibility of having variable-length values.
> >> For this case we could use a new data structure similar to our posting
> >> list:
> >>
> >> PDPList               --> FixedLengthPDPList |
> >>                           <VariableLengthPDPList, SkipList>
> >> FixedLengthPDPList    --> <Payload>^SegSize
> >> VariableLengthPDPList --> <DocDelta, PayloadLength?, Payload>
> >> Payload               --> Byte^PayloadLength
> >> PayloadLength         --> VInt
> >> SkipList              --> see .frq file
> >>
> >> Because we don't have global field semantics, Lucene should
> >> automatically pick the "right" data structure. This could work like
> >> this: when the DocumentsWriter writes a segment it checks whether all
> >> values of a PDP have the same length. If yes, it stores them as a
> >> FixedLengthPDPList; if not, then as a VariableLengthPDPList.
> >> When the SegmentMerger merges two or more segments it checks whether
> >> all segments have a FixedLengthPDPList with the same length for a PDP.
> >> If not, it writes a VariableLengthPDPList to the new segment.
> >>
> >> I think this would be a nice new feature for Lucene. We could then
> >> have user-defined and Lucene-specific PDPs.
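The write-time and merge-time selection between the fixed- and variable-length layouts described above can be sketched standalone. This is a hypothetical illustration of the decision rule only, not Lucene code; the names (PdpLayout, choose, merge) are made up, and a negative length is used here to mark a variable-length segment:

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the proposed encoding decision: a segment gets the
// fixed-length layout only if every per-doc payload in it has the
// same length; a merged segment stays fixed only if all inputs are
// fixed with the same length.
public class PdpLayout {
    enum PdpEncoding { FIXED, VARIABLE }

    // DocumentsWriter side: inspect all payloads of one segment.
    static PdpEncoding choose(List<byte[]> payloads) {
        if (payloads.isEmpty()) return PdpEncoding.FIXED;
        int len = payloads.get(0).length;
        for (byte[] p : payloads) {
            if (p.length != len) return PdpEncoding.VARIABLE;
        }
        return PdpEncoding.FIXED;
    }

    // SegmentMerger side: lengths[i] is segment i's fixed payload
    // length, or -1 if that segment is variable-length.
    static PdpEncoding merge(int[] lengths) {
        int first = lengths[0];
        for (int l : lengths) {
            if (l < 0 || l != first) return PdpEncoding.VARIABLE;
        }
        return PdpEncoding.FIXED;
    }

    public static void main(String[] args) {
        List<byte[]> uids  = Arrays.asList(new byte[4], new byte[4]);
        List<byte[]> mixed = Arrays.asList(new byte[4], new byte[7]);
        System.out.println(choose(uids));   // all 4-byte UIDs
        System.out.println(choose(mixed));  // lengths differ
        System.out.println(merge(new int[] { 4, 4, 4 }));
        System.out.println(merge(new int[] { 4, -1 }));
    }
}
```

Note how this makes the choice purely local to a segment, which is exactly what the "no global field semantics" constraint requires: a field can be fixed-length in one segment and variable-length in another, and only merging forces them to agree.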
> >> For example, norms would be in the latter category (this way we would
> >> get rid of the special code for norms, as they could be handled as
> >> PDPs). It would also be easy to add new features in the future, like
> >> splitting the norms into two values: a norm and a boost value.
> >>
> >> OK, lots of thoughts; I'm sure I'll get lots of comments too ... ;)
> >>
> >> - Michael
> >>
> >>> Thanks
> >>>
> >>> -John
> >>>
> >>> On 10/19/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
> >>>> It seems like there are (at least) two angles here for getting
> >>>> better performance from the FieldCache:
> >>>>
> >>>> 1) Be incremental: with reopen() we should only have to update a
> >>>>    subset of the array in the FieldCache, according to the changed
> >>>>    segments. This is what Hoss is working on and Mark was referring
> >>>>    to, and I think it's very important!
> >>>>
> >>>> 2) Parsing is slow (?): I'm guessing one of the reasons that John
> >>>>    added the _X.udt file is that it's much faster to load an array
> >>>>    of already-parsed ints than to ask the FieldCache to populate
> >>>>    itself.
> >>>>
> >>>> Even if we do #1, I think #2 could be a big win (in addition)? John,
> >>>> do you have any numbers on how much faster it is to load the array
> >>>> of ints from the _X.udt file vs. having the FieldCache populate
> >>>> itself?
> >>>>
> >>>> Also, on the original question of "can we open up SegmentReader,
> >>>> FieldsWriter, etc.": I think that's a good idea? At least we can
> >>>> make things protected instead of private/final?
> >>>>
> >>>> Mike
> >>>>
> >>>> "Ning Li" <[EMAIL PROTECTED]> wrote:
> >>>>> I see what you mean by 2) now. What Mark said should work for you
> >>>>> when it's done.
> >>>>>
> >>>>> Cheers,
> >>>>> Ning
> >>>>>
> >>>>> On 10/18/07, John Wang <[EMAIL PROTECTED]> wrote:
> >>>>>> Hi Ning:
> >>>>>> That is essentially what the field cache does. Doing this for each
> >>>>>> docid in the result set will be slow if the result set is large.
> >>>>>> But loading
But loading > >> it > >>>> in > >>>>>> memory when opening index can also be slow if the index is large > and > >>>> updates > >>>>>> often. > >>>>>> > >>>>>> Thanks > >>>>>> > >>>>>> -John > >>>>>> > >>>>>> On 10/18/07, Ning Li <[EMAIL PROTECTED]> wrote: > >>>>>>> Make all documents have a term, say "ID:UID", and for each > document, > >>>>>>> store its UID in the term's payload. You can read off this posting > >>>>>>> list to create your array. Will this work for you, John? > >>>>>>> > >>>>>>> Cheers, > >>>>>>> Ning > >>>>>>> > >>>>>>> > >>>>>>> On 10/18/07, Erik Hatcher <[EMAIL PROTECTED]> wrote: > >>>>>>>> Forwarding this to java-dev per request. Seems like the best > >>>> place > >>>>>>>> to discuss this topic. > >>>>>>>> > >>>>>>>> Erik > >>>>>>>> > >>>>>>>> > >>>>>>>> Begin forwarded message: > >>>>>>>> > >>>>>>>>> From: "John Wang" <[EMAIL PROTECTED]> > >>>>>>>>> Date: October 17, 2007 5:43:29 PM EDT > >>>>>>>>> To: [EMAIL PROTECTED] > >>>>>>>>> Subject: lucene indexing and merge process > >>>>>>>>> > >>>>>>>>> Hi Erik: > >>>>>>>>> > >>>>>>>>> We are revamping our search system here at LinekdIn. And we > >>>> are > >>>>>>>>> using Lucene. > >>>>>>>>> > >>>>>>>>> One issue we ran across is that we store an UID in Lucene > >>>> which > >>>>>>>>> we map to the DB storage. So given a docid, to lookup its UID, > >>>> we > >>>>>>>>> have the following solutions: > >>>>>>>>> > >>>>>>>>> 1) Index it as a Stored field and get it from reader.document > (very > >>>>>>>>> slow if recall is large) > >>>>>>>>> 2) Load/Warmup the FieldCache (for large corpus, loading up the > >>>>>>>>> indexreader can be slow) > >>>>>>>>> 3) construct it using the FieldCache and persist it on disk > >>>>>>>>> everytime the index changes. (not suitable for real time > >>>> indexing, > >>>>>>>>> e.g. this process will degrade as # of documents get large) > >>>>>>>>> > >>>>>>>>> None of the above solutions turn out to be adequate for our > >>>>>>>>> requirements. 
> >>>>>>>>>
> >>>>>>>>> What we ended up doing is modifying the Lucene code, changing
> >>>>>>>>> the SegmentReader, DocumentWriter, and FieldWriter classes to
> >>>>>>>>> take advantage of the Lucene segment/merge process. E.g.:
> >>>>>>>>>
> >>>>>>>>> For each segment, we store a .udt file, which is an int[]
> >>>>>>>>> array (by changing the FieldWriter class).
> >>>>>>>>>
> >>>>>>>>> And SegmentReader will load the .udt file into an array.
> >>>>>>>>>
> >>>>>>>>> And merging happens seamlessly.
> >>>>>>>>>
> >>>>>>>>> Because of the tight encapsulation around these classes, e.g.
> >>>>>>>>> private and final methods, it is very difficult to extend
> >>>>>>>>> Lucene without branching into our own version. Is there a way
> >>>>>>>>> we can open up and make these classes extensible? We'd be happy
> >>>>>>>>> to contribute what we have done.
> >>>>>>>>>
> >>>>>>>>> I guess to tackle the problem from a different angle: is there
> >>>>>>>>> a way to incorporate the FieldCache into the segments (it is
> >>>>>>>>> strictly in memory now) and build disk versions while indexing?
> >>>>>>>>>
> >>>>>>>>> Hope I am making sense.
> >>>>>>>>>
> >>>>>>>>> I did not send this out to the mailing list because I wasn't
> >>>>>>>>> sure if this is a dev question or a user question; feel free to
> >>>>>>>>> either forward it to the right mailing list or let me know and
> >>>>>>>>> I can forward it.
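The per-segment .udt idea described above amounts to persisting an already-parsed int[] so the reader can load it without walking the term dictionary. A minimal standalone sketch of that serialization, with the caveat that the file name and layout (a count followed by raw ints) are assumptions for illustration, not LinkedIn's actual format:

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Arrays;

// Writes and reads a ".udt"-style sidecar file: one int count,
// then the raw UID values. Loading is a single sequential read,
// which is why it beats repopulating a FieldCache from terms.
public class UdtFile {
    static void write(File f, int[] uids) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(f)))) {
            out.writeInt(uids.length);
            for (int uid : uids) out.writeInt(uid);
        }
    }

    static int[] read(File f) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(f)))) {
            int[] uids = new int[in.readInt()];
            for (int i = 0; i < uids.length; i++) uids[i] = in.readInt();
            return uids;
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("seg", ".udt");
        write(f, new int[] { 100, 101, 102 });
        System.out.println(Arrays.toString(read(f)));
        f.delete();
    }
}
```

This also illustrates McCandless's point #2 above: the file stores already-parsed ints, so "loading" is just a bulk read rather than term iteration plus string-to-int parsing per value.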
> >>>>>>>>>
> >>>>>>>>> Thanks
> >>>>>>>>>
> >>>>>>>>> -John
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]