[ https://issues.apache.org/jira/browse/LUCENE-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12599609#action_12599609 ]
Jason Rutherglen commented on LUCENE-1292:
------------------------------------------

Terms that have many docs will store the docs + skiplist in blocks. This avoids having to rewrite a large, multi-kilobyte docs + skiplist for an update that alters only some of the docs. Only the blocks that change will be updated: they will be appended to the transaction log and the in-memory file pointers updated. When the transaction log reaches a certain percentage of the size of the existing tag.tii file, the whole tag.tii file will be rewritten.

When a TermEnum iteration is performed, the in-memory alterations are consulted. If a term no longer has any docs, for example, the term is skipped. The TermDocs iteration does the same, checking whether it should read the current block from the tag.tii or the tag.tlg file. The block skipTo and iteration code functions the same as MultiTermDocs.

The concern is the optimal number of blocks per term and the effect on skipTo performance. Because only two files are involved, the switching between files that may be an issue for MultiTermDocs skipTo over many segments should not be an issue here. Seeks within the same file are faster than seeks across multiple files.

tag.tii
TermInfos --> <TermInfo>^TermCount
TagTermInfo --> <Term, DocFreq, NumBlocks>
Term --> <PrefixLength, Suffix, FieldNum, TermNumber>
BlockInfo --> <DocsBytesLength, SkipBytesLength, StartDoc, EndDoc>
Block --> <DocDeltas, SkipData>

tag.tlg
Term --> <TermString>
BlockInfo --> <DocsBytesLength, SkipBytesLength, StartDoc, EndDoc>
Block --> <DocDeltas, SkipData>

> Tag Index
> ---------
>
>                 Key: LUCENE-1292
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1292
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.3.1
>            Reporter: Jason Rutherglen
>
> The problem the tag index solves is slow field cache loading and range queries, and reindexing an entire document to update fields that are not tokenized.
>
> The tag index holds untokenized terms with a docfreq of 1 in a term-dictionary-like index file. The file also stores the docs per term, similar to LUCENE-1278. The index also has a transaction log and in-memory index for realtime updates to the tags. The transaction log is periodically merged into the existing tag term dictionary index file.
>
> The TagIndexReader extends IndexReader and is unified with a regular index by ParallelReader. There is a doc id to terms skip pointer file for the IndexReader.document method. This file contains a pointer for looking up the terms for a document.
>
> There is a higher level class that encapsulates writing a document with tag fields to IndexWriter and TagIndexWriter. This requires a hook into IndexWriter to coordinate doc ids and flushing segments to disk.
>
> The writer class could be as simple as:
> {code}
> public class TagIndexWriter {
>
>   public void add(Term term, DocIdSetIterator iterator) {
>   }
>
>   public void delete(Term term, DocIdSetIterator iterator) {
>   }
> }
> {code}
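
The block update scheme from the comment (append only the changed blocks to the transaction log, repoint the in-memory file pointers, rewrite the base file once the log grows past a fraction of its size) might look roughly like the Java sketch below. Everything in it is an assumption made for illustration: BlockedPostings, addTerm, updateBlock and rewriteRatio are hypothetical names, in-memory lists stand in for the tag.tii and tag.tlg files, and doc counts stand in for byte sizes. Skip data and delta encoding are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: per-term doc blocks in a base file plus an append-only log. */
final class BlockedPostings {
    /** In-memory pointer saying where the current copy of a block lives. */
    private static final class Pointer {
        final boolean inLog;
        final int index;
        Pointer(boolean inLog, int index) { this.inLog = inLog; this.index = index; }
    }

    private List<int[]> base = new ArrayList<>();       // stand-in for tag.tii
    private final List<int[]> log = new ArrayList<>();  // stand-in for tag.tlg
    private final Map<String, List<Pointer>> pointers = new TreeMap<>();
    private final double rewriteRatio; // rewrite when logSize >= ratio * baseSize
    private int baseSize;              // doc count, used here in place of bytes
    private int logSize;
    private int rewrites;

    BlockedPostings(double rewriteRatio) { this.rewriteRatio = rewriteRatio; }

    /** Writes a term's sorted doc ids to the base file in fixed-size blocks. */
    void addTerm(String term, int[] docs, int blockSize) {
        List<Pointer> ptrs = new ArrayList<>();
        for (int start = 0; start < docs.length; start += blockSize) {
            int end = Math.min(start + blockSize, docs.length);
            base.add(Arrays.copyOfRange(docs, start, end));
            baseSize += end - start;
            ptrs.add(new Pointer(false, base.size() - 1));
        }
        pointers.put(term, ptrs);
    }

    /** Replaces one block: append the new copy to the log, then repoint. */
    void updateBlock(String term, int blockNo, int[] newDocs) {
        log.add(newDocs.clone());
        logSize += newDocs.length;
        pointers.get(term).set(blockNo, new Pointer(true, log.size() - 1));
        if (logSize >= rewriteRatio * baseSize) {
            rewrite();
        }
    }

    /** Reads a term's docs, choosing the base file or the log per block. */
    List<Integer> docs(String term) {
        List<Integer> out = new ArrayList<>();
        for (Pointer p : pointers.getOrDefault(term, new ArrayList<>())) {
            for (int doc : resolve(p)) {
                out.add(doc);
            }
        }
        return out;
    }

    /** Enumerates terms in order, skipping any term that no longer has docs. */
    List<String> liveTerms() {
        List<String> out = new ArrayList<>();
        for (String term : pointers.keySet()) {
            if (!docs(term).isEmpty()) {
                out.add(term);
            }
        }
        return out;
    }

    int rewriteCount() { return rewrites; }

    private int[] resolve(Pointer p) {
        return (p.inLog ? log : base).get(p.index);
    }

    /** Merges the log into a fresh base file and clears the log. */
    private void rewrite() {
        List<int[]> newBase = new ArrayList<>();
        int newSize = 0;
        for (List<Pointer> ptrs : pointers.values()) {
            for (int i = 0; i < ptrs.size(); i++) {
                int[] block = resolve(ptrs.get(i));
                newBase.add(block);
                newSize += block.length;
                ptrs.set(i, new Pointer(false, newBase.size() - 1));
            }
        }
        base = newBase;
        baseSize = newSize;
        log.clear();
        logSize = 0;
        rewrites++;
    }
}
```

In this sketch a reader never needs to know whether a rewrite has happened, since it always goes through the in-memory pointers. The cost is one pointer lookup per block, which is where the number of blocks per term starts to matter for skipTo.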