[ 
https://issues.apache.org/jira/browse/LUCENE-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12599609#action_12599609
 ] 

Jason Rutherglen commented on LUCENE-1292:
------------------------------------------

Terms that have many docs will store the docs + skiplist in blocks.  This 
avoids having to rewrite a docs + skiplist that may be many kilobytes long for 
an update that alters only some of the docs.  Only the blocks that change will 
be rewritten: they will be appended to the transaction log and the in-memory 
file pointers updated.  When the transaction log reaches a certain percentage 
of the size of the existing tag.tii file, the whole tag.tii file will be 
rewritten.
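
A rough sketch of the bookkeeping this implies, in case it helps the 
discussion.  Everything here is hypothetical (the class name 
BlockPointerTable, its methods, and the 0.25 ratio used in the usage note are 
mine, not part of any patch), and it assumes the transaction log is the 
tag.tlg file.  It only tracks where the live copy of each block is and when 
the log has grown enough to justify a full rewrite:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: tracks which file holds the live copy of each block. */
final class BlockPointerTable {

    /** Location of the current version of one block. */
    static final class Pointer {
        final boolean inLog;  // true: transaction log, false: tag.tii
        final long offset;    // byte offset within that file
        final int length;     // block length in bytes

        Pointer(boolean inLog, long offset, int length) {
            this.inLog = inLog;
            this.offset = offset;
            this.length = length;
        }
    }

    // term -> block number -> pointer to the live copy of that block
    private final Map<String, Map<Integer, Pointer>> pointers = new HashMap<>();
    private final long baseFileBytes;   // size of the existing tag.tii
    private final double rewriteRatio;  // assumed threshold, not a value from the issue
    private long logBytes = 0;          // bytes appended to the log so far

    BlockPointerTable(long baseFileBytes, double rewriteRatio) {
        this.baseFileBytes = baseFileBytes;
        this.rewriteRatio = rewriteRatio;
    }

    /** Registers a block that lives in the base tag.tii file. */
    void registerBaseBlock(String term, int block, long offset, int length) {
        pointers.computeIfAbsent(term, t -> new HashMap<>())
                .put(block, new Pointer(false, offset, length));
    }

    /**
     * Records that a new version of the block was appended to the log and
     * repoints the in-memory pointer at it.
     *
     * @return true when the log has reached the rewrite threshold
     */
    boolean appendUpdate(String term, int block, int length) {
        pointers.computeIfAbsent(term, t -> new HashMap<>())
                .put(block, new Pointer(true, logBytes, length));
        logBytes += length;
        return logBytes >= baseFileBytes * rewriteRatio;
    }

    /** Returns the live pointer for a block, or null if the block is unknown. */
    Pointer lookup(String term, int block) {
        Map<Integer, Pointer> blocks = pointers.get(term);
        return blocks == null ? null : blocks.get(block);
    }
}
```

With a 1000 byte tag.tii and a ratio of 0.25, appending updated blocks of 120 
and 130 bytes would trip the rewrite on the second append, while untouched 
blocks keep pointing into tag.tii.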

When a TermEnum is iterated, the in-memory alterations are consulted.  If, for 
example, a term no longer has any docs, the term is skipped.  The TermDocs 
iteration does the same by checking whether it should read the current block 
from the tag.tii or the tag.tlg file.  The block skipTo and iteration code 
functions the same as MultiTermDocs.
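
To illustrate the TermEnum side, here is a minimal sketch under some 
assumptions: the on-disk terms are modeled as a sorted map of term to docfreq, 
and the in-memory alterations as a map of overriding docfreqs.  The class name 
OverlayTermIterator is hypothetical, and terms that exist only in memory are 
not handled here.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.TreeMap;

/** Hypothetical sketch: iterates base terms, skipping those left with no docs. */
final class OverlayTermIterator implements Iterator<String> {
    private final Iterator<Map.Entry<String, Integer>> base;
    private final Map<String, Integer> alteredDocFreq;
    private String next;

    OverlayTermIterator(TreeMap<String, Integer> baseDocFreq,
                        Map<String, Integer> alteredDocFreq) {
        this.base = baseDocFreq.entrySet().iterator();
        this.alteredDocFreq = alteredDocFreq;
        advance();
    }

    /** Moves to the next term whose effective docfreq is greater than zero. */
    private void advance() {
        next = null;
        while (base.hasNext()) {
            Map.Entry<String, Integer> entry = base.next();
            // The in-memory alteration, if any, overrides the stored docfreq.
            int docFreq = alteredDocFreq.getOrDefault(entry.getKey(), entry.getValue());
            if (docFreq > 0) {
                next = entry.getKey();
                return;
            }
        }
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public String next() {
        if (next == null) {
            throw new NoSuchElementException();
        }
        String term = next;
        advance();
        return term;
    }
}
```

The TermDocs side would presumably apply the same idea per block rather than 
per term, choosing the file to read from instead of skipping.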

The concern is the optimal number of blocks per term and the effect on skipTo 
performance.  Because only two files are involved, the switching between files 
that may be an issue for MultiTermDocs skipTo over many segments should not be 
an issue here.  Seeks within the same file are faster than seeks across 
multiple files.

tag.tii
TermInfos --> <TermInfo>^TermCount
TagTermInfo --> <Term, DocFreq, NumBlocks>
Term --> <PrefixLength, Suffix, FieldNum, TermNumber>
BlockInfo --> <DocsBytesLength, SkipBytesLength,StartDoc,EndDoc>
Block --> <DocDeltas,SkipData>

tag.tlg

Term --> <TermString>
BlockInfo --> <DocsBytesLength, SkipBytesLength,StartDoc,EndDoc>
Block --> <DocDeltas,SkipData>
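
As a sketch of how a BlockInfo entry might be handled in code: the layout 
above names the fields but not their encoding, so the fixed four-byte ints 
below are purely an assumption (VInts would be the more Lucene-like choice), 
as is treating EndDoc as inclusive.  The class and the firstBlockFor helper 
are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.List;

/** Hypothetical sketch of a BlockInfo entry: <DocsBytesLength, SkipBytesLength, StartDoc, EndDoc>. */
final class BlockInfo {
    static final int BYTES = 16; // four fixed-width ints, an assumed encoding

    final int docsBytesLength;
    final int skipBytesLength;
    final int startDoc;
    final int endDoc; // assumed inclusive

    BlockInfo(int docsBytesLength, int skipBytesLength, int startDoc, int endDoc) {
        this.docsBytesLength = docsBytesLength;
        this.skipBytesLength = skipBytesLength;
        this.startDoc = startDoc;
        this.endDoc = endDoc;
    }

    /** Serializes the four fields in the order given by the format. */
    byte[] toBytes() {
        return ByteBuffer.allocate(BYTES)
                .putInt(docsBytesLength)
                .putInt(skipBytesLength)
                .putInt(startDoc)
                .putInt(endDoc)
                .array();
    }

    /** Reads the four fields back in the same order. */
    static BlockInfo fromBytes(byte[] bytes) {
        ByteBuffer in = ByteBuffer.wrap(bytes);
        return new BlockInfo(in.getInt(), in.getInt(), in.getInt(), in.getInt());
    }

    /**
     * Returns the index of the first block that could contain a doc at or
     * after target, or -1 if every block ends before it.  Blocks are assumed
     * to be sorted by doc id and non-overlapping.
     */
    static int firstBlockFor(List<BlockInfo> blocks, int target) {
        for (int i = 0; i < blocks.size(); i++) {
            if (blocks.get(i).endDoc >= target) {
                return i;
            }
        }
        return -1;
    }
}
```

Keeping StartDoc and EndDoc in the BlockInfo is what lets a skipTo pass over 
whole blocks without reading their DocDeltas or SkipData.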

> Tag Index
> ---------
>
>                 Key: LUCENE-1292
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1292
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.3.1
>            Reporter: Jason Rutherglen
>
> The problems the tag index solves are slow field cache loading and range 
> queries, and the need to reindex an entire document to update fields that are 
> not tokenized.  
> The tag index holds untokenized terms with a docfreq of 1 in a 
> term-dictionary-like index file.  The file also stores the docs per term, 
> similar to LUCENE-1278.  The index also has a transaction log and an 
> in-memory index for realtime updates to the tags.  The transaction log is 
> periodically merged into the existing tag term dictionary index file.
> The TagIndexReader extends IndexReader and is unified with a regular index by 
> ParallelReader.  There is a doc id to terms skip pointer file for the 
> IndexReader.document method.  This file contains a pointer for looking up the 
> terms for a document.  
> There is a higher level class that encapsulates writing a document with tag 
> fields to IndexWriter and TagIndexWriter.  This requires a hook into 
> IndexWriter to coordinate doc ids and flushing segments to disk.  
> The writer class could be as simple as:
> {code}
> public class TagIndexWriter {
>   
>   public void add(Term term, DocIdSetIterator iterator) {
>   }
>   
>   public void delete(Term term, DocIdSetIterator iterator) {
>   }
> }
> {code}

