[ 
https://issues.apache.org/jira/browse/LUCENE-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615615#action_12615615
 ] 

Jason Rutherglen commented on LUCENE-1292:
------------------------------------------

I decided to rework this to not use a transaction log.  One reason is that a 
transaction log should be global rather than per segment.  The other reason is 
that the current architecture read directly from the transaction log.  Given 
the small size of the postings, I decided it would be best to keep the changed 
posting blocks in memory.  This way there is no performance hit from the 
realtime update feature.  When RAM usage hits a specified threshold, a new 
term + postings file can be written.  
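
To make the buffering concrete, here is a minimal sketch of that idea.  The 
class name and the per-doc byte estimate are hypothetical, not from the patch; 
it only shows accumulating changed posting blocks per term in memory and 
signaling when estimated RAM crosses the flush threshold:

{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: buffer changed posting blocks in memory and
// report when estimated RAM usage reaches the flush threshold.
class PostingBlockBuffer {
  private final long ramThresholdBytes;
  private final Map<String, List<int[]>> blocksByTerm = new HashMap<>();
  private long ramUsedBytes = 0;

  PostingBlockBuffer(long ramThresholdBytes) {
    this.ramThresholdBytes = ramThresholdBytes;
  }

  /** Adds a changed postings block for a term; returns true when a flush is due. */
  boolean addBlock(String term, int[] docIds) {
    blocksByTerm.computeIfAbsent(term, t -> new ArrayList<>()).add(docIds);
    ramUsedBytes += 4L * docIds.length; // rough estimate: 4 bytes per doc id
    return ramUsedBytes >= ramThresholdBytes;
  }

  long ramUsed() {
    return ramUsedBytes;
  }
}
{code}

When addBlock returns true, the caller would write the buffered terms + 
postings out to a new file and reset the buffer.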

One issue is how to implement skipTo over the blocks.  The postings blocks use 
the SkipListWriter implementation.  The question is what the best way is to 
skip over the blocks themselves.  One approach is the modified binary search 
from InstantiatedIndex described at http://ochafik.free.fr/blog/?p=106.  The 
other is to implement a skip list over the blocks; however, I am not sure 
SkipListWriter will work given that it is tied to file pointers.  If 
SkipListWriter were used, would it be best to have it skip over the blocks 
using only the maxDoc of each block?  That is the approach the binary search 
method would use. 
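
For comparison, the binary search option can be sketched as follows.  The 
class and method names are hypothetical; given each block's maxDoc in 
ascending order, it finds the first block whose maxDoc is at least the skipTo 
target, i.e. the block that could contain the target document:

{code}
// Hypothetical sketch of the binary-search alternative for skipTo:
// locate the first block whose maxDoc >= targetDoc.
class BlockSkipper {
  static int findBlock(int[] blockMaxDocs, int targetDoc) {
    int lo = 0, hi = blockMaxDocs.length - 1, result = -1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (blockMaxDocs[mid] >= targetDoc) {
        result = mid;  // candidate block; keep searching to the left
        hi = mid - 1;
      } else {
        lo = mid + 1;
      }
    }
    return result;  // -1 when targetDoc is past the last block
  }
}
{code}

Once the block is found, the existing SkipListWriter data inside the block 
would take over for the fine-grained skip.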

At the beginning of each term's list of posting blocks, there is a set of 
BlockInfos describing the blocks.  Some blocks are missing because they have 
no postings.  Blocks are predetermined to span a document number range rather 
than a range of bytes.  This avoids having to deal with document number 
overlap between blocks when an update occurs.  Right now I am thinking the 
default block size should be 4000 docs.  This should yield a maximum block 
size of a little over 4000 bytes.  It is unclear how the size of the block 
will affect the performance of the skip process.  
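
Because blocks span fixed doc-number ranges, locating a doc's block is a 
simple division rather than a byte-offset lookup, which is why updates never 
create overlap between blocks.  A hypothetical sketch (names are mine) using 
the proposed default of 4000 docs per block:

{code}
// Hypothetical sketch: with fixed doc-number ranges, the block that owns
// a doc id, and that block's maximum doc id, are pure arithmetic.
class BlockRanges {
  static final int BLOCK_SIZE = 4000; // proposed default docs per block

  static int blockIndex(int docId) {
    return docId / BLOCK_SIZE;
  }

  static int blockMaxDoc(int blockIndex) {
    return (blockIndex + 1) * BLOCK_SIZE - 1; // last doc id the block may hold
  }
}
{code}

A missing block (one with no postings) simply has no BlockInfo entry for its 
index; the arithmetic above is unaffected.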

> Tag Index
> ---------
>
>                 Key: LUCENE-1292
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1292
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.3.1
>            Reporter: Jason Rutherglen
>         Attachments: lucene-1292.patch
>
>
> The tag index addresses slow field cache loading, slow range queries, and 
> the need to reindex an entire document to update fields that are not 
> tokenized.  
> The tag index holds untokenized terms with a docfreq of 1 in a 
> term-dictionary-like index file.  The file also stores the docs per term, 
> similar to LUCENE-1278.  The index also has a transaction log and an 
> in-memory index for realtime updates to the tags.  The transaction log is 
> periodically merged into the existing tag term dictionary index file.
> The TagIndexReader extends IndexReader and is unified with a regular index 
> via ParallelReader.  There is a doc-id-to-terms skip pointer file for the 
> IndexReader.document method.  This file contains a pointer for looking up 
> the terms of a document.  
> A higher-level class encapsulates writing a document with tag fields to 
> IndexWriter and TagIndexWriter.  This requires a hook into IndexWriter to 
> coordinate doc ids and flushing segments to disk.  
> The writer class could be as simple as:
> {code}
> public class TagIndexWriter {
>   
>   public void add(Term term, DocIdSetIterator iterator) {
>   }
>   
>   public void delete(Term term, DocIdSetIterator iterator) {
>   }
> }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
