[jira] Commented: (LUCENE-1292) Tag Index

Christopher Morris (JIRA) Fri, 06 Jun 2008 02:19:10 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602964#action_12602964
 ]


Christopher Morris commented on LUCENE-1292:
--------------------------------------------

The dynamic index is an ordinary Lucene index, wrapped to resemble a dynamic 
index.

Each modification to a dynamic term creates a document. The document has two 
fields: one is the dynamic term field, the other is named "PK" followed by the 
dynamic term field name. The "PK" field contains the primary key followed by 
the term. The dynamic term field contains the dynamic term text followed by 
either ADD or DEL, with the term position representing the primary key. There 
can be multiple additions and deletions in the same document.
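
As an illustration only (this is not the code referred to in this comment), the 
document layout described above could be modelled in plain Java without any 
Lucene dependency. The class name ModificationDoc, the int primary key, and the 
literal suffixes "ADD" and "DEL" are assumptions made for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of one modification document; all names are illustrative. */
final class ModificationDoc {
    // Assumed suffixes: the description only says the term text is followed by ADD or DEL.
    static final String ADD = "ADD";
    static final String DEL = "DEL";

    /** One token of the dynamic term field; the position encodes the primary key. */
    static final class Token {
        final String text;
        final int position;

        Token(String text, int position) {
            this.text = text;
            this.position = position;
        }
    }

    final String field;
    final List<Token> termFieldTokens = new ArrayList<>(); // content of field `field`
    final List<String> pkFieldValues = new ArrayList<>();  // content of field "PK" + field

    ModificationDoc(String field) {
        this.field = field;
    }

    /** Name of the companion field: "PK" followed by the dynamic term field name. */
    String pkFieldName() {
        return "PK" + field;
    }

    ModificationDoc add(String term, int primaryKey) {
        return log(term, primaryKey, ADD);
    }

    ModificationDoc delete(String term, int primaryKey) {
        return log(term, primaryKey, DEL);
    }

    // A single document may carry several additions and deletions.
    private ModificationDoc log(String term, int primaryKey, String op) {
        termFieldTokens.add(new Token(term + op, primaryKey)); // term text followed by ADD/DEL
        pkFieldValues.add(primaryKey + term);                  // primary key followed by the term
        return this;
    }
}
```

For example, new ModificationDoc("folder").add("inbox", 7) yields the token 
"inboxADD" at position 7 and the value "7inbox" for the field "PKfolder".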

The indexReader.docFreq() for a dynamic term is the sum of the termDocs freq 
for the dynamic term's ADD entries minus the sum for its DEL entries. terms() 
is the underlying terms() for all fields not starting with "PK", filtered by 
whether the dynamic term still exists (docFreq() > 0). Retrieving the terms for 
a primary key/field combination uses the TermEnum over all terms with field 
("PK" + field) starting with text (primary key). Terms with an odd docFreq() 
still exist (they have been added more times than deleted). TermDocs uses 
TermPositions for ADD and DEL to seek through the index, toggling the primary 
keys between existing and not existing.
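
The arithmetic above can be sketched in plain Java. This is only an 
illustration under stated assumptions: the class DynamicTermMath and its method 
names are hypothetical, frequencies and positions are passed in as arrays 
rather than read from TermDocs/TermPositions, and each primary key is assumed 
to alternate strictly between addition and deletion.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper; stands in for values read from the underlying index. */
final class DynamicTermMath {

    /** docFreq = sum of the ADD frequencies minus sum of the DEL frequencies. */
    static int docFreq(int[] addFreqs, int[] delFreqs) {
        int total = 0;
        for (int f : addFreqs) total += f;
        for (int f : delFreqs) total -= f;
        return total;
    }

    /**
     * Toggles every primary key found at an ADD or DEL position.
     * Each int[] holds the positions (primary keys) from one modification document.
     * A key toggled an odd number of times is still tagged with the term.
     */
    static Set<Integer> liveKeys(List<int[]> addPositions, List<int[]> delPositions) {
        Set<Integer> live = new TreeSet<>();
        for (List<int[]> postings : List.of(addPositions, delPositions)) {
            for (int[] positions : postings) {
                for (int pk : positions) {
                    if (!live.add(pk)) {
                        live.remove(pk); // already present: toggle back to "not exist"
                    }
                }
            }
        }
        return live;
    }
}
```

Because only the parity of the toggles matters, the order in which the ADD and 
DEL postings are visited does not change the result in this sketch.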

To test performance I used the Enron corpus (~ 500,000 docs), which has a 
folder structure (3503 nodes, max depth ~6). I ran a query for each level in 
the hierarchy (PrefixQuery) and saved the results as a dynamic term.

The results for a TermQuery search for the dynamic term compared to the 
original query varied from identical to four times slower, in a shark's tooth 
pattern repeating every 125 queries. The shark's tooth pattern does not match 
folder depth (its cause is currently unknown).

I am currently running a similar test for dynamic terms that have been 
modified. As above, but all nodes are set to the results for the first node, 
then all but the first are set to the value of the second, etc. The last node 
will have been modified 3503 times. Modifying this amount of data is slow.

I should be able to release the code if you want a direct comparison. The 
external APIs are similar: startBulkLoad(), addTerm(term, primary key), 
deleteTerm(term, primary key), acceptBulkLoad().
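
To make the shape of that API concrete, here is a hypothetical Java sketch. 
Only the four method names come from this comment. The interface name 
DynamicTermWriter, the parameter types (String term, int primary key), and the 
buffering behaviour between startBulkLoad() and acceptBulkLoad() are 
assumptions and may differ from the actual code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical interface; only the method names are taken from the comment. */
interface DynamicTermWriter {
    void startBulkLoad();
    void addTerm(String term, int primaryKey);
    void deleteTerm(String term, int primaryKey);
    void acceptBulkLoad();
}

/** In-memory stand-in that groups the modifications of one bulk load together. */
final class BufferingDynamicTermWriter implements DynamicTermWriter {
    private List<String> pending;                            // null while no bulk load is open
    final List<List<String>> committed = new ArrayList<>();  // one entry per accepted bulk load

    @Override
    public void startBulkLoad() {
        if (pending != null) {
            throw new IllegalStateException("bulk load already started");
        }
        pending = new ArrayList<>();
    }

    @Override
    public void addTerm(String term, int primaryKey) {
        buffer("ADD", term, primaryKey);
    }

    @Override
    public void deleteTerm(String term, int primaryKey) {
        buffer("DEL", term, primaryKey);
    }

    @Override
    public void acceptBulkLoad() {
        requireOpen();
        committed.add(pending); // assumption: one bulk load maps to one modification document
        pending = null;
    }

    private void buffer(String op, String term, int primaryKey) {
        requireOpen();
        pending.add(term + op + "@" + primaryKey); // "@" is just a readable separator here
    }

    private void requireOpen() {
        if (pending == null) {
            throw new IllegalStateException("no bulk load in progress");
        }
    }
}
```

A caller would invoke startBulkLoad(), then any number of addTerm() and 
deleteTerm() calls, then acceptBulkLoad().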


> Tag Index
> ---------
>
>                 Key: LUCENE-1292
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1292
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.3.1
>            Reporter: Jason Rutherglen
>
> The problem the tag index solves is slow field cache loading and range 
> queries, and reindexing an entire document to update fields that are not 
> tokenized.  
> The tag index holds untokenized terms with a docfreq of 1 in a term 
> dictionary like index file.  The file also stores the docs per term, similar 
> to LUCENE-1278.  The index also has a transaction log and in memory index for 
> realtime updates to the tags.  The transaction log is periodically merged 
> into the existing tag term dictionary index file.
> The TagIndexReader extends IndexReader and is unified with a regular index by 
> ParallelReader.  There is a doc id to terms skip pointer file for the 
> IndexReader.document method.  This file contains a pointer for looking up the 
> terms for a document.  
> There is a higher level class that encapsulates writing a document with tag 
> fields to IndexWriter and TagIndexWriter.  This requires a hook into 
> IndexWriter to coordinate doc ids and flushing segments to disk.  
> The writer class could be as simple as:
> {code}
> public class TagIndexWriter {
>   
>   public void add(Term term, DocIdSetIterator iterator) {
>   }
>   
>   public void delete(Term term, DocIdSetIterator iterator) {
>   }
> }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
