[jira] [Created] (LUCENE-5179) Refactoring on PostingsWriterBase for delta-encoding

Han Jiang (JIRA) Fri, 16 Aug 2013 07:11:40 -0700

Han Jiang created LUCENE-5179:
---------------------------------

             Summary: Refactoring on PostingsWriterBase for delta-encoding
                 Key: LUCENE-5179
                 URL: https://issues.apache.org/jira/browse/LUCENE-5179
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Han Jiang
            Assignee: Han Jiang
             Fix For: 5.0, 4.5



A further step from LUCENE-5029.

The short story is that the previous API change brings two problems:
* it somewhat breaks backward compatibility: although we can still read the old format,
  we can no longer reproduce it;
* the pulsing codec has a problem with it.

And the long story...

With the change, the current PostingsBase API will be like this:

* the term dict tells the PBF that a new term starts (via startTerm());
* the PBF adds docs, positions and other postings data;
* the term dict tells the PBF that all the data for the current term is complete (via finishTerm()),
  then the PBF returns the metadata for the current term (as long[] and byte[]);
* the term dict might buffer all the metadata in an ArrayList. When all the terms are collected,
  it then decides how that metadata will be laid out on disk.

So after the API change, the PBF no longer has that annoying 'flushTermBlock', and instead
the term dict maintains the <term, metadata> list.
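
To make the flow above concrete, here is a minimal sketch. All class names (TermMeta, ToyPostingsWriter, ToyTermDict) and the fake file-pointer arithmetic are made up for illustration; this is not the real Lucene API, only the shape of the startTerm()/finishTerm() handshake and the buffering on the term dict side.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the per-term metadata returned by the PBF.
final class TermMeta {
    final String term;
    final long[] longs; // monotonic values, e.g. file pointers
    final byte[] bytes; // opaque per-term blob

    TermMeta(String term, long[] longs, byte[] bytes) {
        this.term = term;
        this.longs = longs;
        this.bytes = bytes;
    }
}

// Hypothetical, heavily simplified postings writer (the "PBF").
final class ToyPostingsWriter {
    private long docFP = 0;   // pretend file pointer into the doc file
    private long termStartFP; // file pointer where the current term starts
    private int docCount;

    void startTerm() {
        termStartFP = docFP;
        docCount = 0;
    }

    void addDoc(int docID) {
        docFP += 4; // assumption: every doc costs 4 bytes
        docCount++;
    }

    // Returns the metadata instead of flushing it itself.
    TermMeta finishTerm(String term) {
        return new TermMeta(term, new long[] {termStartFP}, new byte[] {(byte) docCount});
    }
}

// Hypothetical term dict: drives the PBF and buffers <term, metadata>.
final class ToyTermDict {
    private final ToyPostingsWriter pbf = new ToyPostingsWriter();
    final List<TermMeta> pending = new ArrayList<>();

    void addTerm(String term, int[] docIDs) {
        pbf.startTerm();
        for (int docID : docIDs) {
            pbf.addDoc(docID);
        }
        pending.add(pbf.finishTerm(term)); // on-disk layout is decided later
    }
}
```

In this sketch the term dict owns the pending list, so it is free to decide later in which order and in which block each entry is written.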

However, for each term we'll now write the long[] blob before the byte[], so the index
format is not consistent with pre-4.5.
For example, in Lucene41 the metadata can be written as longA,bytesA,longB, but now we
have to write it as longA,longB,bytesA.
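
A toy illustration of why the two orderings are not interchangeable on disk. The class and method names are made up, and plain fixed-width longs via java.nio.ByteBuffer are used here only to keep the sketch short; the real format uses Lucene's own DataOutput and variable-length encodings.

```java
import java.nio.ByteBuffer;

// Toy illustration only; not the real Lucene41 on-disk encoding.
final class MetadataLayoutDemo {
    // Interleaved layout: longA, bytesA, longB.
    static byte[] writeInterleaved(long longA, byte[] bytesA, long longB) {
        return ByteBuffer.allocate(16 + bytesA.length)
                .putLong(longA)
                .put(bytesA)
                .putLong(longB)
                .array();
    }

    // Longs-first layout: longA, longB, bytesA.
    static byte[] writeLongsFirst(long longA, long longB, byte[] bytesA) {
        return ByteBuffer.allocate(16 + bytesA.length)
                .putLong(longA)
                .putLong(longB)
                .put(bytesA)
                .array();
    }
}
```

Both methods emit the same number of bytes for the same values, but the blob sits at a different offset, so a reader written for one layout cannot parse the other.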

Another problem is that the pulsing codec cannot tell the wrapped PBF how the metadata is
delta-encoded; after all, PulsingPostingsWriter is only a PBF.

For example, we have terms=["a", "a1", "a2", "b", "b1", "b2"] and itemsInBlock=2, so theoretically
we'll finally have three blocks in BTTR: ["a", "b"]  ["a1", "a2"]  ["b1", "b2"]. With this
approach, the metadata of term "b" is delta-encoded based on the metadata of "a", but when the
term dict tells the PBF to finishTerm("b"), it might naively do the delta encoding based on
term "a2".
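
The mismatch can be shown with a small sketch. The file pointers used in the note below are invented numbers, and DeltaBaseDemo is a hypothetical helper, not Lucene code; it only shows that a delta computed against the wrong base decodes to the wrong value.

```java
// Hypothetical helper: delta-encodes each value against the previous one
// in the given order; the first value is encoded against 0 (absolute).
final class DeltaBaseDemo {
    static long[] deltaEncode(long[] fps) {
        long[] deltas = new long[fps.length];
        long prev = 0;
        for (int i = 0; i < fps.length; i++) {
            deltas[i] = fps[i] - prev;
            prev = fps[i];
        }
        return deltas;
    }

    // Inverse of deltaEncode: running sum of the deltas.
    static long[] deltaDecode(long[] deltas) {
        long[] fps = new long[deltas.length];
        long prev = 0;
        for (int i = 0; i < deltas.length; i++) {
            prev += deltas[i];
            fps[i] = prev;
        }
        return fps;
    }
}
```

Suppose the terms finish with file pointers a=0, a1=10, a2=25, b=40. Encoding in finish order gives "b" a delta of 15 (against "a2"), while the block ["a", "b"] needs a delta of 40 (against "a"). Decoding {0, 15} as that block yields 15 for "b" instead of 40.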

So I think maybe we can introduce a method 'encodeTerm(long[], DataOutput out,
FieldInfo, TermState, boolean absolute)', so that during metadata flush we can control how
the current term is written? The term dict will then buffer TermState, which
implicitly holds the metadata, like we do on the PBReader side.
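
A rough sketch of the idea, under loud assumptions: ToyTermState and ToyEncoder are made-up names, the DataOutput and FieldInfo parameters are left out to keep it short, and a single file pointer stands in for all the metadata. It is not the proposed patch, only the way the 'absolute' flag lets the term dict choose the delta base at flush time.

```java
import java.util.List;

// Hypothetical buffered state: holds the absolute metadata for one term.
final class ToyTermState {
    final String term;
    final long docStartFP;

    ToyTermState(String term, long docStartFP) {
        this.term = term;
        this.docStartFP = docStartFP;
    }
}

// Hypothetical PBF-side encoder; simplified from the proposed signature.
final class ToyEncoder {
    private long lastDocStartFP; // base for the next delta

    // Fills longs[0]; when absolute is true the delta base is reset first.
    void encodeTerm(long[] longs, ToyTermState state, boolean absolute) {
        if (absolute) {
            lastDocStartFP = 0;
        }
        longs[0] = state.docStartFP - lastDocStartFP;
        lastDocStartFP = state.docStartFP;
    }

    // Term dict side: flush one block, first term absolute, rest as deltas.
    static long[] flushBlock(ToyEncoder encoder, List<ToyTermState> block) {
        long[] encoded = new long[block.size()];
        long[] longs = new long[1];
        for (int i = 0; i < block.size(); i++) {
            encoder.encodeTerm(longs, block.get(i), i == 0);
            encoded[i] = longs[0];
        }
        return encoded;
    }
}
```

Because encoding happens when the block is flushed rather than in finishTerm(), "b" is encoded against "a" whenever the term dict places them in the same block.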

For example, if we want to reproduce the old Lucene41 format, we can simply set
longsSize==0; then the PBF writes the old format (longA,bytesA,longB) to the DataOutput,
and the compatibility issue is solved.
For the pulsing codec, it will also be able to tell the lower level how to encode
metadata.
