Sep codec writes insane amounts of skip data
--------------------------------------------

                 Key: LUCENE-2905
                 URL: https://issues.apache.org/jira/browse/LUCENE-2905
             Project: Lucene - Java
          Issue Type: Bug
            Reporter: Robert Muir
             Fix For: Bulk Postings branch


Currently, even if we use better compression algorithms via Fixed or Variable
Intblock encodings, we have problems with both performance and index size
versus StandardCodec.

Consider the following numbers:

{noformat}
standard:
frq: 1,862,174,204 bytes
prx: 1,146,898,936 bytes
tib: 541,128,354 bytes
complete index: 4,321,032,720 bytes

bulkvint:
doc: 1,297,215,588 bytes
frq: 725,060,776 bytes
pos: 1,163,335,609 bytes
tib: 729,019,637 bytes
complete index: 5,180,088,695 bytes

simple64:
doc: 1,260,869,240 bytes
frq: 234,491,576 bytes
pos: 1,055,024,224 bytes
skp: 473,293,042 bytes
tib: 725,928,817 bytes
complete index: 4,520,488,986 bytes
{noformat}
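
To put the simple64 numbers in proportion, here is a small arithmetic sketch using only the figures above. The class and method names are made up for illustration. Note that StandardCodec keeps its skip data inside the frq file, so this is not an apples-to-apples comparison of skip data alone.

```java
/** Arithmetic on the figures quoted above; names are hypothetical. */
final class SkipShareSketch {
    static final long STANDARD_TOTAL = 4_321_032_720L; // standard: complete index
    static final long SIMPLE64_TOTAL = 4_520_488_986L; // simple64: complete index
    static final long SIMPLE64_SKP = 473_293_042L;     // simple64: skp file

    /** Fraction of the simple64 index taken up by the skp file. */
    static double skpShare() {
        return (double) SIMPLE64_SKP / SIMPLE64_TOTAL;
    }

    /** How many bytes larger the simple64 index is than the standard one. */
    static long overStandard() {
        return SIMPLE64_TOTAL - STANDARD_TOTAL;
    }
}
```

The skp file is roughly 10.5% of the simple64 index, and at 473,293,042 bytes it is more than twice the 199,456,266-byte gap to the standard index.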

I think there are several reasons for this:
* Splitting into separate files (e.g. postings into .doc + .frq).
* Having to store both a relative delta to the block start, and an offset into 
the block.
* In a lot of cases the numbers involved are larger than they should be: 
e.g. they are file pointer deltas, even though blocksize is fixed...

Here are some ideas (some are probably stupid) of things we could do to try to 
fix this:

Is Sep really necessary? Should we instead make an alternative to Sep, 
"Interleaved", that interleaves doc and freq blocks (doc,freq,doc,freq) into 
one file? The concrete impl could implement skipBlock() for when the caller 
only wants docdeltas: e.g. for Simple64, blocks on disk are fixed size, so it 
could just skip N bytes. Fixed Int Block codecs like PFOR and BulkVint would 
just read the single numBytes header they already have today, and skip numBytes.
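
A rough sketch of what skipping a block in an interleaved file could look like. This is not the branch's API: the class, the method names, and the layout (a 4-byte numBytes header followed by the payload) are assumptions for illustration, and java.nio.ByteBuffer stands in for Lucene's IndexInput.

```java
import java.nio.ByteBuffer;

/** Hypothetical helpers for a file laid out as doc,freq,doc,freq,... blocks. */
final class InterleavedSketch {

    /** Fixed-size frames (Simple64-like): every block is exactly frameBytes long. */
    static void skipFixedBlock(ByteBuffer in, int frameBytes) {
        in.position(in.position() + frameBytes);
    }

    /** Variable-size blocks (assumed layout): read the numBytes header, then jump over the payload. */
    static void skipHeaderBlock(ByteBuffer in) {
        int numBytes = in.getInt();
        in.position(in.position() + numBytes);
    }

    /** Reads the payload of a header-prefixed block. */
    static byte[] readHeaderBlock(ByteBuffer in) {
        int numBytes = in.getInt();
        byte[] payload = new byte[numBytes];
        in.get(payload);
        return payload;
    }

    /** Writes a header-prefixed block: numBytes, then the payload. */
    static void writeHeaderBlock(ByteBuffer out, byte[] payload) {
        out.putInt(payload.length);
        out.put(payload);
    }
}
```

A consumer that only wants docdeltas would alternate readHeaderBlock() for each doc block with skipHeaderBlock() for the freq block that follows it.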

Isn't our skipInterval too low? Most of our codecs are using block sizes such 
as 64 or 128, so a skipInterval of 16 seems a little overkill.

Shouldn't skipInterval not even be a final constant in SegmentWriteState, but 
instead completely private to the codec?

For block codecs, doesn't it make sense for them to only support skipping to 
the start of a block? Then, their skip pointers don't need to be a combination 
of delta + upto, because upto is always zero. What would we have to modify in 
the bulkpostings api for jump() to work with this?

For block codecs, shouldn't skipInterval then be some sort of divisor, based on 
block size (maybe by default it's 1, meaning we can skip to the start of every 
block)?
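
To illustrate the two points above, here is a simplified sketch that only looks at the lowest skip level and ignores multi-level skipping. The class name, the method names, and the blocksPerSkip parameter are hypothetical.

```java
/** Hypothetical arithmetic for block-aligned skip points; lowest skip level only. */
final class BlockSkipSketch {

    /** Skip entries written when one entry is emitted every skipInterval docs. */
    static long entriesPerDocInterval(long docFreq, int skipInterval) {
        return docFreq / skipInterval;
    }

    /** Skip entries written when one entry is emitted every blocksPerSkip full blocks. */
    static long entriesPerBlockInterval(long docFreq, int blockSize, int blocksPerSkip) {
        return docFreq / ((long) blockSize * blocksPerSkip);
    }

    /** Offset into the block ("upto") at which the doc with this ordinal sits. */
    static int uptoAt(long docOrdinal, int blockSize) {
        return (int) (docOrdinal % blockSize);
    }
}
```

For a term with 1,000,000 docs, a skipInterval of 16 gives 62,500 entries, while one entry per 128-doc block gives 7,812. With skipInterval 16 and block size 128 a skip point can land at upto 48, whereas block-aligned skip points always have upto 0, so that field would not need to be stored.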

For codecs like Simple64 that encode fixed length frames, shouldn't we use 
'blockid' instead of file pointer so that we get smaller numbers? e.g. simple64 
can do blockid * 8 to get to the file pointer.
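
A small sketch of why a blockid is cheaper to store than a file pointer, assuming the value is written as a variable-length int (7 payload bits per byte, as Lucene's vInt/vLong format does). The class and method names are made up for illustration.

```java
/** Hypothetical helper comparing vInt-encoded sizes of file pointers and block ids. */
final class BlockIdSketch {
    /** Simple64 frames are one 64-bit long, i.e. 8 bytes on disk. */
    static final int FRAME_BYTES = 8;

    /** Number of bytes needed to write a non-negative value with 7 payload bits per byte. */
    static int vIntBytes(long value) {
        int bytes = 1;
        while ((value & ~0x7FL) != 0) {
            value >>>= 7;
            bytes++;
        }
        return bytes;
    }

    /** Recovers the file pointer from a block id. */
    static long filePointer(long blockId) {
        return blockId * FRAME_BYTES;
    }
}
```

For example, a file pointer delta of 1016 needs two bytes, while the same position as a blockid delta of 127 fits in one. Dividing by 8 frees three bits per stored value.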

Going along with the blockid concept, couldn't pointers in the terms dict be 
blockid deltas from the index term, instead of fp deltas? This would be smaller 
numbers and we could compress this metadata better.

