[ 
https://issues.apache.org/jira/browse/LUCENE-2905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990638#comment-12990638
 ] 

Robert Muir commented on LUCENE-2905:
-------------------------------------

As a quick experiment, I compared simple64-varint with the default skipInterval 
(16) against one with a higher interval (32).

Total index size decreased from 4,520,488,986 bytes to 4,269,022,166 bytes.
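
For scale, the arithmetic behind that comparison can be sketched as below. This is only a back-of-the-envelope check using the byte counts quoted in this issue; the helper name `size_change` is made up for illustration, and comparing the saving with the skp size assumes the simple64 figures in the issue description come from the same index as the base run (the complete index sizes match).

```python
def size_change(base_bytes: int, patch_bytes: int) -> tuple:
    """Return (bytes saved, percent saved) going from base to patch."""
    saved = base_bytes - patch_bytes
    return saved, 100.0 * saved / base_bytes


BASE_INDEX = 4_520_488_986   # simple64-varint, skipInterval=16
PATCH_INDEX = 4_269_022_166  # simple64-varint, skipInterval=32
SKP_BASE = 473_293_042       # skp size listed for simple64 in the issue description

saved, pct = size_change(BASE_INDEX, PATCH_INDEX)
skp_share = 100.0 * saved / SKP_BASE

print(f"saved {saved:,} bytes ({pct:.2f}% of the index)")
print(f"equivalent to {skp_share:.1f}% of the original skip file")
```

The saving is a bit more than half of the original skip file, which is roughly what doubling the interval would be expected to give, though nothing here shows that every saved byte came from skip data.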

MMapDirectory:
||Query||QPS base||QPS patch||Pct diff||
|unit~0.7|29.83|28.84|{color:red}-3.3%{color}|
|states|36.26|35.49|{color:red}-2.1%{color}|
|+united +states|12.02|11.80|{color:red}-1.9%{color}|
|un*d|17.37|17.24|{color:red}-0.7%{color}|
|uni*|16.46|16.36|{color:red}-0.6%{color}|
|u*d|9.27|9.22|{color:red}-0.5%{color}|
|unit*|28.43|28.49|{color:green}0.2%{color}|
|"united states"|8.19|8.25|{color:green}0.8%{color}|
|doctitle:.*[Uu]nited.*|4.00|4.04|{color:green}1.0%{color}|
|doctimesecnum:[10000 TO 60000]|10.10|10.25|{color:green}1.5%{color}|
|unit~0.5|17.52|17.82|{color:green}1.7%{color}|
|spanNear([unit, state], 10, true)|31.77|32.33|{color:green}1.7%{color}|
|united~0.6|7.96|8.20|{color:green}3.0%{color}|
|united~0.75|10.94|11.49|{color:green}5.0%{color}|
|"united states"~3|4.14|4.40|{color:green}6.2%{color}|
|united states|11.21|12.07|{color:green}7.7%{color}|
|spanFirst(unit, 5)|107.66|116.32|{color:green}8.0%{color}|
|+nebraska +states|111.04|120.41|{color:green}8.4%{color}|


> Sep codec writes insane amounts of skip data
> --------------------------------------------
>
>                 Key: LUCENE-2905
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2905
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Robert Muir
>             Fix For: Bulk Postings branch
>
>
> Currently, even if we use better compression algorithms via Fixed or Variable 
> Intblock encodings, we have problems with both performance and index size 
> versus StandardCodec.
> Consider the following numbers:
> {noformat}
> standard:
> frq: 1,862,174,204 bytes
> prx: 1,146,898,936 bytes
> tib: 541,128,354 bytes
> complete index: 4,321,032,720 bytes
> bulkvint:
> doc: 1,297,215,588 bytes
> frq: 725,060,776 bytes
> pos: 1,163,335,609 bytes
> tib: 729,019,637 bytes
> complete index: 5,180,088,695 bytes
> simple64:
> doc: 1,260,869,240 bytes
> frq: 234,491,576 bytes
> pos: 1,055,024,224 bytes
> skp: 473,293,042 bytes
> tib: 725,928,817 bytes
> complete index: 4,520,488,986 bytes
> {noformat}
> I think there are several reasons for this:
> * Splitting into separate files (e.g. postings into .doc + .freq). 
> * Having to store both a relative delta to the block start, and an offset 
> into the block.
> * In a lot of cases various numbers involved are larger than they should be: 
> e.g. they are file pointer deltas, but blocksize is fixed...
> Here are some ideas (some are probably stupid) of things we could do to try 
> to fix this:
> Is Sep really necessary? Should we instead make an alternative to Sep, 
> "Interleaved", that interleaves doc and freq blocks (doc,freq,doc,freq) into 
> one file? The concrete impl could implement skipBlock() for callers that only 
> want docdeltas: e.g. for Simple64, blocks on disk are fixed size, so it could 
> just skip N bytes. Fixed Int Block codecs like PFOR and BulkVint just read 
> the single numBytes header they already have today, and skip numBytes.
> Isn't our skipInterval too low? Most of our codecs are using block sizes such 
> as 64 or 128, so a skipInterval of 16 seems a little overkill.
> Shouldn't skipInterval not even be a final constant in SegmentWriteState, but 
> instead completely private to the codec?
> For block codecs, doesn't it make sense for them to only support skipping to 
> the start of a block? Then their skip pointers don't need to be a combination 
> of delta + upto, because upto is always zero. What would we have to modify in 
> the bulkpostings api for jump() to work with this?
> For block codecs, shouldn't skipInterval then be some sort of divisor, based 
> on block size (maybe by default it's 1, meaning we can skip to the start of 
> every block)?
> For codecs like Simple64 that encode fixed length frames, shouldn't we use 
> 'blockid' instead of file pointer so that we get smaller numbers? e.g. 
> simple64 can do blockid * 8 to get to the file pointer.
> Going along with the blockid concept, couldn't pointers in the terms dict be 
> blockid deltas from the index term, instead of fp deltas? This would be 
> smaller numbers and we could compress this metadata better.
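
The blockid idea in the quoted description can be illustrated with a small sketch. It assumes pointers are written as Lucene-style VInts (7 payload bits per byte) and that every Simple64 block is exactly 8 bytes, as the description states. The function names and the sample deltas are hypothetical, and this is not the codec's actual code.

```python
BLOCK_BYTES = 8  # one Simple64 frame is a single 64-bit word


def vint_len(value: int) -> int:
    """Bytes needed to write a non-negative int as a VInt (7 payload bits per byte)."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    n = 1
    while value >= 0x80:
        value >>= 7
        n += 1
    return n


def block_id_to_fp(block_id: int) -> int:
    """Recover the file pointer from a block id: blockid * 8."""
    return block_id * BLOCK_BYTES


# Hypothetical block-id deltas between consecutive pointers.
for block_delta in (10, 16, 2_000, 300_000):
    fp_delta = block_id_to_fp(block_delta)
    print(
        f"block delta {block_delta:>7}: {vint_len(block_delta)} byte(s), "
        f"fp delta {fp_delta:>9}: {vint_len(fp_delta)} byte(s)"
    )
```

Dividing by 8 removes three bits from each value, so a pointer only gets shorter when those three bits push it across a 7-bit boundary. The gain per pointer therefore depends on the actual delta distribution, which this sketch does not measure.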


        
