[jira] Commented: (LUCENE-2905) Sep codec writes insane amounts of skip data

Robert Muir (JIRA) Fri, 04 Feb 2011 09:38:57 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-2905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990642#comment-12990642 ]


Robert Muir commented on LUCENE-2905:
-------------------------------------

Here's SimpleFSDirectory:

||Query||QPS base||QPS patch||Pct diff||
|doctimesecnum:[10000 TO 60000]|8.93|8.80|{color:red}-1.4%{color}|
|states|31.27|31.19|{color:red}-0.3%{color}|
|spanNear([unit, state], 10, true)|23.34|23.34|{color:red}-0.0%{color}|
|unit*|25.77|25.84|{color:green}0.3%{color}|
|unit~0.7|14.18|14.31|{color:green}0.9%{color}|
|uni*|14.21|14.38|{color:green}1.3%{color}|
|"united states"|6.53|6.64|{color:green}1.6%{color}|
|unit~0.5|8.19|8.37|{color:green}2.2%{color}|
|un*d|13.12|13.44|{color:green}2.4%{color}|
|united~0.6|4.34|4.46|{color:green}2.8%{color}|
|u*d|5.88|6.05|{color:green}2.9%{color}|
|+united +states|10.17|10.47|{color:green}2.9%{color}|
|"united states"~3|3.77|3.89|{color:green}3.1%{color}|
|united~0.75|6.95|7.23|{color:green}4.0%{color}|
|doctitle:.*[Uu]nited.*|2.33|2.47|{color:green}6.0%{color}|
|spanFirst(unit, 5)|91.85|98.12|{color:green}6.8%{color}|
|united states|9.91|10.61|{color:green}7.0%{color}|
|+nebraska +states|63.09|72.34|{color:green}14.7%{color}|
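
For anyone re-deriving the last column: Pct diff appears to be the relative QPS change of the patch against the base. That formula is an assumption, since the benchmark output does not state it. A minimal Java sketch, with illustrative class and method names:

```java
final class PctDiff {
    /** Relative QPS change of patch vs. base, in percent (assumed formula). */
    static double pctDiff(double baseQps, double patchQps) {
        return 100.0 * (patchQps - baseQps) / baseQps;
    }

    public static void main(String[] args) {
        // Values taken from the "+nebraska +states" row of the table above.
        double diff = pctDiff(63.09, 72.34);
        System.out.println("+nebraska +states: " + diff + " %");
    }
}
```

The published percentages were presumably computed from unrounded QPS values, so recomputing them from the rounded columns can differ in the last digit.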


> Sep codec writes insane amounts of skip data
> --------------------------------------------
>
>                 Key: LUCENE-2905
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2905
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Robert Muir
>             Fix For: Bulk Postings branch
>
>
> Currently, even if we use better compression algorithms via Fixed or Variable
> Intblock encodings, we have problems with both performance and index size
> versus StandardCodec.
> Consider the following numbers:
> {noformat}
> standard:
> frq: 1,862,174,204 bytes
> prx: 1,146,898,936 bytes
> tib: 541,128,354 bytes
> complete index: 4,321,032,720 bytes
> bulkvint:
> doc: 1,297,215,588 bytes
> frq: 725,060,776 bytes
> pos: 1,163,335,609 bytes
> tib: 729,019,637 bytes
> complete index: 5,180,088,695 bytes
> simple64:
> doc: 1,260,869,240 bytes
> frq: 234,491,576 bytes
> pos: 1,055,024,224 bytes
> skp: 473,293,042 bytes
> tib: 725,928,817 bytes
> complete index: 4,520,488,986 bytes
> {noformat}
> I think there are several reasons for this:
> * Splitting into separate files (e.g. postings into .doc + .frq).
> * Having to store both a relative delta to the block start, and an offset
> into the block.
> * In a lot of cases various numbers involved are larger than they should be:
> e.g. they are file pointer deltas, but blocksize is fixed...
> Here are some ideas (some are probably stupid) of things we could do to try
> to fix this:
> * Is Sep really necessary? Should we instead make an alternative to Sep,
> "Interleaved", that interleaves doc and freq blocks (doc,freq,doc,freq) into
> one file? The concrete impl could implement skipBlock() for when callers only
> want docdeltas: e.g. for Simple64, blocks on disk are fixed size, so it could
> just skip N bytes. Fixed Int Block codecs like PFOR and BulkVint just read
> the single numBytes header they already have today, and skip numBytes.
> * Isn't our skipInterval too low? Most of our codecs use block sizes such
> as 64 or 128, so a skipInterval of 16 seems like overkill.
> * Shouldn't skipInterval not even be a final constant in SegmentWriteState, but
> instead be completely private to the codec?
> * For block codecs, doesn't it make sense for them to only support skipping to
> the start of a block? Then their skip pointers don't need to be a combination
> of delta + upto, because upto is always zero. What would we have to modify in
> the bulkpostings API for jump() to work with this?
> * For block codecs, shouldn't skipInterval then be some sort of divisor based
> on block size (maybe by default it's 1, meaning we can skip to the start of
> every block)?
> * For codecs like Simple64 that encode fixed-length frames, shouldn't we use
> 'blockid' instead of file pointer so that we get smaller numbers? E.g.
> Simple64 can do blockid * 8 to get to the file pointer.
> * Going along with the blockid concept, couldn't pointers in the terms dict be
> blockid deltas from the index term, instead of fp deltas? These would be
> smaller numbers, and we could compress this metadata better.
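
To make the blockid idea from the description concrete, here is a rough Java sketch. It assumes each Simple64 frame is one 64-bit long (8 bytes), as the "blockid * 8" remark implies, and that pointers are written with a 7-bits-per-byte variable-length encoding in the style of Lucene's VInt/VLong. The class and method names are hypothetical and are not part of the Sep codec or the bulkpostings API.

```java
final class BlockIdPointers {
    // Assumption: one Simple64 frame occupies exactly one 64-bit long on disk.
    static final int FRAME_BYTES = 8;

    /** Recover the file pointer from a block id (blockid * 8). */
    static long filePointer(long blockId) {
        return blockId * FRAME_BYTES;
    }

    /** Bytes needed to store a non-negative value at 7 payload bits per byte. */
    static int vLongBytes(long value) {
        int bytes = 1;
        while ((value & ~0x7FL) != 0) {
            value >>>= 7;
            bytes++;
        }
        return bytes;
    }

    public static void main(String[] args) {
        long blockIdDelta = 16;                    // skip ahead 16 frames
        long fpDelta = filePointer(blockIdDelta);  // same skip as a byte offset
        System.out.println("blockid delta " + blockIdDelta
                + " needs " + vLongBytes(blockIdDelta) + " byte(s)");
        System.out.println("fp delta " + fpDelta
                + " needs " + vLongBytes(fpDelta) + " byte(s)");
    }
}
```

Because the file pointer is always the block id times 8, the pointer form carries three extra low bits that are always zero, which is enough to push some deltas over a 7-bit boundary and cost an extra byte per skip entry.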

