[ https://issues.apache.org/jira/browse/LUCENE-9447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176554#comment-17176554 ]

Michael McCandless commented on LUCENE-9447:
--------------------------------------------

+1 to simply switch to a bigger default block size (256 KB seems good) for now.  
At least for this particular corpus, the reduction is massive (~33%).
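
For experimentation, here is a rough sketch of what a 256 KB block could look 
like through the expert {{CompressingStoredFieldsFormat}} constructor in 8.x. 
The codec/format names are made up, the maxDocsPerChunk value is just the 
current BEST_COMPRESSION default, and a real codec would also need to be 
registered via SPI to be readable:

{code:java}
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.StoredFieldsFormat;
import org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat;
import org.apache.lucene.codecs.compressing.CompressionMode;

// Delegates everything to the default codec but writes stored fields in
// 256 KB DEFLATE blocks instead of the current 60 kB BEST_COMPRESSION blocks.
public final class BigBlockCodec extends FilterCodec {

  private final StoredFieldsFormat storedFields =
      new CompressingStoredFieldsFormat(
          "BigBlockStoredFields",            // format name (made up)
          CompressionMode.HIGH_COMPRESSION,  // DEFLATE, same as BEST_COMPRESSION
          256 * 1024,                        // chunk (block) size: 256 KB
          512,                               // max docs per chunk (current default)
          10);                               // block shift for the chunk index

  public BigBlockCodec() {
    super("BigBlockCodec", Codec.getDefault());
  }

  @Override
  public StoredFieldsFormat storedFieldsFormat() {
    return storedFields;
  }
}
{code}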

It would be nice if we could auto-adapt the block size based on how 
compressible the stored fields really are, dynamically tuning the index size vs 
CPU cost of doc retrieval, but that can come later.

But, do we have any benchmarks that measure the CPU impact of retrieving stored 
fields?  That is the downside of compressing bigger blocks, right?  Higher 
per-hit decode cost (if the hit is in a new block).
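
A crude way to get a number would be to time random-access stored-document 
loads against an existing index, something like the sketch below (index path, 
seed and iteration count are placeholders; it assumes an index without 
deletions so any docID is live):

{code:java}
import java.nio.file.Paths;
import java.util.Random;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.FSDirectory;

// Times random stored-document retrievals; each random docID likely lands in a
// different block, so most loads pay the full block-decompression cost.
public class StoredFieldsRetrievalBench {
  public static void main(String[] args) throws Exception {
    try (DirectoryReader reader =
        DirectoryReader.open(FSDirectory.open(Paths.get(args[0])))) {
      Random random = new Random(42);
      int iters = 100_000;
      long fields = 0; // accumulate something so the loads cannot be optimized away
      long start = System.nanoTime();
      for (int i = 0; i < iters; i++) {
        fields += reader.document(random.nextInt(reader.maxDoc())).getFields().size();
      }
      long elapsedMs = (System.nanoTime() - start) / 1_000_000;
      System.out.println(iters + " random doc loads in " + elapsedMs
          + " ms (" + fields + " field instances)");
    }
  }
}
{code}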

E.g. Lucene's facet implementation relies on this, since resolving its int 
ordinals to human-friendly facet labels is done by loading a document for each 
ordinal.  [~gworah] is working on switching to doc values in LUCENE-9450 to 
reduce this cost.

Sharing the compression dictionary across blocks would be amazing, but that is 
surely complex, and would indeed likely reduce how often we could bulk-copy 
compressed blocks during merging.  But, maybe that is OK?  Increasing indexing 
cost in order to get a smaller index is often a good tradeoff?  Does {{zlib}} 
maybe support merging dictionaries / quickly re-writing a previously compressed 
output based on a new dictionary?  Maybe we (later!) could switch to a 
different implementation that would offer such "expert" APIs?
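
As a point of reference, zlib (and {{java.util.zip}} on top of it) does support 
preset dictionaries, but only at compression/decompression time; there is no 
API to re-target already-compressed output to a new dictionary, so merging 
would still have to re-compress. A minimal sketch, with made-up sample bytes:

{code:java}
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Demo of zlib preset dictionaries via java.util.zip.
public class PresetDictionaryDemo {
  public static void main(String[] args) throws DataFormatException {
    byte[] dictionary = "field names and values shared across blocks".getBytes();
    byte[] block = "field names and values shared across blocks, plus this block".getBytes();

    // Compress one block against the shared dictionary.
    Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
    deflater.setDictionary(dictionary);
    deflater.setInput(block);
    deflater.finish();
    byte[] compressed = new byte[block.length + 64];
    int compressedLen = deflater.deflate(compressed);
    deflater.end();

    // Decompress: the inflater signals that it needs the same dictionary.
    Inflater inflater = new Inflater();
    inflater.setInput(compressed, 0, compressedLen);
    byte[] restored = new byte[block.length];
    int n = inflater.inflate(restored);
    if (n == 0 && inflater.needsDictionary()) {
      inflater.setDictionary(dictionary);
      n = inflater.inflate(restored);
    }
    inflater.end();
    System.out.println(new String(restored, 0, n));
  }
}
{code}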

> Make BEST_COMPRESSION compress more aggressively?
> -------------------------------------------------
>
>                 Key: LUCENE-9447
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9447
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>
> The Lucene86 codec supports setting a "Mode" for stored fields compression, 
> that is either "BEST_SPEED", which translates to blocks of 16kB or 128 
> documents (whichever is hit first) compressed with LZ4, or 
> "BEST_COMPRESSION", which translates to blocks of 60kB or 512 documents 
> compressed with DEFLATE at the default compression level (6).
> After looking at indices that spent most disk space on stored fields 
> recently, I noticed that there was quite some room for improvement by 
> increasing the block size even further:
> ||Block size||Stored fields size (bytes)||
> |60kB|168412338|
> |128kB|130813639|
> |256kB|113587009|
> |512kB|104776378|
> |1MB|100367095|
> |2MB|98152464|
> |4MB|97034425|
> |8MB|96478746|
> For this specific dataset, I had 1M documents with about 2kB of stored 
> fields each and a fair amount of redundancy.
> This makes me want to look into bumping this block size to maybe 256kB. It 
> would be interesting to re-do the experiments we did on LUCENE-6100 to see 
> how this affects the merging speed. That said, I don't think it would be 
> terrible if the merging time increased a bit given that we already offer the 
> BEST_SPEED option for CPU-savvy users.
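
(For reference, the mode described above is picked at the codec level. A 
minimal sketch against the 8.6 API, with the analyzer and index path as 
placeholders:)

{code:java}
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.lucene50.Lucene50StoredFieldsFormat;
import org.apache.lucene.codecs.lucene86.Lucene86Codec;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

// Opting into DEFLATE-based stored fields compression (60 kB / 512-doc blocks).
public class BestCompressionExample {
  public static void main(String[] args) throws Exception {
    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
    config.setCodec(new Lucene86Codec(Lucene50StoredFieldsFormat.Mode.BEST_COMPRESSION));
    try (IndexWriter writer =
        new IndexWriter(FSDirectory.open(Paths.get("/tmp/idx")), config)) {
      // ... add documents ...
    }
  }
}
{code}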


