[ https://issues.apache.org/jira/browse/LUCENE-9827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17305712#comment-17305712 ]

Robert Muir commented on LUCENE-9827:
-------------------------------------

The problem for small segments is that we still do wasted work today: we'll 
recompress a segment with 2 dirty chunks, only to create a new dirty chunk. The 
change above doesn't alleviate the pain because we end up needlessly 
recompressing.

So I think, other things being equal, we shouldn't force recompression unless 
we can make a *clean chunk*; this means definite, permanent progress and less 
wasted work. There are two parts to the change:

1. Fix "numDirtyDocs" to really be exact number of suboptimally compressed 
documents:
{noformat}
   public void finish(FieldInfos fis, int numDocs) throws IOException {
     if (numBufferedDocs > 0) {
       numDirtyChunks++; // incomplete: we had to force this flush
-      final long expectedChunkDocs =
-          Math.min(
-              maxDocsPerChunk, (long) ((double) chunkSize / bufferedDocs.size() * numBufferedDocs));
-      numDirtyDocs += expectedChunkDocs - numBufferedDocs;
+      numDirtyDocs += numBufferedDocs;
       flush();
     } else {
       assert bufferedDocs.size() == 0;
{noformat}
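
To make the difference concrete, here is a minimal standalone sketch of the 
two accounting schemes (all values and variable names are made up for 
illustration, not the actual writer state):
{noformat}
// Standalone sketch, all values made up: suppose a forced flush happens with
// 3 buffered docs occupying 300 bytes, with chunkSize = 1000 bytes and
// maxDocsPerChunk = 128.
long chunkSize = 1000;
long maxDocsPerChunk = 128;
int numBufferedDocs = 3;
long bufferedBytes = 300; // stand-in for bufferedDocs.size()

// Old accounting: extrapolate how many docs a full chunk "would" have held,
// and count only the missing ones as dirty.
long expectedChunkDocs =
    Math.min(maxDocsPerChunk,
        (long) ((double) chunkSize / bufferedBytes * numBufferedDocs)); // 10
long oldDirtyDocs = expectedChunkDocs - numBufferedDocs; // 7 phantom docs

// New accounting: all 3 docs really did land in an incomplete chunk, so all
// 3 count as suboptimally compressed.
long newDirtyDocs = numBufferedDocs; // 3
{noformat}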

2. Fix the lower bound to only recompress if there are more than 
{{maxDocsPerChunk}} dirty documents. By definition this implies the previous 
{{getNumDirtyChunks() > 1}}, but it also guarantees we'll produce at least one 
clean chunk.
{noformat}
   boolean tooDirty(Lucene90CompressingStoredFieldsReader candidate) {
-    // more than 1% dirty, or more than hard limit of 1024 dirty chunks
+    // more than the hard limit of 1024 dirty chunks, or enough dirty docs to
+    // exceed 1% of the segment while still filling at least one clean chunk
     return candidate.getNumDirtyChunks() > 1024
-        || (candidate.getNumDirtyChunks() > 1
+        || (candidate.getNumDirtyDocs() > maxDocsPerChunk
             && candidate.getNumDirtyDocs() * 100 > candidate.getNumDocs());
   }
{noformat}
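
For illustration, here is a self-contained sketch of the new predicate with 
made-up numbers (a standalone method, not the actual reader API):
{noformat}
// Standalone sketch of the new predicate; dirtyChunks/dirtyDocs/numDocs are
// stand-ins for the candidate reader's getters, with maxDocsPerChunk = 1024.
static boolean tooDirty(long dirtyChunks, long dirtyDocs, long numDocs,
                        long maxDocsPerChunk) {
  return dirtyChunks > 1024
      || (dirtyDocs > maxDocsPerChunk && dirtyDocs * 100 > numDocs);
}

public static void main(String[] args) {
  // 900 dirty docs can't fill one clean 1024-doc chunk: leave the segment be.
  System.out.println(tooDirty(2, 900, 50_000, 1024));   // false
  // 3000 dirty docs guarantee at least two clean chunks: recompress.
  System.out.println(tooDirty(3, 3_000, 50_000, 1024)); // true
}
{noformat}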

Speedup is 10x for lz4 and 23x for deflate:

Index first 100k docs, flush on every doc:
||compression||branch||index time||index size (snapshot)||
|lz4|trunk|737.7s|10,540kb|
|lz4|patch|73.3s|10,612kb|
|deflate|trunk|1923.0s|6,716kb|
|deflate|patch|83.7s|7,096kb|

Keep in mind the size is just an arbitrary snapshot without force merge or 
anything, just showing that it stays in bounds. The size impact on the index 
is bounded by {{sizeof(doc) * maxDocsPerChunk * numSegments}}. For reference, 
{{maxDocsPerChunk}} defaults to 1024 for BEST_SPEED and 4096 for 
BEST_COMPRESSION.
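
As a rough illustration of that bound (all numbers made up; ~1kb per doc is 
just an assumption):
{noformat}
// Hypothetical worst case: ~1kb per stored doc, BEST_SPEED's
// maxDocsPerChunk of 1024, and 10 small segments whose recompression
// was skipped.
long docBytes = 1024;
long maxDocsPerChunk = 1024;
long numSegments = 10;
long worstCaseExtra = docBytes * maxDocsPerChunk * numSegments; // ~10mb
{noformat}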




> Small segments are slower to merge due to stored fields since 8.7
> -----------------------------------------------------------------
>
>                 Key: LUCENE-9827
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9827
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Adrien Grand
>            Priority: Minor
>         Attachments: Indexer.java, log-and-lucene-9827.patch, 
> merge-count-by-num-docs.png, merge-type-by-version.png, 
> total-merge-time-by-num-docs-on-small-segments.png, 
> total-merge-time-by-num-docs.png
>
>
> [~dm] and [~dimitrisli] looked into an interesting case where indexing slowed 
> down after upgrading to 8.7. After digging we identified that this was due to 
> the merging of stored fields, which had become slower on average.
> This is due to changes to stored fields, which now have top-level blocks that 
> are then split into sub-blocks and compressed using shared dictionaries (one 
> dictionary per top-level block). As the top-level blocks are larger than they 
> were before, segments are more likely to be considered "dirty" by the merging 
> logic. Dirty segments are segments where 1% of the data or more consists of 
> incomplete blocks. For large segments, the size of blocks doesn't really 
> affect the dirtiness of segments: if you flush a segment that has 100 blocks 
> or more, it will never be considered dirty as only the last block may be 
> incomplete. But for small segments it does: for instance if your segment is 
> only 10 blocks, it is very likely considered dirty given that the last block 
> is always incomplete. And the fact that we increased the top-level block size 
> means that segments that used to be considered clean might now be considered 
> dirty.
> And indeed benchmarks reported that while large stored fields merges became 
> slightly faster after upgrading to 8.7, the smaller merges actually became 
> slower. See attached chart, which gives the total merge time as a function of 
> the number of documents in the segment.
> I don't know how we can address this, this is a natural consequence of the 
> larger block size, which is needed to achieve better compression ratios. But 
> I wanted to open an issue about it in case someone has a bright idea how we 
> could make things better.


