[ https://issues.apache.org/jira/browse/LUCENE-5646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990259#comment-13990259 ]

Robert Muir commented on LUCENE-5646:
-------------------------------------

Perhaps the reason I fall "out of sync" is that the first segment ended on a 
non-chunk boundary (I have no deletions).

So when the merge moves to the next segment, it falls out of sync and never 
"recovers". I'm not sure what we can do here: it seems that unless you have very 
large docs, you aren't going to get a "pure bulk copy" even with my fix, because 
the chances of everything aligning are quite slim.
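
Here is a tiny standalone toy model of the problem (my own sketch, not the actual 
Lucene code: flushing is modeled purely on a doc-count limit, byte-size limits and 
deletions are ignored, and the chunk sizes are made up). Once a segment leaves a 
partial chunk buffered, every later chunk starts off a boundary and the bulk-copy 
check never passes again:

{code}
// Toy model only: flush is triggered purely by a doc-count limit.
public class ChunkAlignmentSketch {
  static final int MAX_DOCS_PER_CHUNK = 128; // stand-in for the real constant

  public static void main(String[] args) {
    int pendingDocs = 0; // docs buffered in the merged writer, not yet flushed
    int[][] segments = {
        {128, 128, 100},      // segment 1: last chunk is short, ends off a boundary
        {128, 128, 128, 128}  // segment 2: perfectly aligned chunks
    };
    for (int seg = 0; seg < segments.length; seg++) {
      for (int chunkDocs : segments[seg]) {
        boolean onChunkBoundary = pendingDocs == 0;
        boolean chunkLargeEnough = chunkDocs >= MAX_DOCS_PER_CHUNK;
        boolean bulkCopy = onChunkBoundary && chunkLargeEnough;
        System.out.printf("segment=%d chunkDocs=%d bulkCopy=%b%n",
            seg + 1, chunkDocs, bulkCopy);
        if (!bulkCopy) {
          // doc-at-a-time path: docs are re-buffered and re-chunked by the writer
          pendingDocs = (pendingDocs + chunkDocs) % MAX_DOCS_PER_CHUNK;
        }
        // on bulk copy the compressed chunk is copied verbatim and the writer
        // stays on a chunk boundary, so pendingDocs is unchanged
      }
    }
  }
}
{code}

In that toy run, segment 2 never bulk-copies again even though its own chunks are 
perfectly aligned.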

Maybe there is a way we could (temporarily, for that merge) force a flush() at 
segment transitions to avoid this, so that the optimization would keep going, 
and we could then recombine the short chunks in a later merge to eventually 
recover?
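
A follow-up sketch of that idea on the same toy model (the forced flush here is 
just a stand-in, not the real writer API): flushing whatever is pending at each 
segment transition realigns the writer, at the cost of one short chunk per 
boundary that a later merge could recombine.

{code}
// Same toy model as above, with a forced flush at each segment transition.
public class ForcedFlushSketch {
  static final int MAX_DOCS_PER_CHUNK = 128; // stand-in for the real constant

  public static void main(String[] args) {
    int pendingDocs = 0;
    int[][] segments = {
        {128, 128, 100},      // segment 1 still ends off a boundary
        {128, 128, 128, 128}  // segment 2 is aligned
    };
    for (int[] segment : segments) {
      if (pendingDocs > 0) {
        pendingDocs = 0; // forced flush: emit one short chunk, start the segment clean
      }
      for (int chunkDocs : segment) {
        boolean bulkCopy = pendingDocs == 0 && chunkDocs >= MAX_DOCS_PER_CHUNK;
        System.out.printf("chunkDocs=%d bulkCopy=%b%n", chunkDocs, bulkCopy);
        if (!bulkCopy) {
          pendingDocs = (pendingDocs + chunkDocs) % MAX_DOCS_PER_CHUNK;
        }
      }
    }
  }
}
{code}

With the forced flush, all of segment 2's chunks go through the bulk-copy path 
again in the toy model.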

> stored fields bulk merging doesn't quite work right
> ---------------------------------------------------
>
>                 Key: LUCENE-5646
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5646
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>             Fix For: 4.9, 5.0
>
>
> From doing some profiling of merging:
> CompressingStoredFieldsWriter has 3 code paths (as I see it):
> 1. optimized bulk copy (no deletions in chunk): the compressed data is copied 
> over as-is.
> 2. semi-optimized copy: in this case it's optimized for an existing 
> StoredFieldsWriter, and it decompresses and recompresses doc-at-a-time around 
> any deleted docs in the chunk.
> 3. ordinary merging.
> In my dataset, I only see #2 happening, never #1. The logic for determining 
> whether we can do #1 seems to be:
> {code}
> onChunkBoundary && chunkSmallEnough && chunkLargeEnough && noDeletions
> {code}
> I think the logic for "chunkLargeEnough" is out of sync with the 
> MAX_DOCUMENTS_PER_CHUNK limit? E.g. instead of:
> {code}
> startOffsets[it.chunkDocs - 1] + it.lengths[it.chunkDocs - 1] >= chunkSize // chunk is large enough
> {code}
> it should be something like:
> {code}
> (it.chunkDocs >= MAX_DOCUMENTS_PER_CHUNK
>     || startOffsets[it.chunkDocs - 1] + it.lengths[it.chunkDocs - 1] >= chunkSize) // chunk is large enough
> {code}
> But this only works "at first" and then falls out of sync in my tests. Once that 
> happens, it never reverts to the #1 algorithm and sticks with #2. So it's still 
> not quite right.
> Maybe [~jpountz] knows off the top of his head...


