[ 
https://issues.apache.org/jira/browse/LUCENE-5647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-5647:
--------------------------------
    Attachment: LUCENE-5647.patch

Initial patch.

If you look at the current logic, it tries to optimize this, but doesn't have 
any safety switch. It also doesn't always work: it sometimes falls back to the 
default algorithm (even when there are no deletions). This is because of this 
conditional logic in trunk, comments mine:
{code}
// we currently never bulk merge the last chunk.
// if this one is not "full" and leaves pending docs, we never resync.
if (docBase + chunkDocs < maxDoc) ...
{code}

When there are deletions it will always fall back and never resync, so 
handling that case currently doesn't achieve anything except complexity.

Falling out of sync hurts documents with small term vectors more, because the 
default algorithm is slow (no getMergeInstance). With big documents (size > 
chunkSize), you also dodge the resync problem above.
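
To make the resync problem concrete, here is a toy model in Java. It is not 
the Lucene code: it assumes a hypothetical writer that flushes after a fixed 
number of buffered docs (DOCS_PER_CHUNK, whereas the text above talks about a 
size-based chunkSize), it ignores deletions, and it only counts how many 
chunks could be bulk copied. The class and method names are made up for 
illustration.

```java
import java.util.List;

/** Toy model of chunk-aligned bulk merging; an illustration, not Lucene code. */
class BulkMergeSketch {
    /** Hypothetical flush threshold: the writer flushes after this many buffered docs. */
    static final int DOCS_PER_CHUNK = 4;

    /**
     * Merges the segments in order and returns how many chunks were bulk copied.
     * Each segment is described by the doc counts of its chunks.
     */
    static int countBulkCopiedChunks(List<int[]> segments) {
        int pendingDocs = 0; // docs buffered in the writer, not yet flushed
        int bulkCopied = 0;
        for (int[] chunks : segments) {
            int maxDoc = 0;
            for (int chunkDocs : chunks) {
                maxDoc += chunkDocs;
            }
            int docBase = 0;
            for (int chunkDocs : chunks) {
                // Same shape as the condition quoted above: the last chunk is never bulk merged.
                boolean notLastChunk = docBase + chunkDocs < maxDoc;
                if (pendingDocs == 0 && notLastChunk) {
                    bulkCopied++; // writer is aligned, so the chunk can be copied as-is
                } else {
                    // Default path: add docs one by one, flushing whenever the buffer fills up.
                    for (int i = 0; i < chunkDocs; i++) {
                        pendingDocs++;
                        if (pendingDocs == DOCS_PER_CHUNK) {
                            pendingDocs = 0;
                        }
                    }
                }
                docBase += chunkDocs;
            }
        }
        return bulkCopied;
    }
}
```

With chunk sizes {4,4,4,2} followed by {4,4,4,4}, the short last chunk of the 
first segment leaves 2 pending docs, so no chunk of the second segment is bulk 
copied (3 in total). If the first segment ends in a full chunk instead, the 
count is 6.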

I replaced it with the same algorithm from stored fields.

I turned on vectors (5 fields), indexing, and stored fields, with small 
documents (10 fields):

{noformat}
trunk:
timeIndexing=601139
timeMerging=49046

SM 0 [2015-01-17 03:26:30.208; main]: 3765 msec to merge vectors [9490360 docs]
SM 0 [2015-01-17 03:26:04.928; main]: 5508 msec to merge vectors [7300730 docs]
SM 0 [2015-01-17 03:25:43.179; main]: 1261 msec to merge vectors [189430 docs]

patch:
timeIndexing=422731
timeMerging=43832

SM 0 [2015-01-17 03:37:15.480; main]: 2183 msec to merge vectors [9490360 docs]
SM 0 [2015-01-17 03:36:50.698; main]: 1492 msec to merge vectors [7300730 docs]
SM 0 [2015-01-17 03:36:32.620; main]: 27 msec to merge vectors [189430 docs]
{noformat}

You can see the forceMerge time is not much better, because 2/3 of the 
collection is in the first segment, but overall indexing improves because the 
change also speeds up other merges (see the 189430-doc one).
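
For anyone who wants the ratios rather than the raw numbers, a small sketch 
that computes them from the figures in the log above. The class and method 
names are hypothetical; only the input values come from the benchmark output.

```java
/** Computes relative improvements from the benchmark numbers quoted above. */
class MergeTimingSketch {
    /** Percentage by which {@code after} is lower than {@code before}. */
    static double percentReduction(long before, long after) {
        return 100.0 * (before - after) / before;
    }

    /** Ratio of {@code before} to {@code after}, i.e. how many times faster. */
    static double speedup(long before, long after) {
        return (double) before / after;
    }

    public static void main(String[] args) {
        // trunk vs. patch, values copied from the output above
        System.out.printf("timeIndexing: %.1f%% lower%n", percentReduction(601139, 422731));
        System.out.printf("timeMerging:  %.1f%% lower%n", percentReduction(49046, 43832));
        System.out.printf("9490360 docs: %.1fx faster%n", speedup(3765, 2183));
        System.out.printf("7300730 docs: %.1fx faster%n", speedup(5508, 1492));
        System.out.printf("189430 docs:  %.1fx faster%n", speedup(1261, 27));
    }
}
```

This works out to roughly a 30% drop in timeIndexing and an 11% drop in 
timeMerging, with the 189430-doc merge about 47x faster.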


Anyway, I think it's better overall, at least for simplicity and the 
additional safety mechanism.

> disable current term vectors bulk merge
> ---------------------------------------
>
>                 Key: LUCENE-5647
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5647
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>             Fix For: 4.9, Trunk
>
>         Attachments: LUCENE-5647.patch
>
>
> See LUCENE-5646 for the motivation.
> Long term it might be nice to add algorithm #2 to term vectors if it's 
> possible and not too complex.
> But for now, I think we should avoid such rare optimizations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
