[ https://issues.apache.org/jira/browse/LUCENE-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-5797:
--------------------------------

    Attachment: LUCENE-5797.patch

Attached is a patch that speeds it up, but I'm not happy with the complexity.

I benchmarked by indexing geonames with every field as an indexed TextField with 
norms (160 segments), then timed merging all of these segments:

Trunk:
{noformat}
SM 0 [Mon Jun 30 14:19:47 EDT 2014; Lucene Merge Thread #1]: 1533 msec to merge norms [1340190 docs]
SM 0 [Mon Jun 30 14:19:59 EDT 2014; Lucene Merge Thread #1]: 1603 msec to merge norms [1509620 docs]
SM 0 [Mon Jun 30 14:20:11 EDT 2014; Lucene Merge Thread #0]: 2432 msec to merge norms [1380799 docs]
SM 0 [Mon Jun 30 14:20:13 EDT 2014; Lucene Merge Thread #1]: 3043 msec to merge norms [1601868 docs]
SM 0 [Mon Jun 30 14:20:25 EDT 2014; Lucene Merge Thread #0]: 1785 msec to merge norms [1819675 docs]
SM 0 [Mon Jun 30 14:21:19 EDT 2014; Lucene Merge Thread #0]: 8900 msec to merge norms [8330469 docs]
{noformat}

Patched:
{noformat}
SM 0 [Mon Jun 30 14:22:15 EDT 2014; Lucene Merge Thread #1]: 1119 msec to merge norms [1340190 docs]
SM 0 [Mon Jun 30 14:22:26 EDT 2014; Lucene Merge Thread #1]: 1214 msec to merge norms [1509620 docs]
SM 0 [Mon Jun 30 14:22:37 EDT 2014; Lucene Merge Thread #0]: 1110 msec to merge norms [1380799 docs]
SM 0 [Mon Jun 30 14:22:38 EDT 2014; Lucene Merge Thread #1]: 1284 msec to merge norms [1601868 docs]
SM 0 [Mon Jun 30 14:22:49 EDT 2014; Lucene Merge Thread #0]: 1335 msec to merge norms [1819675 docs]
SM 0 [Mon Jun 30 14:23:41 EDT 2014; Lucene Merge Thread #0]: 6834 msec to merge norms [8330469 docs]
{noformat}

Comparing the other values (e.g., time to merge postings and stored fields) between 
the two runs, there wasn't much noise, so I think removing all the hashing is what helps.
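As a rough illustration of hash-free tracking (this is only a sketch under my own assumptions, not the attached patch): when every norm fits in a single byte there are at most 256 possible values, so presence can be recorded in a plain 256-slot array indexed by the value itself, and no value is ever hashed.

```java
// Hypothetical sketch, not the attached patch: track distinct single-byte
// values with a boolean[256] instead of a HashSet<Long>.
public class ByteNormsUnique {
  // Returns the number of distinct values, or -1 if some value is out of byte
  // range (in which case a general hash-based path would still be needed).
  static int countUnique(long[] values) {
    boolean[] seen = new boolean[256];
    int count = 0;
    for (long v : values) {
      if (v < Byte.MIN_VALUE || v > Byte.MAX_VALUE) {
        return -1;
      }
      int slot = (int) v - Byte.MIN_VALUE; // shift [-128,127] into [0,255]
      if (!seen[slot]) {
        seen[slot] = true;
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    System.out.println(countUnique(new long[] {7, 3, 7, 5, 3})); // 3
  }
}
```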

> improve speed of norms merging
> ------------------------------
>
>                 Key: LUCENE-5797
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5797
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-5797.patch
>
>
> Today we use the following procedure:
> * track a HashSet<Long> of uniqueValues until it exceeds 256 unique values.
> * convert the set to an array, sort it, and assign an ordinal to each value.
> * create an encoder map (HashMap<Long,Integer>) to encode each value.
> This results in each value being hashed twice... but the vast majority of the 
> time people will just be using single-byte norms, and a simple array is enough 
> for that range.
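The double-hashing described in the issue can be sketched as follows (a hypothetical reconstruction for illustration, not Lucene's actual code; the class and method names are made up):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical reconstruction of the described procedure: each value is hashed
// once into the set, and each unique value is hashed again for the encoder map.
public class NormsOrdinalSketch {
  static final int MAX_UNIQUE = 256;

  // Returns a value -> ordinal map, or null once too many unique values are seen.
  static Map<Long, Integer> buildEncoder(long[] values) {
    HashSet<Long> uniqueValues = new HashSet<>();    // first hash per value
    for (long v : values) {
      uniqueValues.add(v);
      if (uniqueValues.size() > MAX_UNIQUE) {
        return null;                                 // too many values: general path
      }
    }
    List<Long> sorted = new ArrayList<>(uniqueValues);
    Collections.sort(sorted);
    Map<Long, Integer> encoder = new HashMap<>();
    for (int ord = 0; ord < sorted.size(); ord++) {
      encoder.put(sorted.get(ord), ord);             // second hash per unique value
    }
    return encoder;
  }

  public static void main(String[] args) {
    Map<Long, Integer> enc = buildEncoder(new long[] {7, 3, 7, 5, 3});
    System.out.println(enc.get(3L) + " " + enc.get(5L) + " " + enc.get(7L)); // 0 1 2
  }
}
```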



--
This message was sent by Atlassian JIRA
(v6.2#6252)
