[
https://issues.apache.org/jira/browse/LUCENE-5797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Robert Muir updated LUCENE-5797:
--------------------------------
Attachment: LUCENE-5797.patch
Attached is a patch that speeds it up, but I'm not happy with the complexity.
I benchmarked by indexing geonames with every field as an indexed TextField with
norms (160 segments), then timed merging all of these. Before the patch:
{noformat}
SM 0 [Mon Jun 30 14:19:47 EDT 2014; Lucene Merge Thread #1]: 1533 msec to merge
norms [1340190 docs]
SM 0 [Mon Jun 30 14:19:59 EDT 2014; Lucene Merge Thread #1]: 1603 msec to merge
norms [1509620 docs]
SM 0 [Mon Jun 30 14:20:11 EDT 2014; Lucene Merge Thread #0]: 2432 msec to merge
norms [1380799 docs]
SM 0 [Mon Jun 30 14:20:13 EDT 2014; Lucene Merge Thread #1]: 3043 msec to merge
norms [1601868 docs]
SM 0 [Mon Jun 30 14:20:25 EDT 2014; Lucene Merge Thread #0]: 1785 msec to merge
norms [1819675 docs]
SM 0 [Mon Jun 30 14:21:19 EDT 2014; Lucene Merge Thread #0]: 8900 msec to merge
norms [8330469 docs]
{noformat}
With the patch:
{noformat}
SM 0 [Mon Jun 30 14:22:15 EDT 2014; Lucene Merge Thread #1]: 1119 msec to merge
norms [1340190 docs]
SM 0 [Mon Jun 30 14:22:26 EDT 2014; Lucene Merge Thread #1]: 1214 msec to merge
norms [1509620 docs]
SM 0 [Mon Jun 30 14:22:37 EDT 2014; Lucene Merge Thread #0]: 1110 msec to merge
norms [1380799 docs]
SM 0 [Mon Jun 30 14:22:38 EDT 2014; Lucene Merge Thread #1]: 1284 msec to merge
norms [1601868 docs]
SM 0 [Mon Jun 30 14:22:49 EDT 2014; Lucene Merge Thread #0]: 1335 msec to merge
norms [1819675 docs]
SM 0 [Mon Jun 30 14:23:41 EDT 2014; Lucene Merge Thread #0]: 6834 msec to merge
norms [8330469 docs]
{noformat}
Comparing the other values (e.g. time to merge postings/stored fields) between
the two runs, there wasn't much noise, so I think removing all the hashing helps.
> improve speed of norms merging
> ------------------------------
>
> Key: LUCENE-5797
> URL: https://issues.apache.org/jira/browse/LUCENE-5797
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Robert Muir
> Attachments: LUCENE-5797.patch
>
>
> Today we use the following procedure:
> * track a HashSet<Long> of uniqueValues until it exceeds 256 unique values.
> * convert to an array, sort it, and assign an ordinal to each value.
> * create an encoder map (HashMap<Long,Integer>) to encode each value.
> This results in each value being hashed twice, but the vast majority of the
> time people will just be using single-byte norms, and a simple array is enough
> for that range.
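The procedure quoted above hashes each value once into the HashSet and a second
time into the encoder HashMap. A minimal sketch of the array-based alternative
for the single-byte case (hypothetical class and method names, not the attached
patch):

```java
// Hypothetical sketch (NOT the actual LUCENE-5797 patch): while every norm
// value fits in a single byte, track uniqueness with a plain 256-slot array
// instead of a HashSet, so the common case never hashes at all.
public class SingleByteNormsTracker {

    // seen[v + 128] is set once the byte value v has been observed
    private final boolean[] seen = new boolean[256];
    private int uniqueCount = 0;

    /**
     * Record a value. Returns false if it does not fit in a byte, in which
     * case the caller would fall back to the existing hashing path.
     */
    public boolean add(long value) {
        if (value < Byte.MIN_VALUE || value > Byte.MAX_VALUE) {
            return false;
        }
        int slot = (int) value + 128;
        if (!seen[slot]) {
            seen[slot] = true;
            uniqueCount++;
        }
        return true;
    }

    /**
     * Ordinal of a previously seen value: the array is indexed in sorted
     * order, so the ordinal is just the count of smaller seen values -- no
     * sort, no encoder map, no second hash lookup.
     */
    public int ordinal(long value) {
        int ord = 0;
        for (int slot = 0; slot < (int) value + 128; slot++) {
            if (seen[slot]) {
                ord++;
            }
        }
        return ord;
    }

    public int uniqueCount() {
        return uniqueCount;
    }
}
```

Values outside the byte range would still need the hash-based path; the point
is only that the common single-byte case avoids hashing entirely.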
--
This message was sent by Atlassian JIRA
(v6.2#6252)