Michael McCandless created LUCENE-5609:
------------------------------------------
Summary: Should we revisit the default numeric precision step?
Key: LUCENE-5609
URL: https://issues.apache.org/jira/browse/LUCENE-5609
Project: Lucene - Core
Issue Type: Improvement
Components: core/search
Reporter: Michael McCandless
Fix For: 4.9, 5.0
Right now it's 4 for both 8-byte (long/double) and 4-byte (int/float)
numeric fields, but this is a pretty big hit on indexing speed and
disk usage, especially for tiny documents, because it creates many
terms for each value (8 for int/float, 16 for long/double).
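To make the term counts concrete, here is a minimal sketch of the arithmetic. It assumes the usual trie scheme where a value is indexed once per shift, starting at 0 and growing by the precision step while the shift is below the value's bit width. The function names are illustrative only and are not Lucene APIs.

```python
def trie_shifts(value_bits, precision_step):
    """Shifts at which one value is indexed: 0, step, 2*step, ... below value_bits."""
    return list(range(0, value_bits, precision_step))


def terms_per_value(value_bits, precision_step):
    """Number of indexed terms for a single numeric value."""
    return len(trie_shifts(value_bits, precision_step))


if __name__ == "__main__":
    for bits in (32, 64):          # int/float vs. long/double
        for step in (4, 8, 16):    # the precision steps under discussion
            print(f"{bits}-bit value, precStep={step}: "
                  f"{terms_per_value(bits, step)} terms")
```

Under this assumption, raising the step from 4 to 16 cuts a long/double value from 16 terms to 4, and an int/float value from 8 terms to 2.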
Since we originally set these defaults, a lot has changed... e.g. we
now rewrite MTQs per-segment, we have a faster (BlockTree) terms dict,
a faster postings format, etc.
Index size is important because it limits how much of the index will
be hot (fit in the OS's IO cache). And more apps are using Lucene for
tiny docs where the overhead of individual fields is sizable.
I used the Geonames corpus to run a simple benchmark (all sources are
committed to luceneutil). It has 8.6 M tiny docs, each with 23 fields,
with these numeric fields:
* lat/lng (double)
* modified time, elevation, population (long)
* dem (int)
I tested precision steps of 4, 8 and 16:
{noformat}
indexing:
PrecStep  Size       IndexTime
4         1812.7 MB  651.4 sec
8         1203.0 MB  443.2 sec
16         894.3 MB  361.6 sec

searching:
Field      PrecStep  QueryTime  TermCount
geoNameID  4         2872.5 ms      20306
geoNameID  8         2903.3 ms     104856
geoNameID  16        3371.9 ms    5871427
latitude   4         2160.1 ms      36805
latitude   8         2249.0 ms     240655
latitude   16        2725.9 ms    4649273
modified   4         2038.3 ms      13311
modified   8         2029.6 ms      58344
modified   16        2060.5 ms      77763
longitude  4         3468.5 ms      33818
longitude  8         3629.9 ms     214863
longitude  16        4060.9 ms    4532032
{noformat}
Indexing used 1 thread (for identical index structure across runs).
The query time is the time to run 100 random ranges for that field,
averaged over 20 iterations. TermCount is the total number of terms
the MTQ rewrote to, summed across all 100 queries and all segments.
It gets higher, as expected, as precStep gets higher, but the search
time is not that heavily impacted: negligible going from 4 to 8, and
then some impact from 8 to 16.
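For readers who want the relative changes rather than the raw numbers, the following sketch recomputes them from the figures reported in this issue, using precStep 4 as the baseline. The variable and function names are illustrative, and the script only does arithmetic on the table values; it does not rerun the benchmark.

```python
# Figures copied from the benchmark tables in this issue.
indexing = {  # precStep -> (index size in MB, index time in sec)
    4: (1812.7, 651.4),
    8: (1203.0, 443.2),
    16: (894.3, 361.6),
}
query_ms = {  # field -> {precStep: query time in ms}
    "geoNameID": {4: 2872.5, 8: 2903.3, 16: 3371.9},
    "latitude": {4: 2160.1, 8: 2249.0, 16: 2725.9},
    "modified": {4: 2038.3, 8: 2029.6, 16: 2060.5},
    "longitude": {4: 3468.5, 8: 3629.9, 16: 4060.9},
}


def change_pct(baseline, value):
    """Signed percentage change of value relative to baseline."""
    return 100.0 * (value - baseline) / baseline


base_size, base_time = indexing[4]
# precStep -> (size change %, index time change %), negative means smaller/faster
index_change = {
    step: (change_pct(base_size, size), change_pct(base_time, secs))
    for step, (size, secs) in indexing.items()
}
# field -> {precStep: query time change %}, positive means slower
query_change = {
    field: {step: change_pct(times[4], t) for step, t in times.items()}
    for field, times in query_ms.items()
}

if __name__ == "__main__":
    for step, (size_pct, time_pct) in index_change.items():
        print(f"precStep={step}: size {size_pct:+.1f}%, index time {time_pct:+.1f}%")
    for field, changes in query_change.items():
        for step, pct in changes.items():
            print(f"{field} precStep={step}: query time {pct:+.1f}%")
```

Expressing everything relative to precStep 4 puts the index size and indexing time savings side by side with the query time cost for each candidate default.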
Maybe we should increase the int/float default precision step to 8 and
long/double to 16? Or both to 16?