[jira] [Created] (LUCENE-5609) Should we revisit the default numeric precision step?

Michael McCandless (JIRA) Wed, 16 Apr 2014 12:54:56 -0700

Michael McCandless created LUCENE-5609:
------------------------------------------


             Summary: Should we revisit the default numeric precision step?
                 Key: LUCENE-5609
                 URL: https://issues.apache.org/jira/browse/LUCENE-5609
             Project: Lucene - Core
          Issue Type: Improvement
          Components: core/search
            Reporter: Michael McCandless
             Fix For: 4.9, 5.0


Right now the default is 4 for both 8-byte (long/double) and 4-byte
(int/float) numeric fields, but this is a pretty big hit on indexing
speed and disk usage, especially for tiny documents, because it creates
many terms for each value (8 for a 4-byte value, 16 for an 8-byte value).
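
As a rough illustration of where the 8 and 16 come from, here is a minimal
sketch, assuming the trie encoding indexes one term per precision-step shift
of the value (shift 0, step, 2*step, ... while the shift is below the value's
bit width). The function name `terms_per_value` is made up for this example
and is not a Lucene API.

```python
def terms_per_value(value_bits: int, precision_step: int) -> int:
    """Terms indexed for one numeric value: ceil(value_bits / precision_step)."""
    if value_bits <= 0 or precision_step <= 0:
        raise ValueError("value_bits and precision_step must be positive")
    # Integer ceiling division: one term per shift 0, step, 2*step, ... < value_bits
    return (value_bits + precision_step - 1) // precision_step


# Compare the three precision steps discussed in this issue.
for step in (4, 8, 16):
    print(
        f"precStep={step:2d}  "
        f"int/float terms={terms_per_value(32, step)}  "
        f"long/double terms={terms_per_value(64, step)}"
    )
```

Under that assumption, raising the step from 4 to 16 cuts the per-value term
count by a factor of four for both field widths.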

Since we originally set these defaults, a lot has changed: e.g. we
now rewrite MTQs (MultiTermQuery) per-segment, we have a faster
(BlockTree) terms dict, a faster postings format, etc.

Index size is important because it limits how much of the index will
be hot (fit in the OS's IO cache).  And more apps are using Lucene for
tiny docs where the overhead of individual fields is sizable.

I used the Geonames corpus to run a simple benchmark (all sources are
committed to luceneutil). It has 8.6 M tiny docs, each with 23 fields,
with these numeric fields:

  * lat/lng (double)
  * modified time, elevation, population (long)
  * dem (int)

I tested 4, 8 and 16 precision steps:

{noformat}
indexing:

PrecStep        Size        IndexTime
       4   1812.7 MB        651.4 sec
       8   1203.0 MB        443.2 sec
      16    894.3 MB        361.6 sec


searching:

     Field  PrecStep   QueryTime   TermCount
 geoNameID         4   2872.5 ms       20306
 geoNameID         8   2903.3 ms      104856
 geoNameID        16   3371.9 ms     5871427
  latitude         4   2160.1 ms       36805
  latitude         8   2249.0 ms      240655
  latitude        16   2725.9 ms     4649273
  modified         4   2038.3 ms       13311
  modified         8   2029.6 ms       58344
  modified        16   2060.5 ms       77763
 longitude         4   3468.5 ms       33818
 longitude         8   3629.9 ms      214863
 longitude        16   4060.9 ms     4532032
{noformat}

Indexing used 1 thread, so that every run produces an identical index
structure.
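
To put the indexing numbers in relative terms, here is a small sketch that
computes the savings against the precStep=4 baseline. The figures are copied
from the indexing table above; the helper name `savings_vs_baseline` is
hypothetical and not part of luceneutil.

```python
# (precision step, index size in MB, single-threaded index time in seconds),
# copied from the indexing table in this issue.
INDEXING = [
    (4, 1812.7, 651.4),
    (8, 1203.0, 443.2),
    (16, 894.3, 361.6),
]


def savings_vs_baseline(rows):
    """Return {step: (size saved %, time saved %)} relative to the first row."""
    _, base_size, base_time = rows[0]
    result = {}
    for step, size_mb, seconds in rows[1:]:
        size_saved = 100.0 * (1.0 - size_mb / base_size)
        time_saved = 100.0 * (1.0 - seconds / base_time)
        result[step] = (size_saved, time_saved)
    return result


for step, (size_saved, time_saved) in savings_vs_baseline(INDEXING).items():
    print(f"precStep={step:2d}: index {size_saved:.1f}% smaller, "
          f"indexing {time_saved:.1f}% faster than precStep=4")
```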

The query time is the time to run 100 random range queries on that
field, averaged over 20 iterations.  TermCount is the total number of
terms the MTQ rewrote to, summed across all 100 queries and all
segments.  It gets higher as precStep gets higher, as expected, but the
search time is not that heavily impacted: negligible going from 4 to 8
(under 5% for every field), and then some impact from 8 to 16.
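
The same comparison can be made per field for the search side. The sketch
below uses the numbers from the searching table above; `slowdown_pct` and
`term_growth` are names invented for this example.

```python
# field -> {precision step: (query time in ms, term count)},
# copied from the searching table in this issue.
SEARCH = {
    "geoNameID": {4: (2872.5, 20306), 8: (2903.3, 104856), 16: (3371.9, 5871427)},
    "latitude":  {4: (2160.1, 36805), 8: (2249.0, 240655), 16: (2725.9, 4649273)},
    "modified":  {4: (2038.3, 13311), 8: (2029.6, 58344),  16: (2060.5, 77763)},
    "longitude": {4: (3468.5, 33818), 8: (3629.9, 214863), 16: (4060.9, 4532032)},
}


def slowdown_pct(field, from_step, to_step):
    """Percent change in query time when moving between two precision steps."""
    before = SEARCH[field][from_step][0]
    after = SEARCH[field][to_step][0]
    return 100.0 * (after - before) / before


def term_growth(field, from_step, to_step):
    """Factor by which the rewritten term count grows between two steps."""
    return SEARCH[field][to_step][1] / SEARCH[field][from_step][1]


for field in SEARCH:
    print(f"{field:>10}  4->8: {slowdown_pct(field, 4, 8):+5.1f}%  "
          f"8->16: {slowdown_pct(field, 8, 16):+5.1f}%  "
          f"terms 4->16: x{term_growth(field, 4, 16):.0f}")
```

The term counts grow far faster than the query times, which is the point of
the comparison: visiting many more terms costs comparatively little here.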

Maybe we should increase the int/float default precision step to 8 and
long/double to 16?  Or both to 16?



