[ https://issues.apache.org/jira/browse/LUCENE-5609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13974862#comment-13974862 ]
Uwe Schindler commented on LUCENE-5609:
---------------------------------------

I would use precStep 8 for ints (Solr already does this by default). As we need a multiple of 4: Mike, can you check the number of terms and index size for precStep=12? 16 is way too big, in my opinion and in my tests in the past. The overhead is not as big as you might think; the problem only shows up if you have an index consisting solely of numerics. To have a real comparison, you should use something like Wikipedia, maybe add something like the lastmod date as a long field, and then test.

Also, we have lots of queries with up to 8 different numeric fields in parallel (half-open ranges). For those there is still a huge improvement with lower precision steps. I found that 8 is best; 16 hurts very much if you query multiple numeric fields ANDed/ORed together.

Also, not everybody has the index completely in memory! If you have a pure in-memory index, you could theoretically also disable tries completely :-) The numeric fields are made for indexes with lots of disk/SSD IO (because you have many numeric fields combined with simple full-text queries and some facets). So please also check complex queries on really large indexes, not just simple range filters on small indexes with solely numeric fields.

> Should we revisit the default numeric precision step?
> -----------------------------------------------------
>
>                 Key: LUCENE-5609
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5609
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>            Reporter: Michael McCandless
>             Fix For: 4.9, 5.0
>
>
> Right now it's 4, for both 8 byte (long/double) and 4 byte (int/float)
> numeric fields, but this is a pretty big hit on indexing speed and
> disk usage, especially for tiny documents, because it creates many (8
> or 16) terms for each value.
> Since we originally set these defaults, a lot has changed... e.g. we
> now rewrite MTQs per-segment, we have a faster (BlockTree) terms dict,
> a faster postings format, etc.
> Index size is important because it limits how much of the index will
> be hot (fit in the OS's IO cache). And more apps are using Lucene for
> tiny docs where the overhead of individual fields is sizable.
> I used the Geonames corpus to run a simple benchmark (all sources are
> committed to luceneutil). It has 8.6 M tiny docs, each with 23 fields,
> with these numeric fields:
> * lat/lng (double)
> * modified time, elevation, population (long)
> * dem (int)
> I tested 4, 8 and 16 precision steps:
> {noformat}
> indexing:
>
> PrecStep       Size   IndexTime
>        4  1812.7 MB   651.4 sec
>        8  1203.0 MB   443.2 sec
>       16   894.3 MB   361.6 sec
>
> searching:
>
>     Field  PrecStep  QueryTime  TermCount
> geoNameID         4  2872.5 ms      20306
> geoNameID         8  2903.3 ms     104856
> geoNameID        16  3371.9 ms    5871427
>  latitude         4  2160.1 ms      36805
>  latitude         8  2249.0 ms     240655
>  latitude        16  2725.9 ms    4649273
>  modified         4  2038.3 ms      13311
>  modified         8  2029.6 ms      58344
>  modified        16  2060.5 ms      77763
> longitude         4  3468.5 ms      33818
> longitude         8  3629.9 ms     214863
> longitude        16  4060.9 ms    4532032
> {noformat}
> Index time is with 1 thread (for identical index structure).
> The query time is the time to run 100 random ranges for that field,
> averaged over 20 iterations. TermCount is the total number of terms
> the MTQ rewrote to across all 100 queries / segments, and it gets
> higher as expected as precStep gets higher, but the search time is not
> that heavily impacted ... negligible going from 4 to 8, and then some
> impact from 8 to 16.
> Maybe we should increase the int/float default precision step to 8 and
> long/double to 16? Or both to 16?
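
For readers not familiar with where precisionStep actually plugs in, below is a minimal sketch against the Lucene 4.x API under discussion. The field name "modified" and the choice of 8 are illustrative only (borrowed from the thread), not a recommendation; the key point is that the precisionStep set on the FieldType at index time must be the same one passed to NumericRangeQuery at search time, and a larger step produces fewer trie terms per value but forces the range query to visit more terms.

{noformat}
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.LongField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class PrecisionStepSketch {
  public static void main(String[] args) throws Exception {
    // Index-time: override the current default precisionStep (4) with 8,
    // so fewer trie terms are generated per indexed value.
    FieldType longType = new FieldType(LongField.TYPE_NOT_STORED);
    longType.setNumericPrecisionStep(8);
    longType.freeze();

    RAMDirectory dir = new RAMDirectory();
    IndexWriterConfig iwc = new IndexWriterConfig(
        Version.LUCENE_48, new StandardAnalyzer(Version.LUCENE_48));
    try (IndexWriter writer = new IndexWriter(dir, iwc)) {
      Document doc = new Document();
      doc.add(new LongField("modified", System.currentTimeMillis(), longType));
      writer.addDocument(doc);
    }

    // Search-time: the precisionStep passed to NumericRangeQuery must match
    // the one used at index time, otherwise the trie terms are not found.
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      NumericRangeQuery<Long> q = NumericRangeQuery.newLongRange(
          "modified", 8, 0L, Long.MAX_VALUE, true, true);
      TopDocs hits = searcher.search(q, 10);
      System.out.println("hits: " + hits.totalHits);
    }
  }
}
{noformat}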