[ 
https://issues.apache.org/jira/browse/LUCENE-5609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13975183#comment-13975183
 ] 

Uwe Schindler edited comment on LUCENE-5609 at 4/20/14 4:45 PM:
----------------------------------------------------------------

bq. Have a look at LUCENE-1470, even 2 was considered then.

That was not really usable even at that time! The improvement compared to 
4 was zero; it was even worse, because the term dictionary got larger, which 
had an impact in 2.x and 3.x. At that time I was always using 8 as precisionStep 
for longs and ints. The same applied to Solr; Lucene was the only one using 4 
as the default, and ElasticSearch was cloning Lucene's defaults.

I would really prefer to use 8 for both ints and longs. Going from 8 to 16 
increases the number of terms visited by a range query immensely, while the 
index size difference between 8 and 16 is not really a problem. Experience has 
also shown me that, because of the way floats/doubles are encoded, a precision 
step of 8 is really good for longs: in most cases parts of the value (like the 
exponent) never change, so there is exactly one indexed term for them.
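
As a back-of-the-envelope illustration (self-contained Java; the sortable-long transform is reimplemented here rather than calling Lucene's NumericUtils, and the values are arbitrary examples): two doubles in the same magnitude range share their sign and exponent bits, so the high-order prefix term is the same for both:

```java
public class SortablePrefix {
    // Map a double to a long that sorts in numeric order
    // (same idea as NumericUtils.doubleToSortableLong).
    static long doubleToSortableLong(double val) {
        long bits = Double.doubleToLongBits(val);
        // Flip the value bits for negatives so the longs sort numerically.
        if (bits < 0) bits ^= 0x7fffffffffffffffL;
        return bits;
    }

    public static void main(String[] args) {
        long a = doubleToSortableLong(52.3601);   // two nearby coordinates
        long b = doubleToSortableLong(53.5511);
        // With precisionStep=8, the term at shift 56 keeps only the top 8 bits,
        // which hold the sign and most of the exponent - identical for both
        // values, so only one distinct term is indexed at that level.
        System.out.println((a >>> 56) == (b >>> 56));
    }
}
```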

With a precision step of 16 I would imagine the differences between 16 and 64 
would be negligible, too :-) The main reason for lower precision steps are 
indexes where the values are equally distributed. For values clustered around 
a few numbers, the precisionStep is irrelevant! In most cases, because of the 
way the encoding works, the indexed value is constant for the larger shifts, 
so you get one or two terms that hit all documents and are never used by the 
range query.
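
For reference, the number of terms indexed per value is simply the bit width divided by the precision step, rounded up - one term per shift level (a standalone sketch, not Lucene code):

```java
public class TermCount {
    // How many terms a single value produces at a given precisionStep.
    static int termsPerValue(int valueBits, int precisionStep) {
        return (valueBits + precisionStep - 1) / precisionStep;  // ceiling division
    }

    public static void main(String[] args) {
        System.out.println(termsPerValue(64, 4));   // 16 terms per long
        System.out.println(termsPerValue(64, 8));   // 8 terms per long
        System.out.println(termsPerValue(64, 16));  // 4 terms per long
    }
}
```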

So before changing the default, I would suggest to have a test with an index 
that has equally distributed numbers of the full 64 bit range.

bq. I think 11 is better than 12

...because the last term is better used. The number of terms indexed is the 
same for 11 and 12 (6*11=66 and 6*12=72 both cover 64 bits, but 5*12=60 is 
too small). Unfortunately, 11 is not a multiple of 4, so it would not be 
backwards compatible.
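
The arithmetic above, spelled out (standalone sketch): both steps need ceil(64/step) = 6 levels, but the topmost term carries the leftover 64 - 5*step bits, so step 11 leaves 9 bits for the last term while step 12 leaves only 4:

```java
public class LastTermBits {
    // Number of shift levels needed to cover a 64-bit long.
    static int levels(int step)       { return (64 + step - 1) / step; }
    // Bits carried by the topmost (least precise) term.
    static int lastTermBits(int step) { return 64 - (levels(step) - 1) * step; }

    public static void main(String[] args) {
        System.out.println(levels(11) + " levels, last term " + lastTermBits(11) + " bits");
        System.out.println(levels(12) + " levels, last term " + lastTermBits(12) + " bits");
    }
}
```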

I think the main problem of this issue is that we only have *one* default. 
Somebody who never runs any range queries does not need the additional terms 
at all. That's the main problem. Solr is better here, as it provides 2 
predefined field types, but Lucene only has one - and that is the bug.

So my proposal: provide a 2nd field type as a 2nd default, with proper 
documentation suggesting it to users who only want to index numeric 
identifiers, or non-docvalues fields they want to sort on.

And second, we should do LUCENE-5605 - I started on it last week, but was 
interrupted by _NativeFSIndexCorrumpter_ :-)  The problem is the precisionStep 
altogether! We should make it an implementation detail: when constructing an 
NRQ, you should not need to pass it. Because of this I opened LUCENE-5605, so 
anybody creating an NRQ/NRF passes the FieldType to the NRQ ctor instead of an 
arbitrary number. That ensures people use the same settings for indexing and 
querying.
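
A hypothetical sketch of that direction (all class names invented for illustration - this is not the actual Lucene API): the precision step lives only in the field type, and the query constructor takes the field type, so indexing and querying can never disagree:

```java
public class FieldTypeSketch {
    // Toy stand-in for a numeric field type; the precision step is an
    // implementation detail hidden inside it.
    static final class NumericFieldType {
        final int precisionStep;
        NumericFieldType(int precisionStep) { this.precisionStep = precisionStep; }
    }

    // Toy stand-in for a numeric range query: the ctor takes the field
    // type, not a raw precision-step number.
    static final class RangeQuerySketch {
        final int precisionStep;
        RangeQuerySketch(NumericFieldType type) { this.precisionStep = type.precisionStep; }
    }

    public static void main(String[] args) {
        NumericFieldType rangeType = new NumericFieldType(8);
        RangeQuerySketch q = new RangeQuerySketch(rangeType);
        System.out.println(q.precisionStep);  // taken from the field type
    }
}
```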

Together with this, we should provide 2 predefined field types per data type and 
remove the constant from NumericUtils completely. The 2 field types per data 
type might be named something like DEFAULT_INT_FOR_RANGEQUERY_FIELDTYPE and 
DEFAULT_INT_OTHERWISE_FIELDTYPE (please choose better names and javadocs). And 
we should make 8 the new default, which is fully backwards compatible. And hide 
the precision step completely! 16 is really too large for lots of queries, and 
the difference in index size is negligible, unless you have a purely numeric 
index (in which case you should use an RDBMS instead of a Lucene index to query 
your data :-) !). Indexing time is also, as Mike discovered, not a problem at 
all: if people don't reuse the IntField instance, it is always equally slow, 
because the TokenStream has to be recreated for every number. The number of 
terms is not the issue at all, sorry!

About ElasticSearch: unfortunately the schemaless mode of ElasticSearch always 
uses 4 as precStep when it detects a numeric or date type. ES should change 
this, but maybe with a bit more intelligent "guessing". E.g., if you index the 
"_id" field as an integer, it should automatically use an infinite 
(DEFAULT_INT_OTHERWISE_TYPE) precStep - nobody runs range queries on the 
"_id" field. For all standard numeric fields it should use precStep=8.



> Should we revisit the default numeric precision step?
> -----------------------------------------------------
>
>                 Key: LUCENE-5609
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5609
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>            Reporter: Michael McCandless
>             Fix For: 4.9, 5.0
>
>         Attachments: LUCENE-5609.patch
>
>
> Right now it's 4, for both 8 (long/double) and 4 byte (int/float)
> numeric fields, but this is a pretty big hit on indexing speed and
> disk usage, especially for tiny documents, because it creates many (8
> or 16) terms for each value.
> Since we originally set these defaults, a lot has changed... e.g. we
> now rewrite MTQs per-segment, we have a faster (BlockTree) terms dict,
> a faster postings format, etc.
> Index size is important because it limits how much of the index will
> be hot (fit in the OS's IO cache).  And more apps are using Lucene for
> tiny docs where the overhead of individual fields is sizable.
> I used the Geonames corpus to run a simple benchmark (all sources are
> committed to luceneutil). It has 8.6 M tiny docs, each with 23 fields,
> with these numeric fields:
>   * lat/lng (double)
>   * modified time, elevation, population (long)
>   * dem (int)
> I tested 4, 8 and 16 precision steps:
> {noformat}
> indexing:
> PrecStep        Size        IndexTime
>        4   1812.7 MB        651.4 sec
>        8   1203.0 MB        443.2 sec
>       16    894.3 MB        361.6 sec
> searching:
>      Field  PrecStep   QueryTime   TermCount
>  geoNameID         4   2872.5 ms       20306
>  geoNameID         8   2903.3 ms      104856
>  geoNameID        16   3371.9 ms     5871427
>   latitude         4   2160.1 ms       36805
>   latitude         8   2249.0 ms      240655
>   latitude        16   2725.9 ms     4649273
>   modified         4   2038.3 ms       13311
>   modified         8   2029.6 ms       58344
>   modified        16   2060.5 ms       77763
>  longitude         4   3468.5 ms       33818
>  longitude         8   3629.9 ms      214863
>  longitude        16   4060.9 ms     4532032
> {noformat}
> Index time is with 1 thread (for identical index structure).
> The query time is time to run 100 random ranges for that field,
> averaged over 20 iterations.  TermCount is the total number of terms
> the MTQ rewrote to across all 100 queries / segments, and it gets
> higher as expected as precStep gets higher, but the search time is not
> that heavily impacted ... negligible going from 4 to 8, and then some
> impact from 8 to 16.
> Maybe we should increase the int/float default precision step to 8 and
> long/double to 16?  Or both to 16?



--
This message was sent by Atlassian JIRA
(v6.2#6252)
