[
https://issues.apache.org/jira/browse/SOLR-5936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13954476#comment-13954476
]
Uwe Schindler commented on SOLR-5936:
-------------------------------------
Hi Jack,
bq. And if trie really is the best approach for numeric fields, why not just do
all of this under the hood instead of polluting the field type names with
"trie"? IOW, rename TrieIntField to IntField, etc.
This goes back to the introduction of the feature in Lucene 2.9 / Solr 1.4. At
that time everybody was using other field types, and names like IntField,
SortableIntField, ... were already used as *names*. Because of that, it was
introduced to Solr under a name based on the original donated code (by me).
Shortly afterwards, Lucene renamed the field to "NumericField" and the query to
"NumericRangeQuery". The term "trie" is no longer used in Lucene; only the term
"precisionStep" remained (in the documentation) as a configurable setting for
the number of additional terms. So "Trie(Int|Long|Float|Double|Date)Field" is
just there for "backwards compatibility" with earlier indexes (in Solr 1.4),
and now that the name is baked in, there is no way to change it anymore.
+1 to rename for 5.0
bq. As part of this cleanup, could somebody volunteer to create a plain-English
summary of exactly what a trie field really is, what good it is, and why we
can't live without them? I've read the code and, okay, there is a sequence of
bit shifts and generation of extra terms, but in plain English, what's the
point?
See javadocs of NumericRangeQuery.
bq. Specifically, for example, does it matter if a field has an evenly
distributed range of numeric values with little repetition vs. numeric codes
where there is a relatively small number of distinct values (e.g., 1-10, or
scores of 0-100 or dates in years between 1970 and 2014) and relatively high
cardinality?
This does not matter because of the structure of the additional terms. The
number of terms used for actual ranges is almost always close to the expected
number (see javadocs of NRQ). It also does not matter whether it is a date, an
int, or a float. Internally, for trie, there are no floats or dates at all.
Everything is mapped to sortable bits (meaning that if value_a < value_b, then
also bits_of_value_a < bits_of_value_b). The size of the range has no real
effect either: Lucene always matches approximately the same number of terms (a
few hundred at maximum).
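To make the "sortable bits" idea concrete, here is a minimal sketch in Python of one way to map a double to an integer whose ordering matches the numeric ordering. The function name is made up for this example, and the code is an illustration of the principle, not a copy of what Lucene does internally.

```python
import struct


def double_to_sortable_long(value: float) -> int:
    """Map a double to a signed 64-bit int that sorts like the double.

    Hypothetical helper for illustration only.
    """
    # Reinterpret the IEEE 754 bit pattern as a signed 64-bit integer.
    bits = struct.unpack(">q", struct.pack(">d", value))[0]
    # Read as integers, negative doubles sort in reverse order,
    # so flip their lower 63 bits to restore the numeric order.
    if bits < 0:
        bits ^= 0x7FFFFFFFFFFFFFFF
    return bits


values = [-1e10, -3.5, 0.0, 1.5, 2.0, 1e300]
encoded = [double_to_sortable_long(v) for v in values]
# The encoded integers are in the same order as the original values.
print(encoded == sorted(encoded))
```

Once every value is an integer like this, the indexing and range logic no longer needs to know whether the field was a float, a date, or an int.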
Simply put, you are indexing all numbers as bits, like strings of the form
"10110110" (just compressed better), plus additional terms that strip some
bits from the right (like "10110110", "101101", "1011", "10"). Ranges are then
simplified by matching the middle parts of the range with shorter terms that
match more documents. For that algorithm, the distribution of values is not
that important. Index size grows only slightly, because the shorter terms are
rarer (approx. 12% more terms) but have large posting lists (many docs match).
But as those terms match many sequential docs, the posting lists are not so
big (because of the delta encoding). So trie terms raise the index size by
only a few percent, but make range queries extremely fast, because ranges can
be matched with few terms hitting many documents.
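The prefix-term idea can be sketched in a few lines of Python. This is a simplified illustration for small unsigned integers, not Lucene's actual implementation: the names `index_terms` and `split_range`, the 8-bit width, and the step of 2 are all assumptions chosen to match the "10110110" example. A term is written as a pair (shift, prefix), where prefix is the value with `shift` bits stripped from the right.

```python
def index_terms(value, bits=8, step=2):
    """Terms indexed for one value: full precision plus shortened prefixes."""
    return [(shift, value >> shift) for shift in range(0, bits, step)]


def split_range(lo, hi, bits=8, step=2):
    """Cover the inclusive range [lo, hi] with as few prefix terms as this
    simple scheme allows: exact terms at the edges, shorter terms in the middle."""
    terms = []
    shift = 0
    block = 1 << step
    while lo <= hi:
        if shift + step >= bits:
            # Coarsest level reached: emit every remaining prefix.
            terms.extend((shift, p) for p in range(lo, hi + 1))
            break
        # Peel off prefixes at the low edge until lo starts a full block.
        while lo <= hi and lo % block != 0:
            terms.append((shift, lo))
            lo += 1
        # Peel off prefixes at the high edge until hi ends a full block.
        while lo <= hi and hi % block != block - 1:
            terms.append((shift, hi))
            hi -= 1
        if lo > hi:
            break
        # What is left consists of whole blocks: move one level up.
        lo >>= step
        hi >>= step
        shift += step
    return terms


def matches(value, terms):
    """A document matches if any of its indexed terms is a query term."""
    return any(term in terms for term in index_terms(value))


print(index_terms(0b10110110))   # prefixes 10110110, 101101, 1011, 10
query = split_range(3, 200)
print(len(query), "terms cover", 200 - 3 + 1, "values")
```

In this toy setup, the range 3..200 is covered by a handful of terms instead of one term per distinct value, and the count stays small regardless of how the indexed values are distributed.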
bq. I mean, does trie do a uniformly great job for both of these extreme use
cases, including for faceting?
It is not used for faceting. Faceting does not use the additional terms. For
faceting, use DocValues instead of indexed fields. If you want to use Trie
fields and don't want to search on them with ranges, you can switch off the
additional terms by setting precisionStep to 0.
One last note from my side:
I agree with hiding the implementation details from the user. In my opinion,
the user only needs 2 types of numerics: one with precisionStep=4 or 8 (I think
the default in Solr is 8, although I disagree - e.g., Elasticsearch uses the
Lucene default of 4) and another one with precisionStep=infinity (setting 0 in
Solr would do this) for numerics that are only for sorting and don't need range
queries.
> Deprecate non-Trie-based numeric (and date) field types in 4.x and remove
> them from 5.0
> ---------------------------------------------------------------------------------------
>
> Key: SOLR-5936
> URL: https://issues.apache.org/jira/browse/SOLR-5936
> Project: Solr
> Issue Type: Task
> Components: Schema and Analysis
> Reporter: Steve Rowe
> Assignee: Steve Rowe
> Priority: Minor
> Fix For: 4.8, 5.0
>
> Attachments: SOLR-5936.branch_4x.patch, SOLR-5936.branch_4x.patch
>
>
> We've been discouraging people from using non-Trie numeric&date field types
> for years, it's time we made it official.