[ 
https://issues.apache.org/jira/browse/SOLR-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16715516#comment-16715516
 ] 

Toke Eskildsen commented on SOLR-11916:
---------------------------------------

[~hossman] using this field type for distributed faceting can lead to wrong 
results. Maybe this should be noted in the JavaDoc or the Solr documentation?

This can be demonstrated by installing the cloud-version of the 
{{gettingstarted}} sample with

{{./solr -e cloud}}

using defaults all the way, except for {{shards}} which should be {{3}}. After 
that a corpus can be indexed with

( echo '[' ; for J in $(seq 0 99); do ID=$((J)) ; \
  echo "{\"id\":\"$ID\",\"facet_t_sort\":\"a b $J\"},"; done ; \
  echo '{"id":"duplicate_1","facet_t_sort":"a b"},{"id":"duplicate_2","facet_t_sort":"a b"}]' ) | \
  curl -s -d @- -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/gettingstarted/update?commit=true'
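
For readers who prefer to inspect the payload before posting it, here is a minimal Python sketch that builds the same 102 documents as the shell command above. The function name {{build_corpus}} is made up for illustration; only the field name and values come from the command itself.

```python
import json

def build_corpus(n=100):
    """Build the documents from the shell command: n numbered docs plus 2 duplicates."""
    # Documents "a b 0" .. "a b 99", each value occurring exactly once.
    docs = [{"id": str(j), "facet_t_sort": f"a b {j}"} for j in range(n)]
    # Two documents sharing the value "a b", so it is the only term with count 2.
    docs += [{"id": f"duplicate_{k}", "facet_t_sort": "a b"} for k in (1, 2)]
    return docs

# JSON array suitable as the body of the /update request.
payload = json.dumps(build_corpus())
```

The resulting {{payload}} string can be written to a file and sent with the same curl options as above.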

This will index 100 documents with a single-valued field {{facet_t_sort:"a b 
X"}}, where X is the document number, plus 2 documents with 
{{facet_t_sort:"a b"}}. The call

curl 'http://localhost:8983/solr/gettingstarted/select?facet.field=facet_t_sort&facet.limit=5&facet=on&q=*:*&rows=0'

should return "a b" as the top facet term with a count of 2, but instead returns a count of 36:

{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":13,
    "params":{
      "facet.limit":"5",
      "q":"*:*",
      "facet.field":"facet_t_sort",
      "rows":"0",
      "facet":"on"}},
  "response":{"numFound":102,"start":0,"maxScore":1.0,"docs":[]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "facet_t_sort":[
        "a b",36,
        "a b 0",1,
        "a b 1",1,
        "a b 10",1,
        "a b 11",1]},
    "facet_ranges":{},
    "facet_intervals":{},
    "facet_heatmaps":{}}}

The problem is the second phase of simple faceting, where the fine-counting 
happens. In the first phase, "a b" is returned from 1 or 2 of the 3 shards. It 
wins the popularity contest as there are 2 "a b"-terms and only 1 of all the 
other terms. The 1 or 2 shards that did not deliver "a b" in the first phase 
are then queried for the count for "a b", which happens in the form of a 
{{facet_t_sort:"a b"}}-lookup. It seems that this lookup uses the analyzer 
chain and thus matches _all_ the documents in that shard (approximately 
102/3 = 34), which would account for the inflated count of 36.

An alternative would be to do the fine-counting on the DocValues instead, but 
that works very poorly with many values, so that seems more like a trap than a 
solution.
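
To make the suspected mechanism concrete, here is a small Python simulation of the two phases. It is only a sketch under stated assumptions, not Solr's actual code: documents are spread round-robin over 3 shards (real Solr routes by a hash of the id), the analyzer is approximated by lowercasing and whitespace splitting, over-requesting is ignored, and all function names are made up for illustration.

```python
from collections import Counter

def tokenize(value):
    # Stand-in for the analyzer chain (assumption: lowercase + whitespace split).
    return value.lower().split()

def phrase_match(query, value):
    """True if the query tokens occur as a contiguous phrase in the value tokens."""
    q, v = tokenize(query), tokenize(value)
    return any(v[i:i + len(q)] == q for i in range(len(v) - len(q) + 1))

def make_shards(num_shards=3):
    values = [f"a b {j}" for j in range(100)] + ["a b", "a b"]
    shards = [[] for _ in range(num_shards)]
    for i, value in enumerate(values):
        shards[i % num_shards].append(value)  # round-robin routing (assumption)
    return shards

def top_terms(counts, limit):
    # Sort by count descending, then by term, and keep the first `limit` entries.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]

def refine_analyzed(term, shard):
    # Fine-count through the analyzer: every value containing the phrase matches.
    return sum(phrase_match(term, value) for value in shard)

def refine_exact(term, shard):
    # Fine-count on the original strings, as a DocValues-based count would.
    return sum(value == term for value in shard)

def distributed_facet(shards, limit, refine):
    # Phase 1: each shard reports its own top terms, counted on original strings.
    phase1 = [dict(top_terms(Counter(shard), limit)) for shard in shards]
    merged = Counter()
    for reported in phase1:
        merged.update(reported)
    candidates = [term for term, _ in top_terms(merged, limit)]
    # Phase 2: shards that did not report a candidate are asked for its count.
    totals = {}
    for term in candidates:
        totals[term] = sum(
            reported[term] if term in reported else refine(term, shard)
            for shard, reported in zip(shards, phase1)
        )
    return totals

shards = make_shards()
print(distributed_facet(shards, 5, refine_analyzed))  # "a b" is counted 36 times
print(distributed_facet(shards, 5, refine_exact))     # "a b" is counted 2 times
```

With this particular routing, two shards each report "a b" with count 1 and the third shard, which holds no "a b" document, contributes all 34 of its documents during the analyzed refinement. The exact-match variant returns the expected count of 2, but it says nothing about the cost of counting on DocValues with many distinct values.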

> new SortableTextField using docValues built from the original string input
> --------------------------------------------------------------------------
>
>                 Key: SOLR-11916
>                 URL: https://issues.apache.org/jira/browse/SOLR-11916
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Schema and Analysis
>            Reporter: Hoss Man
>            Assignee: Hoss Man
>            Priority: Major
>             Fix For: 7.3, master (8.0)
>
>         Attachments: SOLR-11916.patch, SOLR-11916.patch
>
>
> I propose adding a new SortableTextField subclass that would functionally 
> work the same as TextField except:
>  * {{docValues="true|false"}} could be configured, with the default being 
> "true"
>  * The docValues would contain the original input values (just like StrField) 
> for sorting (or faceting)
>  ** By default, to protect users from excessively large docValues, only the 
> first 1024 characters of each field value would be used – but this could be 
> overridden with configuration.
> ----
> Consider the following sample configuration:
> {code:java}
> <field name="title" type="text_sortable" docValues="true"
>        indexed="true" stored="true" multiValued="false"/>
> <fieldType name="text_sortable" class="solr.SortableTextField">
>   <analyzer type="index">
>    ...
>   </analyzer>
>   <analyzer type="query">
>    ...
>   </analyzer>
> </fieldType>
> {code}
> Given a document with a title of "Solr In Action"
> Users could:
>  * Search for individual (indexed) terms in the "title" field: 
> {{q=title:solr}}
>  * Sort documents by title ( {{sort=title asc}} ) such that this document's 
> sort value would be "Solr In Action"
> If another document had a "title" value that was longer than 1024 chars, then 
> the docValues would be built using only the first 1024 characters of the 
> value (unless the user modified the configuration)
> This would be functionally equivalent to the following existing configuration 
> - including the on disk index segments - except that the on disk DocValues 
> would refer directly to the "title" field, reducing the total number of 
> "field infos" in the index (which has a small impact on segment housekeeping 
> and merge times) and end users would not need to sort on an alternate 
> "title_string" field name - the original "title" field name would always be 
> used directly.
> {code:java}
> <field name="title" type="text"
>        indexed="true" docValues="true" stored="true" multiValued="false"/>
> <field name="title_string" type="string"
>        indexed="false" docValues="true" stored="false" multiValued="false"/>
> <copyField source="title" dest="title_string" maxChars="1024" />
> {code}


