[jira] [Commented] (SOLR-11916) new SortableTextField using docValues built from the original string input
[ https://issues.apache.org/jira/browse/SOLR-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16715609#comment-16715609 ]

Hoss Man commented on SOLR-11916:
---------------------------------

{quote}
[~hossman] using this field type for distributed faceting can lead to wrong results. Maybe this should be noted in the JavaDoc or the Solr documentation?
{quote}

Interesting ... not something I'd considered.

I think you need to file a new bug for this, and then sure -- update the docs to note it as a limitation. Frankly, it seems like (IIUC) the real "bug" is that the faceting code doesn't do its refinement queries in a way that ensures a direct "Term" query (and the field type doesn't know the context of what it's being asked for during refinement, so it builds a PhraseQuery). I.e., I'm guessing you'd see the exact same "bug" if you faceted on a TextField where the index analyzer used KeywordTokenizer but the query analyzer used WhitespaceTokenizer ... but this is a conversation that should be had in a new Jira.

> new SortableTextField using docValues built from the original string input
> --------------------------------------------------------------------------
>
> Key: SOLR-11916
> URL: https://issues.apache.org/jira/browse/SOLR-11916
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Components: Schema and Analysis
> Reporter: Hoss Man
> Assignee: Hoss Man
> Priority: Major
> Fix For: 7.3, master (8.0)
>
> Attachments: SOLR-11916.patch, SOLR-11916.patch
>
>
> I propose adding a new SortableTextField subclass that would functionally
> work the same as TextField except:
> * {{docValues="true|false"}} could be configured, with the default being
> "true"
> * The docValues would contain the original input values (just like StrField)
> for sorting (or faceting)
> ** By default, to protect users from excessively large docValues, only the
> first 1024 characters of each field value would be used – but this could be
> overridden with configuration.
>
> Consider the following sample configuration:
> {code:java}
> indexed="true" docValues="true" stored="true" multiValued="false"/>
> ...
> ...
> {code}
> Given a document with a title of "Solr In Action"
> Users could:
> * Search for individual (indexed) terms in the "title" field:
> {{q=title:solr}}
> * Sort documents by title ( {{sort=title asc}} ) such that this document's
> sort value would be "Solr In Action"
> If another document had a "title" value that was longer than 1024 chars, then
> the docValues would be built using only the first 1024 characters of the
> value (unless the user modified the configuration)
> This would be functionally equivalent to the following existing configuration
> - including the on disk index segments - except that the on disk DocValues
> would refer directly to the "title" field, reducing the total number of
> "field infos" in the index (which has a small impact on segment housekeeping
> and merge times) and end users would not need to sort on an alternate
> "title_string" field name - the original "title" field name would always be
> used directly.
> {code:java}
> indexed="true" docValues="true" stored="true" multiValued="false"/>
> indexed="false" docValues="true" stored="false" multiValued="false"/>
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
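The 1024-character limit described in the quoted issue text can be pictured with a small sketch. This is not Solr's implementation: the class name `DocValuesTruncationSketch`, the constant, and the method `docValueFor` are hypothetical, and the sketch assumes the limit counts Java `char` units, which the issue text does not specify.

```java
final class DocValuesTruncationSketch {

    // Default taken from the issue description; the constant name is hypothetical.
    static final int DEFAULT_MAX_CHARS = 1024;

    /** Returns the prefix of the original input that would be kept for sorting/faceting. */
    static String docValueFor(String original, int maxChars) {
        if (original.length() <= maxChars) {
            return original; // short values are kept unchanged
        }
        // Assumption: plain char-based cut; a real implementation may treat
        // surrogate pairs or over-long values differently.
        return original.substring(0, maxChars);
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2000; i++) sb.append('x');

        System.out.println(docValueFor("Solr In Action", DEFAULT_MAX_CHARS));
        System.out.println(docValueFor(sb.toString(), DEFAULT_MAX_CHARS).length());
    }
}
```

Under these assumptions, two long titles that share their first 1024 characters would get the same sort value, which is the trade-off the configurable limit is meant to control.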
[jira] [Commented] (SOLR-11916) new SortableTextField using docValues built from the original string input
[ https://issues.apache.org/jira/browse/SOLR-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16715516#comment-16715516 ]

Toke Eskildsen commented on SOLR-11916:
---------------------------------------

[~hossman] using this field type for distributed faceting can lead to wrong results. Maybe this should be noted in the JavaDoc or the Solr documentation?

This can be demonstrated by installing the cloud version of the {{gettingstarted}} sample with {{./solr -e cloud}}, using defaults all the way, except for {{shards}}, which should be {{3}}. After that, a corpus can be indexed with

{{( echo '[' ; for J in $(seq 0 99); do ID=$((J)) ; echo "{\"id\":\"$ID\",\"facet_t_sort\":\"a b $J\"},"; done ; echo '{"id":"duplicate_1","facet_t_sort":"a b"},{"id":"duplicate_2","facet_t_sort":"a b"}]' ) | curl -s -d @- -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/gettingstarted/update?commit=true'}}

This will index 100 documents with a single-valued field {{facet_t_sort:"a b X"}}, where X is the document number, plus 2 documents with {{facet_t_sort:"a b"}}. The call

curl 'http://localhost:8983/solr/gettingstarted/select?facet.field=facet_t_sort&facet.limit=5&facet=on&q=*:*&rows=0'

should return "a b" as the top facet term with count 2, but returns

  {
    "responseHeader":{
      "zkConnected":true,
      "status":0,
      "QTime":13,
      "params":{
        "facet.limit":"5",
        "q":"*:*",
        "facet.field":"facet_t_sort",
        "rows":"0",
        "facet":"on"}},
    "response":{"numFound":102,"start":0,"maxScore":1.0,"docs":[]
    },
    "facet_counts":{
      "facet_queries":{},
      "facet_fields":{
        "facet_t_sort":[
          "a b",36,
          "a b 0",1,
          "a b 1",1,
          "a b 10",1,
          "a b 11",1]},
      "facet_ranges":{},
      "facet_intervals":{},
      "facet_heatmaps":{}}}

The problem is the second phase of simple faceting, where the fine-counting happens. In the first phase, "a b" is returned from 1 or 2 of the 3 shards. It wins the popularity contest, as there are 2 "a b" terms and only 1 of each of the other terms. The 1 or 2 shards that did not deliver "a b" in the first phase are then queried for the count for "a b", which happens in the form of a {{facet_t_sort:"a b"}} lookup. It seems that this lookup uses the analyzer chain and thus matches _all_ the documents in that shard (approximately 102/3).

An alternative would be to do the fine-counting on the DocValues instead, but that works very poorly with many values, so that seems more like a trap than a solution.
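The two-phase behaviour described in this comment can be mimicked with a toy model. This is a simplified sketch, not Solr's faceting code: the class and method names are hypothetical, documents are routed round-robin rather than by hash, phase 1 returns exactly `limit` terms without over-requesting, and the "analyzed" lookup is modelled as a whitespace-token phrase match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class FacetRefinementSketch {

    /** Exact match against the original value, as a direct term lookup would behave. */
    static int exactCount(List<String> shardDocs, String term) {
        int n = 0;
        for (String value : shardDocs) {
            if (value.equals(term)) n++;
        }
        return n;
    }

    /** Analyzed lookup: the term is split on spaces and matched as a phrase. */
    static int phraseCount(List<String> shardDocs, String term) {
        List<String> phrase = Arrays.asList(term.split(" "));
        int n = 0;
        for (String value : shardDocs) {
            List<String> tokens = Arrays.asList(value.split(" "));
            if (Collections.indexOfSubList(tokens, phrase) >= 0) n++;
        }
        return n;
    }

    /** Phase 1: a shard's top terms by exact count; ties keep sorted term order. */
    static List<String> topTerms(List<String> shardDocs, int limit) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String value : shardDocs) counts.merge(value, 1, Integer::sum);
        List<String> terms = new ArrayList<>(counts.keySet());
        terms.sort((a, b) -> counts.get(b) - counts.get(a)); // List.sort is stable
        return terms.subList(0, Math.min(limit, terms.size()));
    }

    /** Phase 2: shards that did not report the term in phase 1 are asked to refine. */
    static int total(List<List<String>> shards, String term, int limit, boolean analyzedRefinement) {
        int sum = 0;
        for (List<String> shard : shards) {
            boolean reported = topTerms(shard, limit).contains(term);
            sum += (reported || !analyzedRefinement)
                    ? exactCount(shard, term)
                    : phraseCount(shard, term);
        }
        return sum;
    }

    /** 100 docs "a b J" spread round-robin over 3 shards, plus one "a b" doc on shards 0 and 1. */
    static List<List<String>> sampleShards() {
        List<List<String>> shards = new ArrayList<>();
        for (int s = 0; s < 3; s++) shards.add(new ArrayList<>());
        for (int j = 0; j < 100; j++) shards.get(j % 3).add("a b " + j);
        shards.get(0).add("a b");
        shards.get(1).add("a b");
        return shards;
    }

    public static void main(String[] args) {
        List<List<String>> shards = sampleShards();
        System.out.println("analyzed refinement: " + total(shards, "a b", 5, true));  // 35
        System.out.println("exact refinement:    " + total(shards, "a b", 5, false)); // 2
    }
}
```

In this model the third shard holds 33 documents and no "a b" document, so the analyzed refinement adds all 33 to the two genuine matches. The exact figure depends on how the documents are routed, which is why it differs from the 36 in the response above.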
[jira] [Commented] (SOLR-11916) new SortableTextField using docValues built from the original string input
[ https://issues.apache.org/jira/browse/SOLR-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421209#comment-16421209 ]

Jan Høydahl commented on SOLR-11916:
------------------------------------

Hoss, thanks for the pointer. Being able to configure an analyzer for the docValues would solve this, so the name can still keep its promise. However, changing the docValues analyzer to support Norwegian sorting may break faceting on the same field, which brings you back to copyField anyway :)
[jira] [Commented] (SOLR-11916) new SortableTextField using docValues built from the original string input
[ https://issues.apache.org/jira/browse/SOLR-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421028#comment-16421028 ]

Hoss Man commented on SOLR-11916:
---------------------------------

That is a major part of what's proposed in SOLR-11917 – along with a suggested approach for refactoring SortableTextField to be syntactic sugar after the fact.
[jira] [Commented] (SOLR-11916) new SortableTextField using docValues built from the original string input
[ https://issues.apache.org/jira/browse/SOLR-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421023#comment-16421023 ]

David Smiley commented on SOLR-11916:
-------------------------------------

It would be neat if there was a way to customize the docValue encoding. This would address Jan's point?
[jira] [Commented] (SOLR-11916) new SortableTextField using docValues built from the original string input
[ https://issues.apache.org/jira/browse/SOLR-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420992#comment-16420992 ]

Hoss Man commented on SOLR-11916:
---------------------------------

{{TextFileldWithDV}} or something similar is too limiting in terms of other future fields that might also support both analysis and docValues (see SOLR-11917) ... likewise, {{Facetable}} would be very misleading for people who currently facet on {{TextField}} (via uninversion) and see the individual – post-analysis – terms as the facet constraints.

"Sortable" was chosen to convey that its primary use case is "sorting" on text fields in the same way you can "sort" on StrField.
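The distinction drawn here, between facet constraints made of post-analysis terms and constraints made of the original input, can be sketched as follows. The class and method names are hypothetical, and the "analysis" is assumed to be nothing more than lowercasing plus a split on spaces. Real analyzer chains and Solr's uninversion are considerably more involved.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class FacetConstraintSketch {

    /** One bucket per post-analysis term (assumed analysis: lowercase + split on spaces). */
    static Map<String, Integer> facetOnTerms(List<String> values) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String value : values) {
            String[] tokens = value.toLowerCase(Locale.ROOT).split(" ");
            // Count each distinct term once per document.
            for (String token : new LinkedHashSet<>(Arrays.asList(tokens))) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** One bucket per original input value, as docValues built from the raw string would give. */
    static Map<String, Integer> facetOnOriginal(List<String> values) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String value : values) counts.merge(value, 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList("Solr In Action", "Solr In Action", "Lucene In Action");
        System.out.println(facetOnTerms(titles));    // {action=3, in=3, lucene=1, solr=2}
        System.out.println(facetOnOriginal(titles)); // {Lucene In Action=1, Solr In Action=2}
    }
}
```

Someone used to the first kind of output would get the second kind from a field named "Facetable", which is the surprise the comment warns about.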
[jira] [Commented] (SOLR-11916) new SortableTextField using docValues built from the original string input
[ https://issues.apache.org/jira/browse/SOLR-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420986#comment-16420986 ]

Jan Høydahl commented on SOLR-11916:
------------------------------------

I guess this is intended only for very simple sort use cases that do not require any ICU collation etc.? So if you have any non-English text, you'd probably need to fall back to the copyField trick anyway. Which raises the question of whether *Sortable* in the class name is promising too much. Facetable or TextFileldWithDV could be other choices?
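The collation concern can be illustrated with plain JDK classes. This is only a sketch: `NorwegianSortSketch` is a hypothetical name, `java.text.Collator` stands in for ICU collation, and raw `String` ordering is assumed to approximate how uncollated docValues would sort. Whether the JDK in use ships Norwegian collation rules depends on the runtime, so no output is claimed for the collated case.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class NorwegianSortSketch {

    /** Sorts by raw code unit order, with no language-specific rules. */
    static List<String> sortRaw(List<String> values) {
        List<String> out = new ArrayList<>(values);
        Collections.sort(out);
        return out;
    }

    /** Sorts with the JDK collator for the given locale, if the runtime provides rules for it. */
    static List<String> sortCollated(List<String> values, Locale locale) {
        List<String> out = new ArrayList<>(values);
        out.sort(Collator.getInstance(locale));
        return out;
    }

    public static void main(String[] args) {
        // \u00f8 = o-slash, \u00e6 = ae ligature, \u00e5 = a-ring
        List<String> words = Arrays.asList("\u00f8re", "\u00e6re", "zebra", "\u00e5re");

        // Raw order puts a-ring before the ae ligature; the Norwegian alphabet ends ae, o-slash, a-ring.
        System.out.println(sortRaw(words));
        System.out.println(sortCollated(words, Locale.forLanguageTag("no-NO")));
    }
}
```

The raw ordering places "åre" before "ære", while Norwegian alphabetical order expects "å" last, which is the kind of mismatch that pushes non-English users back to a separate collated sort field.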
[jira] [Commented] (SOLR-11916) new SortableTextField using docValues built from the original string input
[ https://issues.apache.org/jira/browse/SOLR-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349106#comment-16349106 ]

ASF subversion and git services commented on SOLR-11916:
--------------------------------------------------------

Commit fb0e04e5bc0e79eb137e2e7944a2933a19163c35 in lucene-solr's branch refs/heads/branch_7x from Chris Hostetter
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=fb0e04e ]

SOLR-11916: new SortableTextField which supports analysis/searching just like TextField, but also sorting/faceting just like StrField

(cherry picked from commit 95122e14481a4dd623e184ca261f8bf158fd3a7c)
[jira] [Commented] (SOLR-11916) new SortableTextField using docValues built from the original string input
[ https://issues.apache.org/jira/browse/SOLR-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348981#comment-16348981 ]

ASF subversion and git services commented on SOLR-11916:
--------------------------------------------------------

Commit 95122e14481a4dd623e184ca261f8bf158fd3a7c in lucene-solr's branch refs/heads/master from Chris Hostetter
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=95122e1 ]

SOLR-11916: new SortableTextField which supports analysis/searching just like TextField, but also sorting/faceting just like StrField
[jira] [Commented] (SOLR-11916) new SortableTextField using docValues built from the original string input
[ https://issues.apache.org/jira/browse/SOLR-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347859#comment-16347859 ]

Hoss Man commented on SOLR-11916:
---------------------------------

bq. re: useDocValuesAsStored -- here's my straw man proposal after sleeping on it a bit...

I've updated the patch to implement this along with new tests. Unless there are additional concerns/suggestions, I'll move forward w/committing & backporting tomorrow.

bq. I had a number of use cases with an indexed field that required faceting over the original input (this would work with this type of field too, right?).

Yep yep ... absolutely. For example, this sort of logic is currently in {{TestSortableTextField.testSimpleSearchAndFacets()}} ...

{code}
assertU(adoc("id","1", "whitespace_stxt", "how now brown cow ?"));
assertU(adoc("id","2", "whitespace_stxt", "how now brown cow ?"));
assertU(adoc("id","3", "whitespace_stxt", "holy cow !"));
assertU(adoc("id","4", "whitespace_stxt", "dog and cat"));
assertU(commit());

final String facet = "whitespace_stxt";
final String search = "whitespace_stxt";

// facet.field
final String fpre = "//lst[@name='facet_fields']/lst[@name='"+facet+"']/";
assertQ(req("q", search + ":cow", "rows", "0",
            "facet.field", facet, "facet", "true")
        , "//*[@numFound='3']"
        , fpre + "int[@name='how now brown cow ?'][.=2]"
        , fpre + "int[@name='holy cow !'][.=1]"
        , fpre + "int[@name='dog and cat'][.=0]"
        );

// json facet
final String jpre = "//lst[@name='facets']/lst[@name='x']/arr[@name='buckets']/";
assertQ(req("q", search + ":cow", "rows", "0",
            "json.facet", "{x:{ type: terms, field:'" + facet + "', mincount:0 }}")
        , "//*[@numFound='3']"
        , jpre + "lst[str[@name='val'][.='how now brown cow ?']][int[@name='count'][.=2]]"
        , jpre + "lst[str[@name='val'][.='holy cow !']][int[@name='count'][.=1]]"
        , jpre + "lst[str[@name='val'][.='dog and cat']][int[@name='count'][.=0]]"
        );
{code}

...although in the actual test the "whitespace_stxt" field is copyFielded into many other fields with slightly different configurations, and the "facet" and "search" variables are assigned in nested loops to prove that the "search" field behavior is consistent as long as the fields are indexed, and the "facet" field behavior is consistent as long as the fields have docValues. (In the latest patch, I even updated this to include a traditional TextField copy in the "search" permutations, and a traditional StrField copy in the "facet" permutations.)
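For readers without a Solr checkout handy, here is a toy model of what the test above is asserting. It is only a sketch and not Solr code: the class and method names are made up for illustration, "search" is approximated by whitespace tokenization, and "faceting" is approximated by counting over the untouched original strings (which is what the docValues would hold).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical toy model: search on whitespace tokens, facet on the original input strings. */
final class SortableTextFacetSketch {

  /**
   * Returns facet counts keyed by the original field value.
   * Values whose documents do not match the term are still listed with a count of 0,
   * mirroring the mincount:0 behavior exercised in the test.
   */
  static Map<String, Integer> facetCounts(List<String> docs, String term) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    for (String original : docs) {
      // "indexed" side: the analyzer is approximated by splitting on whitespace
      boolean matches = Arrays.asList(original.split("\\s+")).contains(term);
      // "docValues" side: the unmodified original string is the facet key
      counts.merge(original, matches ? 1 : 0, Integer::sum);
    }
    return counts;
  }
}
```

Feeding it the four documents from the test with the term "cow" yields the same three buckets the XPath assertions check: the two "how now brown cow ?" docs share one bucket, "holy cow !" has its own, and "dog and cat" shows up with a zero count.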
[jira] [Commented] (SOLR-11916) new SortableTextField using docValues built from the original string input
[ https://issues.apache.org/jira/browse/SOLR-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16345484#comment-16345484 ]

Dawid Weiss commented on SOLR-11916:

Just wanted to say this is indeed a fairly common scenario in my (limited) Solr experience. I had a number of use cases with an indexed field that required faceting over the original input (this would work with this type of field too, right?). The workaround with a different copyTo field wasn't too pretty...

> new SortableTextField using docValues built from the original string input
> --------------------------------------------------------------------------
>
>                 Key: SOLR-11916
>                 URL: https://issues.apache.org/jira/browse/SOLR-11916
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public)
>            Reporter: Hoss Man
>            Assignee: Hoss Man
>            Priority: Major
>         Attachments: SOLR-11916.patch
>
>
> I propose adding a new SortableTextField subclass that would functionally
> work the same as TextField except:
> * {{docValues="true|false"}} could be configured, with the default being
> "true"
> * The docValues would contain the original input values (just like StrField)
> for sorting (or faceting)
> ** By default, to protect users from excessively large docValues, only the
> first 1024 characters of each field value would be used -- but this could be
> overridden with configuration.
>
> Consider the following sample configuration:
> {code:java}
> indexed="true" docValues="true" stored="true" multiValued="false"/>
>
>
>...
>
>
>...
>
>
> {code}
> Given a document with a title of "Solr In Action"
> Users could:
> * Search for individual (indexed) terms in the "title" field:
> {{q=title:solr}}
> * Sort documents by title ( {{sort=title asc}} ) such that this document's
> sort value would be "Solr In Action"
> If another document had a "title" value that was longer than 1024 chars, then
> the docValues would be built using only the first 1024 characters of the
> value (unless the user modified the configuration)
> This would be functionally equivalent to the following existing configuration
> - including the on disk index segments - except that the on disk DocValues
> would refer directly to the "title" field, reducing the total number of
> "field infos" in the index (which has a small impact on segment housekeeping
> and merge times) and end users would not need to sort on an alternate
> "title_string" field name - the original "title" field name would always be
> used directly.
> {code:java}
> indexed="true" docValues="true" stored="true" multiValued="false"/>
> indexed="false" docValues="true" stored="false" multiValued="false"/>
>
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11916) new SortableTextField using docValues built from the original string input
[ https://issues.apache.org/jira/browse/SOLR-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16345180#comment-16345180 ]

David Smiley commented on SOLR-11916:
-------------------------------------

+1 awesome -- best of both worlds :-)
[jira] [Commented] (SOLR-11916) new SortableTextField using docValues built from the original string input
[ https://issues.apache.org/jira/browse/SOLR-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344527#comment-16344527 ]

Hoss Man commented on SOLR-11916:
---------------------------------

re: useDocValuesAsStored -- here's my straw man proposal after sleeping on it a bit...

* SortableTextField.init should override the schemaVersion based implicit default in FieldType.init
** this means that, by default, no fieldType/field using SortableTextField will default to useDocValuesAsStored=true
* SortableTextField.createFields should be aware of the effective value of SchemaField.useDocValuesAsStored and, if it's true, fail (_at index time_) if any field values being added are longer than the (effective) maxCharsForDocValues
** this error message should be very clear about what's happening, mentioning both maxCharsForDocValues and useDocValuesAsStored.

Net result:

* clients that try to add huge values to fields with maxCharsForDocValues=small may get 2 different behaviors depending on the field's useDocValuesAsStored:
** if useDocValuesAsStored==false:
*** docValues are truncated
** if useDocValuesAsStored==true:
*** the request fails because Solr can't "fit" the huge value into the "small" limit that's been configured
* ie: "the schema told us doc values should be limited to 'small' and to use doc values as if they were stored fields, and we can't meet those two expectations for your 'huge' field value, so we're rejecting it"

...I'm pretty sure this is all doable (even if useDocValuesAsStored is specified on either the fieldType or the field) and I'll test it out soon.
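The "net result" in the straw man proposal above could be sketched roughly like this. It is a standalone illustration and not the patch itself: the class name, method name, and exception type are all hypothetical, and the only things taken from the thread are the two settings (maxCharsForDocValues, useDocValuesAsStored) and the convention that -1 means "no limit".

```java
/** Hypothetical sketch of the truncate-or-reject decision; not taken from the SOLR-11916 patch. */
final class DocValuesLimitSketch {

  /**
   * Computes the string that would go into docValues for one field value.
   *
   * @param value                the original input string
   * @param maxCharsForDocValues the configured limit; a negative value is assumed to mean "no limit"
   * @param useDocValuesAsStored the effective setting for the field
   * @throws IllegalArgumentException if the value is too long and docValues must stand in for stored values
   */
  static String docValuesValue(String value, int maxCharsForDocValues, boolean useDocValuesAsStored) {
    if (maxCharsForDocValues < 0 || value.length() <= maxCharsForDocValues) {
      return value; // fits (or no limit): docValues hold the full original input
    }
    if (useDocValuesAsStored) {
      // a truncated value can't honestly be returned as the "stored" value, so reject the document
      throw new IllegalArgumentException(
          "Field value of length " + value.length()
              + " exceeds maxCharsForDocValues=" + maxCharsForDocValues
              + " and can't be truncated because useDocValuesAsStored=true");
    }
    return value.substring(0, maxCharsForDocValues); // sorting/faceting only: truncation is acceptable
  }
}
```

The point of putting both setting names in the error message is the one made in the proposal: a client who hits this needs to see which two schema expectations are in conflict.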
[jira] [Commented] (SOLR-11916) new SortableTextField using docValues built from the original string input
[ https://issues.apache.org/jira/browse/SOLR-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341765#comment-16341765 ]

Hoss Man commented on SOLR-11916:
---------------------------------

Hmmm... good point.

I thought that since this new type didn't go out of its way to enable "useDocValuesAsStored", it was a non-issue and the docValues would _never_ be used in place of a stored value (but even if that were true, we should definitely have a test proving it).

Skimming the relevant code now, I realize that as a FieldProperty, the schemaVersion is the only thing that drives the (default) value of USE_DOCVALUES_AS_STORED regardless of the FieldType impl -- so you are absolutely correct, we need to do "something" in {{SortableTextField}} to account for this property.

{quote}Perhaps this fieldType overrides useDocValuesAsStored() to see if maxChars=-1 (no limit) so it can vary it's output based on that?
{quote}

Hmm ... My concern with that approach is that it might be confusing to users who explicitly set {{useDocValuesAsStored="true" stored="false"}} on a fieldType (or field) -- perhaps w/o even being aware of the default maxChars safety valve -- and then don't understand why they aren't getting any values back?

One possibility would be for {{SortableTextField.init}} to override the (implicit) default {{useDocValuesAsStored=(schemaVersion>=1.6)}} with its own default based on {{useDocValuesAsStored=(maxChars==-1)}} _and_ fail with a server error (on init) if a configuration includes an explicit {{useDocValuesAsStored=true maxChars="anything other than -1"}} ?

Personally, my vote would be -- at least initially -- to just say "useDocValuesAsStored is not supported for SortableTextField", set the default appropriately & fail on init if anyone tries to explicitly set it to "true", but since FieldProperties can be set on both the {{fieldType}} and the {{field}}, I don't think it would even be possible for a fieldType to *stop* someone from creating a {{
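As a rough illustration of the "one possibility" floated above (default derived from maxChars, hard failure on a contradictory explicit setting), the init-time decision might look something like the following. This is a sketch under stated assumptions only: the class and method are hypothetical, a null argument stands in for "not explicitly configured", and it deliberately ignores the complication raised in the same comment that the property can also be set on the individual field.

```java
/** Hypothetical sketch of resolving useDocValuesAsStored at init time; not actual Solr code. */
final class UseDocValuesAsStoredDefaultSketch {

  /**
   * @param explicit the explicitly configured useDocValuesAsStored, or null if the schema didn't set it
   * @param maxChars the configured docValues character limit; -1 means "no limit"
   * @return the effective useDocValuesAsStored value
   * @throws IllegalStateException if the explicit setting contradicts a finite maxChars
   */
  static boolean resolve(Boolean explicit, int maxChars) {
    boolean unlimited = (maxChars == -1);
    if (explicit == null) {
      // implicit default: docValues can only stand in for stored values if they are never truncated
      return unlimited;
    }
    if (explicit && !unlimited) {
      throw new IllegalStateException(
          "useDocValuesAsStored=true is not compatible with maxChars=" + maxChars
              + "; use maxChars=-1 or set useDocValuesAsStored=false");
    }
    return explicit;
  }
}
```

With this shape, a schema that says nothing gets {{false}} under the default 1024 limit and {{true}} only when the limit is removed, which avoids the "why am I not getting any values back?" confusion described above.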
[jira] [Commented] (SOLR-11916) new SortableTextField using docValues built from the original string input
[ https://issues.apache.org/jira/browse/SOLR-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341722#comment-16341722 ]

David Smiley commented on SOLR-11916:
-------------------------------------

Cool!

{{maxChars}} raises some concerns that need to be documented and cared for in the code. We'd have a field with docValues, but its value isn't necessarily a substitute for the "stored" version. We have code in places right now that will need to know about this exception to useDocValuesAsStored. Perhaps this fieldType overrides useDocValuesAsStored() to see if maxChars=-1 (no limit) so it can vary its output based on that?