[jira] [Commented] (SOLR-11916) new SortableTextField using docValues built from the original string input

2018-12-10 Thread Hoss Man (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16715609#comment-16715609
 ] 

Hoss Man commented on SOLR-11916:
-

{quote}[~hossman] using this field type for distributed faceting can lead to 
wrong results. Maybe this should be noted in the JavaDoc or the Solr 
documentation?
{quote}

Interesting ... not something i'd considered.  I think you need to file a new 
Bug for this, and then sure -- update the docs to note it as a limitation -- 
frankly it seems like (IIUC) the real "bug" is that the faceting code 
doesn't do its refinement queries in a way that ensures a direct "Term" query 
(and the field type doesn't know the context of what it's being asked for 
during refinement, so it builds a PhraseQuery) -- ie: i'm guessing you'd see 
the exact same "bug" if you faceted on a TextField where the index analyzer 
used KeywordTokenizer but the query analyzer used WhitespaceTokenizer ... but 
this is a conversation that should be had in a new jira.
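
The analyzer-mismatch case guessed at above can be illustrated with a toy model. This is only a sketch, not Solr code: `keyword_tokenize`, `whitespace_tokenize`, `build_index`, `term_count` and `analyzed_phrase_count` are hypothetical helpers that imitate a KeywordTokenizer index analyzer, a WhitespaceTokenizer query analyzer, a direct term lookup and a phrase lookup. In this toy the analyzed lookup undercounts rather than overcounts, but the cause is the same: the refinement value goes through query-time analysis instead of being matched as a single term.

```python
def keyword_tokenize(text):
    """Imitates KeywordTokenizer: the whole input becomes one term."""
    return [text]


def whitespace_tokenize(text):
    """Imitates WhitespaceTokenizer: split on whitespace."""
    return text.split()


def build_index(values, index_analyzer):
    """Map each doc id to the list of terms produced at index time."""
    return {doc_id: index_analyzer(v) for doc_id, v in enumerate(values)}


def term_count(index, term):
    """Direct term lookup: the facet value is used as-is, with no analysis."""
    return sum(1 for terms in index.values() if term in terms)


def analyzed_phrase_count(index, text, query_analyzer):
    """Analyze the facet value at query time, then match it as a phrase."""
    query = query_analyzer(text)
    n = len(query)
    return sum(
        1
        for terms in index.values()
        if any(terms[i:i + n] == query for i in range(len(terms) - n + 1))
    )


index = build_index(["a b", "a b", "a b 0"], keyword_tokenize)
print(term_count(index, "a b"))                                  # direct lookup
print(analyzed_phrase_count(index, "a b", whitespace_tokenize))  # analyzed lookup
```

The direct lookup finds the two documents whose indexed term is exactly "a b"; the analyzed lookup searches for the tokens "a" and "b", which never appear as separate terms in this index.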

> new SortableTextField using docValues built from the original string input
> --
>
> Key: SOLR-11916
> URL: https://issues.apache.org/jira/browse/SOLR-11916
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Schema and Analysis
>Reporter: Hoss Man
>Assignee: Hoss Man
>Priority: Major
> Fix For: 7.3, master (8.0)
>
> Attachments: SOLR-11916.patch, SOLR-11916.patch
>
>
> I propose adding a new SortableTextField subclass that would functionally 
> work the same as TextField except:
>  * {{docValues="true|false"}} could be configured, with the default being 
> "true"
>  * The docValues would contain the original input values (just like StrField) 
> for sorting (or faceting)
>  ** By default, to protect users from excessively large docValues, only the 
> first 1024 characters of each field value would be used – but this could be 
> overridden with configuration.
> 
> Consider the following sample configuration:
> {code:java}
> indexed="true" docValues="true" stored="true" multiValued="false"/>
> 
>   
>...
>   
>   
>...
>   
> 
> {code}
> Given a document with a title of "Solr In Action"
> Users could:
>  * Search for individual (indexed) terms in the "title" field: 
> {{q=title:solr}}
>  * Sort documents by title ( {{sort=title asc}} ) such that this document's 
> sort value would be "Solr In Action"
> If another document had a "title" value that was longer than 1024 chars, then 
> the docValues would be built using only the first 1024 characters of the 
> value (unless the user modified the configuration)
> This would be functionally equivalent to the following existing configuration 
> - including the on disk index segments - except that the on disk DocValues 
> would refer directly to the "title" field, reducing the total number of 
> "field infos" in the index (which has a small impact on segment housekeeping 
> and merge times) and end users would not need to sort on an alternate 
> "title_string" field name - the original "title" field name would always be 
> used directly.
> {code:java}
> indexed="true" docValues="true" stored="true" multiValued="false"/>
> indexed="false" docValues="true" stored="false" multiValued="false"/>
> 
> {code}
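
The behaviour proposed above can be summarised in a small model. This is a sketch under stated assumptions, not the Solr implementation: `analyze`, `index_doc`, `search` and `sort_by_title` are hypothetical names, the analyzer is reduced to lowercasing plus whitespace splitting, and the 1024-character limit is the default named in the proposal.

```python
MAX_DOCVALUE_CHARS = 1024  # default limit named in the proposal


def analyze(text):
    """Hypothetical stand-in for the analyzer chain: lowercase + whitespace split."""
    return text.lower().split()


def index_doc(title, max_chars=MAX_DOCVALUE_CHARS):
    """Index one title: analyzed terms for search, original input for sorting."""
    return {
        "title": title,
        "terms": analyze(title),          # what q=title:solr is matched against
        "sort_value": title[:max_chars],  # docValues: original input, truncated
    }


def search(docs, term):
    """Return the documents whose analyzed terms contain the given term."""
    return [d for d in docs if term in d["terms"]]


def sort_by_title(docs):
    """Sort on the (possibly truncated) original input, like sort=title asc."""
    return sorted(docs, key=lambda d: d["sort_value"])


docs = [index_doc("Solr In Action"), index_doc("Lucene " + "x" * 2000)]
hits = search(docs, "solr")
ordered = sort_by_title(docs)
print([d["title"] for d in hits])
print([len(d["sort_value"]) for d in ordered])
```

The point of the sketch is that one field name serves both purposes: term matching uses the analyzed tokens, while ordering uses the untouched input string capped at the configured length.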



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11916) new SortableTextField using docValues built from the original string input

2018-12-10 Thread Toke Eskildsen (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16715516#comment-16715516
 ] 

Toke Eskildsen commented on SOLR-11916:
---

[~hossman] using this field type for distributed faceting can lead to wrong 
results. Maybe this should be noted in the JavaDoc or the Solr documentation?

This can be demonstrated by installing the cloud-version of the 
{{gettingstarted}} sample with

{{./solr -e cloud}}

using defaults all the way, except for {{shards}} which should be {{3}}. After 
that a corpus can be indexed with

( echo '[' ; for J in $(seq 0 99); do echo "{\"id\":\"$J\",\"facet_t_sort\":\"a b $J\"},"; done ; \
  echo '{"id":"duplicate_1","facet_t_sort":"a b"},{"id":"duplicate_2","facet_t_sort":"a b"}]' ) \
  | curl -s -d @- -X POST -H 'Content-Type: application/json' \
    'http://localhost:8983/solr/gettingstarted/update?commit=true'

This will index 100 documents with a single-valued field {{facet_t_sort:"a b 
X"}} where X is the document number + 2 documents with {{facet_t_sort:"a b"}}. 
The call

curl 'http://localhost:8983/solr/gettingstarted/select?facet.field=facet_t_sort&facet.limit=5&facet=on&q=*:*&rows=0'

should return "a b" as the top facet term with count 2, but returns

{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":13,
    "params":{
      "facet.limit":"5",
      "q":"*:*",
      "facet.field":"facet_t_sort",
      "rows":"0",
      "facet":"on"}},
  "response":{"numFound":102,"start":0,"maxScore":1.0,"docs":[]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "facet_t_sort":[
        "a b",36,
        "a b 0",1,
        "a b 1",1,
        "a b 10",1,
        "a b 11",1]},
    "facet_ranges":{},
    "facet_intervals":{},
    "facet_heatmaps":{}}}

The problem is the second phase of simple faceting, where the fine-counting 
happens. In the first phase, "a b" is returned from 1 or 2 of the 3 shards. It 
wins the popularity contest as there are 2 "a b"-terms and only 1 of all the 
other terms. The 1 or 2 shards that did not deliver "a b" in the first phase 
are then queried for the count for "a b", which happens in the form of a 
{{facet_t_sort:"a b"}}-lookup. It seems that this lookup uses the analyzer 
chain and thus matches _all_ the documents in that shard (approximately 102/3).

An alternative would be to do the fine-counting on the DocValues instead, but 
that works very poorly with many values, so that seems more like a trap than a 
solution.
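
The two phases described above can be simulated in a few lines. This is a simplified sketch, not Solr's faceting code: documents are spread round-robin over three shards (an assumption; Solr routes by a hash of the id), each shard reports only its top `limit` terms without the over-requesting real Solr does, and `phrase_count` / `exact_count` are hypothetical stand-ins for an analyzed lookup and a direct term lookup during refinement.

```python
from collections import Counter


def build_shards(num_shards=3):
    """100 docs 'a b <n>' plus 2 docs 'a b', spread round-robin (assumption)."""
    docs = [f"a b {j}" for j in range(100)] + ["a b", "a b"]
    shards = [[] for _ in range(num_shards)]
    for i, value in enumerate(docs):
        shards[i % num_shards].append(value)
    return shards


def top_terms(counts, limit):
    """Highest counts first, ties broken by term order."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


def phrase_count(shard, term):
    """Refinement through the analyzer: match the whitespace tokens as a phrase."""
    query = term.split()
    n = len(query)

    def matches(value):
        tokens = value.split()
        return any(tokens[i:i + n] == query for i in range(len(tokens) - n + 1))

    return sum(1 for value in shard if matches(value))


def exact_count(shard, term):
    """Refinement as a direct lookup of the complete original value."""
    return sum(1 for value in shard if value == term)


def distributed_facet(shards, limit, refine):
    # Phase 1: every shard reports its local top terms.
    reported = [dict(top_terms(Counter(shard), limit)) for shard in shards]
    merged = Counter()
    for shard_report in reported:
        merged.update(shard_report)
    candidates = [term for term, _ in top_terms(merged, limit)]
    # Phase 2: shards that did not report a candidate are asked for its count.
    totals = {}
    for term in candidates:
        totals[term] = sum(
            report[term] if term in report else refine(shard, term)
            for shard, report in zip(shards, reported)
        )
    return top_terms(totals, limit)


shards = build_shards()
print(distributed_facet(shards, 5, phrase_count))
print(distributed_facet(shards, 5, exact_count))
```

With this layout exactly one shard holds no "a b" document. When it is asked to refine, the phrase-style lookup matches all 34 of its documents, so the merged count becomes 2 + 34 = 36; the exact lookup returns 0 for that shard and the total stays at 2. That the toy total equals the 36 in the response above depends on this particular distribution and should not be read as proof of how the real shards were populated.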


[jira] [Commented] (SOLR-11916) new SortableTextField using docValues built from the original string input

2018-03-31 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16421209#comment-16421209
 ] 

Jan Høydahl commented on SOLR-11916:


Hoss, thanks for the pointer, being able to configure analyzer for docValue 
would solve this, so the name can still keep its promise, although changing 
analyzer for docValue to support Norwegian sorting may break faceting on the 
same field, which brings you back to copyField anyways :)




[jira] [Commented] (SOLR-11916) new SortableTextField using docValues built from the original string input

2018-03-30 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16421028#comment-16421028
 ] 

Hoss Man commented on SOLR-11916:
-

That is a major part of what's proposed in SOLR-11917 – along with a suggested 
approach for refactoring SortableTextField to be syntactic sugar after the fact.




[jira] [Commented] (SOLR-11916) new SortableTextField using docValues built from the original string input

2018-03-30 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16421023#comment-16421023
 ] 

David Smiley commented on SOLR-11916:
-

It would be neat if there was a  to customize the 
docValue encoding.  This would address Jan's point?




[jira] [Commented] (SOLR-11916) new SortableTextField using docValues built from the original string input

2018-03-30 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16420992#comment-16420992
 ] 

Hoss Man commented on SOLR-11916:
-

{{TextFileldWithDV}} or something similar is too limiting in terms of other 
future fields that might also support both analysis and docvalues (see 
SOLR-11917) ... likewise {{Facetable}} would be very misleading for people who 
currently facet on {{TextField}} (via uninversion) and see the individual – 
post-analysis – terms as the facet constraints.

"Sortable" was chosen to convey that its primary use case is "sorting" on text 
fields in the same way you can "sort" on StrField.




[jira] [Commented] (SOLR-11916) new SortableTextField using docValues built from the original string input

2018-03-30 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SOLR-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16420986#comment-16420986
 ] 

Jan Høydahl commented on SOLR-11916:


Guess this is intended only for very simple sort use cases that do not require 
any ICU collation etc? So if you have any non-English text you’d probably need 
to fall back to the copyField trick anyway. Which begs the question whether 
*Sortable* in class name is promising too much? Facetable or TextFileldWithDV 
could be other choices?
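
The collation concern can be made concrete with Norwegian, where æ, ø and å sort after z in that order. The sketch below is only an illustration: `NORWEGIAN_ALPHABET` and `norwegian_key` are hypothetical, hand-written stand-ins for what a real collator such as ICU would provide, and the plain `sorted` call stands in for ordering on the raw string value.

```python
NORWEGIAN_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {ch: i for i, ch in enumerate(NORWEGIAN_ALPHABET)}


def norwegian_key(text):
    """Rank each letter by Norwegian alphabet position.

    Characters outside the alphabet sort last (a simplification).
    """
    return [RANK.get(ch, len(RANK)) for ch in text.lower()]


titles = ["Ål", "Zebra", "Øl", "Ære"]

codepoint_order = sorted(titles)                     # order of the raw strings
collated_order = sorted(titles, key=norwegian_key)   # Norwegian alphabet order

print(codepoint_order)
print(collated_order)
```

Sorting the raw strings places "Ål" before "Ære" and "Øl" because of their code points, whereas Norwegian readers would expect "Ål" last. A field that sorts on the original input without a collation step gives the first ordering.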




[jira] [Commented] (SOLR-11916) new SortableTextField using docValues built from the original string input

2018-02-01 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349106#comment-16349106
 ] 

ASF subversion and git services commented on SOLR-11916:


Commit fb0e04e5bc0e79eb137e2e7944a2933a19163c35 in lucene-solr's branch 
refs/heads/branch_7x from Chris Hostetter
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=fb0e04e ]

SOLR-11916: new SortableTextField which supports analysis/searching just like 
TextField, but also sorting/faceting just like StrField

(cherry picked from commit 95122e14481a4dd623e184ca261f8bf158fd3a7c)





[jira] [Commented] (SOLR-11916) new SortableTextField using docValues built from the original string input

2018-02-01 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16348981#comment-16348981
 ] 

ASF subversion and git services commented on SOLR-11916:


Commit 95122e14481a4dd623e184ca261f8bf158fd3a7c in lucene-solr's branch 
refs/heads/master from Chris Hostetter
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=95122e1 ]

SOLR-11916: new SortableTextField which supports analysis/searching just like 
TextField, but also sorting/faceting just like StrField





[jira] [Commented] (SOLR-11916) new SortableTextField using docValues built from the original string input

2018-01-31 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16347859#comment-16347859
 ] 

Hoss Man commented on SOLR-11916:
-


bq. re: useDocValuesAsStored – here's my straw man proposal after sleeping on 
it a bit...

I've updated the patch to implement this along with new tests.

Unless there are additional concerns/suggestions, i'll move forward 
w/committing & backporting tomorrow.



bq. I had a number of use cases with an indexed field that required faceting 
over the original input (this would work with this type of field too, right?).

Yep yep ... absolutely.

For example, this sort of logic is currently in 
{{TestSortableTextField.testSimpleSearchAndFacets()}} ...

{code}
assertU(adoc("id","1", "whitespace_stxt", "how now brown cow ?"));
assertU(adoc("id","2", "whitespace_stxt", "how now brown cow ?"));
assertU(adoc("id","3", "whitespace_stxt", "holy cow !"));
assertU(adoc("id","4", "whitespace_stxt", "dog and cat"));

assertU(commit());

final String facet = "whitespace_stxt";
final String search = "whitespace_stxt";
// facet.field
final String fpre = "//lst[@name='facet_fields']/lst[@name='"+facet+"']/";
assertQ(req("q", search + ":cow", "rows", "0", 
"facet.field", facet, "facet", "true")
, "//*[@numFound='3']"
, fpre + "int[@name='how now brown cow ?'][.=2]"
, fpre + "int[@name='holy cow !'][.=1]"
, fpre + "int[@name='dog and cat'][.=0]"
);

// json facet
final String jpre = 
"//lst[@name='facets']/lst[@name='x']/arr[@name='buckets']/";
assertQ(req("q", search + ":cow", "rows", "0", 
"json.facet", "{x:{ type: terms, field:'" + facet + "', mincount:0 
}}")
, "//*[@numFound='3']"
, jpre + "lst[str[@name='val'][.='how now brown cow 
?']][int[@name='count'][.=2]]"
, jpre + "lst[str[@name='val'][.='holy cow 
!']][int[@name='count'][.=1]]"
, jpre + "lst[str[@name='val'][.='dog and 
cat']][int[@name='count'][.=0]]"
);
{code}
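
The counts asserted above can be reproduced with a toy model in plain Java. This is only a sketch, not Solr code: {{FacetOnOriginalSketch}}, {{matches}} and {{facet}} are hypothetical names invented for this example, and it assumes plain whitespace tokenization for search while facets are counted over the whole original string (with zero-count buckets kept, as with mincount:0).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model (not Solr code): search matches tokens, facets count original values. */
final class FacetOnOriginalSketch {

  /** True if any whitespace-separated token of the value equals the term. */
  static boolean matches(String value, String term) {
    return Arrays.asList(value.split("\\s+")).contains(term);
  }

  /** Counts matching docs per original value; every distinct value gets a bucket. */
  static Map<String, Integer> facet(List<String> values, String term) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    for (String v : values) {
      counts.putIfAbsent(v, 0);            // keep the bucket even with zero hits
      if (matches(v, term)) {
        counts.merge(v, 1, Integer::sum);  // one more matching doc for this value
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    List<String> docs = Arrays.asList(
        "how now brown cow ?", "how now brown cow ?", "holy cow !", "dog and cat");
    System.out.println(facet(docs, "cow"));
  }
}
```

Running {{main}} prints {{{how now brown cow ?=2, holy cow !=1, dog and cat=0}}}, which mirrors the facet counts the test expects for a search on "cow".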

...although in the actual test: the "whitespace_stxt" field is copyFielded 
into many other fields w/ slightly diff configurations, and the "facet" and 
"search" variables are assigned in nested loops to prove that the "search" 
field behavior is consistent as long as the fields are indexed & the "facet" 
field behavior is consistent as long as the fields have docValues.

(In the latest patch, I even updated this to include a traditional TextField 
copy in the "search" permutations, and a traditional StrField copy in the 
"facet" permutations.)



> new SortableTextField using docValues built from the original string input
> --
>
> Key: SOLR-11916
> URL: https://issues.apache.org/jira/browse/SOLR-11916
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Schema and Analysis
>Reporter: Hoss Man
>Assignee: Hoss Man
>Priority: Major
> Attachments: SOLR-11916.patch, SOLR-11916.patch
>
>
> I propose adding a new SortableTextField subclass that would functionally 
> work the same as TextField except:
>  * {{docValues="true|false"}} could be configured, with the default being 
> "true"
>  * The docValues would contain the original input values (just like StrField) 
> for sorting (or faceting)
>  ** By default, to protect users from excessively large docValues, only the 
> first 1024 of each field value would be used – but this could be overridden 
> with configuration.
> 
> Consider the following sample configuration:
> {code:java}
> indexed="true" docValues="true" stored="true" multiValued="false"/>
> 
>   
>...
>   
>   
>...
>   
> 
> {code}
> Given a document with a title of "Solr In Action"
> Users could:
>  * Search for individual (indexed) terms in the "title" field: 
> {{q=title:solr}}
>  * Sort documents by title ( {{sort=title asc}} ) such that this document's 
> sort value would be "Solr In Action"
> If another document had a "title" value that was longer than 1024 chars, then 
> the docValues would be built using only the first 1024 characters of the 
> value (unless the user modified the configuration)
> This would be functionally equivalent to the following existing configuration 
> - including the on disk index segments - except that the on disk DocValues 
> would refer directly to the "title" field, reducing the total number of 
> "field infos" in the index (which has a small impact on segment housekeeping 
> and merge times) and end users would not need to sort on an alternate 
> "title_string" field name - the original "title" field name would always be 
> used directly.
> {code:java}
> indexed="true" docValues="true" stored="true" multiValued="false"/>
> indexed="false" 

[jira] [Commented] (SOLR-11916) new SortableTextField using docValues built from the original string input

2018-01-30 Thread Dawid Weiss (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16345484#comment-16345484
 ] 

Dawid Weiss commented on SOLR-11916:


Just wanted to say this is indeed a fairly common scenario in my (limited) Solr 
experience. I had a number of use cases with an indexed field that required 
faceting over the original input (this would work with this type of field too, 
right?). The workaround with a different copyTo field wasn't too pretty...

> new SortableTextField using docValues built from the original string input
> --
>
> Key: SOLR-11916
> URL: https://issues.apache.org/jira/browse/SOLR-11916
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Hoss Man
>Assignee: Hoss Man
>Priority: Major
> Attachments: SOLR-11916.patch
>
>
> I propose adding a new SortableTextField subclass that would functionally 
> work the same as TextField except:
>  * {{docValues="true|false"}} could be configured, with the default being 
> "true"
>  * The docValues would contain the original input values (just like StrField) 
> for sorting (or faceting)
>  ** By default, to protect users from excessively large docValues, only the 
> first 1024 of each field value would be used – but this could be overridden 
> with configuration.
> 
> Consider the following sample configuration:
> {code:java}
> indexed="true" docValues="true" stored="true" multiValued="false"/>
> 
>   
>...
>   
>   
>...
>   
> 
> {code}
> Given a document with a title of "Solr In Action"
> Users could:
>  * Search for individual (indexed) terms in the "title" field: 
> {{q=title:solr}}
>  * Sort documents by title ( {{sort=title asc}} ) such that this document's 
> sort value would be "Solr In Action"
> If another document had a "title" value that was longer than 1024 chars, then 
> the docValues would be built using only the first 1024 characters of the 
> value (unless the user modified the configuration)
> This would be functionally equivalent to the following existing configuration 
> - including the on disk index segments - except that the on disk DocValues 
> would refer directly to the "title" field, reducing the total number of 
> "field infos" in the index (which has a small impact on segment housekeeping 
> and merge times) and end users would not need to sort on an alternate 
> "title_string" field name - the original "title" field name would always be 
> used directly.
> {code:java}
> indexed="true" docValues="true" stored="true" multiValued="false"/>
> indexed="false" docValues="true" stored="false" multiValued="false"/>
> 
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11916) new SortableTextField using docValues built from the original string input

2018-01-30 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16345180#comment-16345180
 ] 

David Smiley commented on SOLR-11916:
-

+1 awesome -- best of both worlds :-)




[jira] [Commented] (SOLR-11916) new SortableTextField using docValues built from the original string input

2018-01-29 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16344527#comment-16344527
 ] 

Hoss Man commented on SOLR-11916:
-

re: useDocValuesAsStored -- here's my straw man proposal after sleeping on it a 
bit...

* SortableTextField.init should override the schemaVersion based implicit 
default in FieldType.init
** this means by default, no fieldType/field using SortableTextField will default 
to useDocValuesAsStored
* SortableTextField.createFields should be aware of the effective value of 
SchemaField.useDocValuesAsStored and if it's true: fail (_at index time_) if 
any field values being added are longer than the (effective) 
maxCharsForDocValues
** this error message should be very clear about what's happening, mentioning 
both maxCharsForDocValues, and useDocValuesAsStored.

Net result: 
* clients that try to add huge values to fields with maxCharsForDocValues=small 
may get 2 diff behaviors depending on field's useDocValuesAsStored:
** if useDocValuesAsStored==false:
*** docvalues are truncated
** if useDocValuesAsStored==true:
*** request fails because solr can't "fit" the huge value into the "small" 
limit that's been configured
* ie: "the schema told us doc values should be limited to 'small' and to use 
doc values as if they were stored fields, and we can't meet those two 
expectations for your 'huge' field value, so we're rejecting it"


...i'm pretty sure this is all doable (even if the useDocValuesAsStored is 
specified on either the fieldType or the field) and i'll test it out soon.
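
The net result described above could be sketched roughly as follows. This is not the patch: the class and method names are made up for illustration, the exception type is an arbitrary choice, and treating a negative limit as "no limit" is an assumption borrowed from the maxChars=-1 convention mentioned earlier in this issue.

```java
/** Illustrative only: names and exception type are assumptions, not the patch. */
final class DocValuesLimitSketch {

  /**
   * Returns the string that would go into docValues for one field value.
   * A negative maxCharsForDocValues is treated as "no limit" (assumption).
   * Length is measured in Java chars.
   */
  static String docValueFor(String value, int maxCharsForDocValues,
                            boolean useDocValuesAsStored) {
    boolean tooLong = maxCharsForDocValues >= 0
        && value.length() > maxCharsForDocValues;
    if (!tooLong) {
      return value;                         // fits: docValues hold the full input
    }
    if (useDocValuesAsStored) {
      // Truncating here would silently corrupt what is returned as the "stored" value.
      throw new IllegalArgumentException(
          "Field value of length " + value.length()
          + " exceeds maxCharsForDocValues=" + maxCharsForDocValues
          + " and cannot be truncated because useDocValuesAsStored=true");
    }
    return value.substring(0, maxCharsForDocValues);  // docValues are truncated
  }

  public static void main(String[] args) {
    System.out.println(docValueFor("Solr In Action", 4, false));
  }
}
```

With the limit set to 4 and useDocValuesAsStored=false, {{main}} prints {{Solr}}; the same call with useDocValuesAsStored=true throws instead of truncating.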




[jira] [Commented] (SOLR-11916) new SortableTextField using docValues built from the original string input

2018-01-26 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341765#comment-16341765
 ] 

Hoss Man commented on SOLR-11916:
-

Hmmm... good point.

I thought that since this new type didn't go out of its way to enable 
"useDocValuesAsStored" that it was a non-issue and the docValues would _never_ 
be used in place of a stored value (but even if that were true, we should 
definitely have a test proving it)

Skimming the relevant code now I realize that as a FieldProperty, the 
schemaVersion is the only thing that drives the (default) value of 
USE_DOCVALUES_AS_STORED regardless of the FieldType impl – so you are 
absolutely correct, we need to do "something" in {{SortableTextField}} to 
account for this property.
{quote}Perhaps this fieldType overrides useDocValuesAsStored() to see if 
maxChars=-1 (no limit) so it can vary its output based on that?
{quote}
 
Hmm ... My concern with that approach is that it might be confusing to users 
who explicitly set {{useDocValuesAsStored="true" stored="false"}} on a 
fieldType (or field) – perhaps w/o even being aware of the default maxChars 
safety valve – and then don't understand why they aren't getting any values 
back.

One possibility would be for {{SortableTextField.init}} to override the 
(implicit) default {{useDocValuesAsStored=(schemaVersion>1.6)}} with its own 
default based on {{useDocValuesAsStored=(maxChars==-1)}} _and_ fail with a 
server error (on init) if a configuration includes an explicit 
{{useDocValuesAsStored=true maxChars="anything other than -1"}} ?
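
A minimal sketch of that possibility, assuming a hypothetical helper that receives the explicitly configured value (or null when nothing was configured) and the maxChars limit. None of these names come from the patch, and the exception type is just a stand-in for "server error on init":

```java
/** Illustrative only: a hypothetical init-time check, not the actual field type code. */
final class UseDocValuesAsStoredDefaultSketch {

  /**
   * Resolves the effective useDocValuesAsStored value.
   *
   * @param explicitSetting the configured value, or null if the schema did not set it
   * @param maxChars        the docValues length limit; -1 means no limit
   */
  static boolean resolve(Boolean explicitSetting, int maxChars) {
    boolean unlimited = (maxChars == -1);
    if (explicitSetting == null) {
      return unlimited;                     // implicit default: (maxChars == -1)
    }
    if (explicitSetting && !unlimited) {
      // Explicit "true" combined with a limit is contradictory: fail on init.
      throw new IllegalStateException(
          "useDocValuesAsStored=true requires maxChars=-1, but maxChars=" + maxChars);
    }
    return explicitSetting;
  }

  public static void main(String[] args) {
    System.out.println(resolve(null, 1024));  // default limit in place
    System.out.println(resolve(null, -1));    // no limit configured
  }
}
```

Here {{main}} prints {{false}} and then {{true}}: the default follows the limit, and only an explicit {{true}} combined with a limit is rejected.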

Personally, my vote would be – at least initially – to just say 
"useDocValuesAsStored is not supported for SortableTextField", set the default 
appropriately & fail on init if anyone tries to explicitly set it to "true" 
 but since FieldProperties can be set on both the {{fieldType}} and the 
{{field}} I don't think it would even be possible for a fieldType to *stop* 
someone from creating a {{



[jira] [Commented] (SOLR-11916) new SortableTextField using docValues built from the original string input

2018-01-26 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16341722#comment-16341722
 ] 

David Smiley commented on SOLR-11916:
-

Cool!

{{maxChars}} raises some concerns that need to be documented and cared for in 
the code.  We'd have a field with docValues, but its value isn't necessarily a 
substitute for the "stored" version.  We have code in places right now that 
will need to know about this exception to useDocValuesAsStored.  Perhaps this 
fieldType overrides useDocValuesAsStored() to see if maxChars=-1 (no limit) so 
it can vary its output based on that?
