[
https://issues.apache.org/jira/browse/SOLR-11917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16341687#comment-16341687
]
Hoss Man commented on SOLR-11917:
---------------------------------
h1. Some Concrete Thoughts On *S*olutions
*NOTE:* While there is a one-to-one correspondence in the naming/numbering of
the *U*secases listed above and the proposed *S*olutions listed below, I have
ordered the *S*olutions in the way that I think makes the most sense from an
"explaining how to achieve things" standpoint.
----
h2. *S1.1*: A 'SortableTextField' that builds docValues using the original text input
h3. *S1.1G*: Goal
A new SortableTextField subclass would be added that would functionally work
the same as TextField except:
* {{docValues="true|false"}} could be configured, with the default being "true"
* The docValues would contain (a prefix of) the original input values (just
like StrField) for sorting (or faceting)
** By default, to protect users from excessively large docValues, only the
first 1024 characters of each field value would be used – but this could be
overridden with configuration.
h3. *S1.1E*: Example Usage
Consider the following sample configuration:
{code:java}
<field name="title" type="text_sortable"
indexed="true" docValues="true" stored="true" multiValued="false"/>
<fieldType name="text_sortable" class="solr.SortableTextField">
<analyzer type="index">
...
</analyzer>
<analyzer type="query">
...
</analyzer>
</fieldType>
{code}
Given a document with a title of "Solr In Action"
Users could:
* Search for individual (indexed) terms in the "title" field: {{q=title:solr}}
* Sort documents by title ( {{sort=title asc}} ) such that this document's
sort value would be "Solr In Action"
If another document had a "title" value that was longer than 1024 chars, then
the docValues would be built using only the first 1024 characters of the value
(unless the user modified the configuration).
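To make the truncation rule concrete, here is a minimal plain-Java sketch of picking the docValues value for one field value. The class and method names ({{DocValuesPrefix}}, {{truncate}}) are hypothetical and are not taken from the SOLR-11916 patch. It also assumes the limit counts Java chars and backs off by one char rather than split a surrogate pair, neither of which is specified above.

```java
// Hypothetical helper for illustration only; not code from the SOLR-11916 patch.
final class DocValuesPrefix {
    /** Default limit described above (assumed here to count Java chars). */
    static final int DEFAULT_MAX_CHARS = 1024;

    /** Returns the value to use for docValues: the input itself, or a prefix of at most maxChars chars. */
    static String truncate(String value, int maxChars) {
        if (maxChars < 0) {
            throw new IllegalArgumentException("maxChars must be >= 0");
        }
        if (value.length() <= maxChars) {
            return value;
        }
        int end = maxChars;
        // Assumption: never cut a surrogate pair in half, so drop the dangling high surrogate.
        if (end > 0 && Character.isHighSurrogate(value.charAt(end - 1))
                && Character.isLowSurrogate(value.charAt(end))) {
            end--;
        }
        return value.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println(truncate("Solr In Action", DEFAULT_MAX_CHARS)); // prints "Solr In Action"
        System.out.println(truncate("Solr In Action", 4));                 // prints "Solr"
    }
}
```

Only the docValues side would see the shortened value; the indexed terms and the stored value would still come from the full input.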
NOTE: This would be functionally equivalent to the following existing
configuration - including the on disk index segments - except that the on disk
DocValues would refer directly to the "title" field, reducing the total number
of "field infos" in the index (which has a small impact on segment housekeeping
and merge times) and end users would not need to sort on an alternate
"title_string" field name - the original "title" field name would always be
used directly.
{code:java}
<field name="title" type="text"
indexed="true" docValues="false" stored="true" multiValued="false"/>
<field name="title_string" type="string"
indexed="false" docValues="true" stored="false" multiValued="false"/>
<copyField source="title" dest="title_string" maxChars="1024" />
{code}
h3. *S1.1A*: Suggested Approach (SOLR-11916)
While experimenting with a quick POC for this idea, I actually wound up
building a {{SortableTextField}} that is feature complete. See patch in
SOLR-11916.
NOTE: If/when *S1.3A* is implemented, this SortableTextField could be
refactored to be syntactic sugar for TextField w/ some added defaults – see
below.
----
h2. *S1.2*: A 'TermDocValuesTextField' that builds docValues using the post-analysis terms
h3. *S1.2G*: Goal
A new TermDocValuesTextField subclass would be added that would functionally
work the same as TextField except:
* {{docValues="true|false"}} could be configured, with the default being "true"
* Instances of fields using this type would support faceting (or sorting),
using DocValues built from the terms produced by the "index" analyzer
** NOTE: Sorting on this type of field would only make sense in some special
circumstances depending on the analyzer used (e.g. KeywordTokenizer)
h3. *S1.2E*: Example Usage
Consider the following sample configuration:
{code:java}
<field name="keywords" type="text_facet"
indexed="true" docValues="true" stored="true" multiValued="true"/>
<fieldType name="text_facet" class="solr.TermDocValuesTextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory" rule="unicode"/>
...
</analyzer>
</fieldType>
<field name="author" type="text_lc_sort"
indexed="true" docValues="true" stored="true" multiValued="false"/>
<fieldType name="text_lc_sort" class="solr.TermDocValuesTextField">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
{code}
Given a document with an author of "Grainger, Trey" and keywords value of
"book lucene solr"
Users could:
* Search for individual (indexed) terms in the "keywords" field:
{{q=keywords:book}}
* Facet on the keywords field ( {{facet.field=keywords}} ) such that if this
were the only document in the index, the facet counts would be "book=1,
lucene=1, solr=1"
* Sort documents by author (sort=author asc) such that this document's sort
value would be "grainger, trey"
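As a rough illustration of the values involved, the plain-Java sketch below approximates the two analyzer chains from the sample configuration without using any Lucene classes. The names ({{TermDocValuesSketch}} and its methods) are made up for this example, and {{String.split}} / {{toLowerCase}} are only stand-ins for the real tokenizer and filter behavior.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Plain-Java stand-ins for the two analyzer chains above; hypothetical names, no Lucene classes.
final class TermDocValuesSketch {

    /** Approximates a whitespace tokenizer: each distinct token becomes one docValues entry. */
    static SortedSet<String> whitespaceTerms(String input) {
        SortedSet<String> terms = new TreeSet<>();
        for (String token : input.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    /** Approximates KeywordTokenizer + LowerCaseFilter: the whole input is one lowercased term. */
    static String keywordLowercaseTerm(String input) {
        return input.toLowerCase(Locale.ROOT);
    }

    /** Counts, per term, how many of the given values (one per document) contain it. */
    static Map<String, Integer> facetCounts(List<String> keywordValues) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String value : keywordValues) {
            for (String term : whitespaceTerms(value)) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(facetCounts(Arrays.asList("book lucene solr"))); // {book=1, lucene=1, solr=1}
        System.out.println(keywordLowercaseTerm("Grainger, Trey"));         // grainger, trey
    }
}
```

The point is only that the facet and sort values are the post-analysis terms, not the original input.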
NOTE: This should be functionally equivalent to users faceting on a "keywords"
TextField (or sorting on an "author" TextField using KeywordTokenizer) today,
except that the facet/sort values would come from DocValues (written at
indexing time), and not the FieldCache (built on the fly at query time and held
solely in RAM).
h3. *S1.2A*: Suggested Approach
* Add a new TermDocValuesTextField subclass of TextField
* if docValues="true":
** Augment the configured "index" analyzer to record each resulting token from
the stream in a Set
** When indexing, pre-analyze/buffer the token stream and use the recorded Set
of tokens to build additional SortedSetDocValuesField instances in the
underlying indexed document
* OPTIMIZATION?: We may be able to avoid the pre-analysis/buffering of the
TokenStream and instead hook into the low level indexing code with a callback
to generate the SortedSetDocValuesField instances on the fly as the
DocumentsWriter reads from the (original) TokenStream ... needs
experimentation/refactoring once we have some tests.
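The record-and-buffer step above could look roughly like the following plain-Java sketch. It uses an {{Iterator<String>}} as a stand-in for Lucene's TokenStream, and {{RecordingTokens}} is a hypothetical name. A real implementation would also have to carry token attributes (positions, offsets, etc.), which this ignores.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical wrapper; an Iterator<String> stands in for the analyzer's token stream.
final class RecordingTokens implements Iterator<String> {
    private final Iterator<String> delegate;
    private final SortedSet<String> seen = new TreeSet<>();

    RecordingTokens(Iterator<String> delegate) {
        this.delegate = delegate;
    }

    @Override
    public boolean hasNext() {
        return delegate.hasNext();
    }

    /** Passes each token through unchanged while remembering it. */
    @Override
    public String next() {
        String token = delegate.next();
        seen.add(token);
        return token;
    }

    /** Distinct tokens consumed so far; each would become one docValues entry. */
    SortedSet<String> recorded() {
        return seen;
    }

    /** Drains the stream into a list so the indexer could replay the tokens later. */
    static List<String> buffer(RecordingTokens tokens) {
        List<String> buffered = new ArrayList<>();
        while (tokens.hasNext()) {
            buffered.add(tokens.next());
        }
        return buffered;
    }

    public static void main(String[] args) {
        RecordingTokens tokens = new RecordingTokens(Arrays.asList("solr", "book", "solr").iterator());
        List<String> replay = buffer(tokens);
        System.out.println(replay);            // [solr, book, solr]
        System.out.println(tokens.recorded()); // [book, solr]
    }
}
```

The buffered list is what makes this approach cost extra memory per field value, which is why the callback-based optimization is worth exploring.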
NOTE: If/when *S1.3A* is implemented, this TermDocValuesTextField could be
refactored to be syntactic sugar for TextField w/ some added defaults – see
below.
> A Potential Roadmap for robust multi-analyzer TextFields w/various options
> for configuring docValues
> ----------------------------------------------------------------------------------------------------
>
> Key: SOLR-11917
> URL: https://issues.apache.org/jira/browse/SOLR-11917
> Project: Solr
> Issue Type: Wish
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Hoss Man
> Assignee: Hoss Man
> Priority: Major
>
> A while back, I was tasked at my day job to brainstorm & design some "smarter
> field types" in Solr. In particular to think about:
> # How to simplify some of the "special things" people have to know about
> Solr behavior when creating their schemas
> # How to reduce the number of situations where users have to copy/clone one
> "logical field" into multiple "schema fields" in order to meet diff use cases
> The main result of this thought exercise is a handful of usecases/goals that
> people seem to have - many of which are already tracked in existing jiras -
> along with a high level design/roadmap of potential solutions for these goals
> that can be implemented incrementally to leverage some common changes (and
> what those changes might look like).
> My intention is to use this jira as a place to share these ideas for broader
> community discussion, and as a central linkage point for the related jiras.
> (details to follow in a very looooooong comment)
> ----
> NOTE: I am not (at this point) personally committing to following through on
> implementing every aspect of these ideas :)