[
https://issues.apache.org/jira/browse/SOLR-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901306#action_12901306
]
Joan Codina commented on SOLR-1997:
-----------------------------------
With respect to SOLR-1535, from what I understand it allows loading data that
was generated externally in a given format, which is not processed by Solr but
indexed as desired. This issue is slightly different: here we do process the
data with Solr, but we store it after processing, not before (as Solr usually
does).
With respect to SOLR-314, I think the idea here is much simpler: to store
something different from what is in the input, while always using the existing
Solr analyzers. The idea is that the output of one analyzer is used as the
input of a field. As the field stores its input as is, the output of the
analyzer is what gets stored.
Why? Well, for many reasons. For example, if the text includes payloads, we
don't want to show them. Or we may want to remove some labels...
We can decide to do half of the processing with the previous analyzer and then
do some extra processing in the field. In this way we can control what we
store and what we index.
I think these are a few lines of code that add functionality to the schema, so
once they are integrated, users don't need to program.
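To make the stored-versus-indexed distinction concrete, here is a minimal
sketch in plain Python. It is not Solr code and does not reflect how the
attached patches are implemented. The function names are hypothetical, and it
assumes that the source type splits on whitespace, drops "|"-delimited
payloads, and that the stored value is the resulting tokens joined with single
spaces. The input is the payload example from the issue description.

```python
# Plain-Python illustration only; none of these names are Solr APIs.

def source_analyzer(text):
    """Stand-in for the source type: whitespace tokenizer + payload removal."""
    tokens = []
    for raw in text.split():
        # Split "good|1.0" into term and payload; the payload is simply
        # discarded here (a real payload filter would attach it to the token).
        term, _, _payload = raw.partition("|")
        tokens.append(term)
    return tokens


def analyzed_field(text):
    """Return (stored_value, indexed_tokens) for the hypothetical field."""
    # Assumption: the stored value is the source type's output, space-joined.
    stored = " ".join(source_analyzer(text))
    # The field's own analyzer (here just a whitespace split) runs on that.
    indexed = stored.split()
    return stored, indexed


stored, indexed = analyzed_field("A|2.0 good|1.0 day|3.0")
print(stored)   # A good day
print(indexed)  # ['A', 'good', 'day']
```

A regular stored field would return "A|2.0 good|1.0 day|3.0" on retrieval; in
this sketch the retrieved value is the cleaned text instead.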
> analyzed field: Store internal value instead of input one
> ---------------------------------------------------------
>
> Key: SOLR-1997
> URL: https://issues.apache.org/jira/browse/SOLR-1997
> Project: Solr
> Issue Type: New Feature
> Affects Versions: 1.4, 1.4.1, 1.5
> Reporter: Joan Codina
> Fix For: 1.4, 1.4.1, 1.5
>
> Attachments: SOLR-1997-1.4.patch, SOLR-1997-1.5.patch
>
>
> Solr implements a set of filters and tokenizers that allow the filtering and
> treatment of text, but when the field is set to be stored, the text stored is
> the input one. This may be useful when the end user reads the input, but it is
> not always what is wanted, for example when there are payloads and the text
> is something like A|2.0 good|1.0 day|3.0, or when the result of a query is
> processed using something like Carrot2.
> So this is a simple new kind of field that takes as input the output of a
> given type (source), and then performs the normal processing with the desired
> tokenizers and filters. The difference is that the stored value is the
> output of the source type, and this is what is retrieved when getting the
> document.
> The name of the field type is AnalyzedField, and it is introduced in the
> schema in the following way, creating the analyzedSourceType from the
> SourceType:
> <fieldType name="SourceType" class="solr.TextField" >
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory" />
>     <filter class......." />
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory" />
>     <filter ....." />
>   </analyzer>
> </fieldType>
> <fieldType name="analyzedSourceType" class="solr.AnalyzedField"
>     positionIncrementGap="100" preProcessType="SourceType">
>   <analyzer>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>   </analyzer>
> </fieldType>
> Often just the WhitespaceTokenizerFactory is needed, as the tokens have
> already been split by the SourceType.
> Finally, a field can be declared as
> <field name="analyzedData" type="analyzedSourceType" indexed="true"
>     stored="true" termVectors="true" multiValued="true"/>
> which can be written to directly, or can be defined as a copy of the source
> field:
> <field name="Data" type="analyzedSourceType" indexed="true" stored="true"
>     termVectors="true" multiValued="true"/>
> ...
> <copyField source="Data" dest="analyzedData"/>