[jira] [Comment Edited] (SOLR-1535) Pre-analyzed field type

John Berryman (JIRA) Wed, 20 Mar 2013 12:01:19 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13608011#comment-13608011
 ]


John Berryman edited comment on SOLR-1535 at 3/20/13 7:00 PM:
--------------------------------------------------------------

Ah, I see. This is a bit lower level than I was thinking. Still useful, but 
different. I was thinking of having PreAnalyzedField extend TextField directly 
rather than FieldType, so that you could build up whatever analysis chain you 
want in the usual TextField sense. Query analysis would proceed as with a 
normal TextField, but index analysis would detect whether the input was 
already parsed. If the input was not parsed, it would go through the normal 
analysis. If it was already parsed, the token stream would go straight into 
the index (the assumption being that someone upstream understands what they're 
doing).
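
A minimal sketch of that index-time dispatch, in plain Java with no Solr or 
Lucene dependencies. Everything here is an assumption made for illustration: 
the class and method names, the "#preanalyzed#" marker, and the pipe-separated 
token list are hypothetical and are not the serialization format defined by 
the SOLR-1535 patch. The analyze() method is only a stand-in for whatever 
index analysis chain the field would be configured with.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class PreAnalyzedDispatchSketch {
    // Hypothetical marker that flags input as already analyzed upstream.
    static final String MARKER = "#preanalyzed#";

    // Stand-in for the configured index analyzer: whitespace split + lowercase.
    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Index-time entry point: pass pre-analyzed tokens through untouched,
    // otherwise fall back to the normal analysis chain.
    public static List<String> indexTokens(String input) {
        if (input.startsWith(MARKER)) {
            String body = input.substring(MARKER.length());
            if (body.isEmpty()) {
                return new ArrayList<>();
            }
            return new ArrayList<>(Arrays.asList(body.split("\\|")));
        }
        return analyze(input);
    }

    public static void main(String[] args) {
        System.out.println(indexTokens("Hello Solr World"));        // [hello, solr, world]
        System.out.println(indexTokens("#preanalyzed#Hello|Solr")); // [Hello, Solr]
    }
}
```

Note that the pre-analyzed tokens are not lowercased: the sketch trusts the 
upstream producer, which is exactly the assumption stated above.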

This way, you could build some extra functionality into the SolrJ client so 
that the PreAnalyzedTextFields would be parsed client side and then sent to 
Solr. In my current application, we have one Solr instance and N indexers on 
different machines. The setup described here would take a big load off of 
Solr. The other benefit is that query analysis proceeds as it always does. I 
don't understand how someone would search over a PreAnalyzed field as it 
currently stands without a bit of extra work or custom code on the client.
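
A rough sketch of the client half, again in plain Java. It deliberately does 
not call SolrJ: the class name, the "#preanalyzed#" marker, and the 
pipe-separated encoding are hypothetical, and the analysis shown is only a 
placeholder for whatever chain the indexer machines would actually run. The 
resulting string is what an indexer would set as the field value before 
sending the document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ClientSidePreAnalysisSketch {
    // Hypothetical marker telling the server that analysis already happened.
    static final String MARKER = "#preanalyzed#";

    // Placeholder for the analysis work moved onto the indexer machines.
    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Build the field value to send. Limitation of this toy encoding:
    // tokens that contain '|' would need escaping.
    public static String toPreAnalyzedValue(String rawText) {
        return MARKER + String.join("|", analyze(rawText));
    }

    public static void main(String[] args) {
        System.out.println(toPreAnalyzedValue("Quick Brown Fox")); // #preanalyzed#quick|brown|fox
    }
}
```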

One pitfall of my idea is that you'd have to create similar 
PreAnalyzedIntField, PreAnalyzedLocationField, PreAnalyzedDateField, etc. 
classes. I wish Java had mixins or multiple inheritance.
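
One possible way around the lack of mixins is composition: wrap any existing 
index-time analysis in a single pre-analyzed decorator instead of subclassing 
each field type. The sketch below is plain Java under loud assumptions: the 
IndexAnalysis interface, the preAnalyzed() wrapper, and the "#preanalyzed#" 
marker are all hypothetical and do not correspond to Solr's FieldType API. 
Whether Solr's field type classes could be wrapped this way is not something 
this sketch establishes.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class PreAnalyzedWrapperSketch {
    /** Hypothetical stand-in for a field type's index-time analysis. */
    public interface IndexAnalysis {
        List<String> tokens(String input);
    }

    // Hypothetical marker that flags input as already analyzed upstream.
    static final String MARKER = "#preanalyzed#";

    // Decorator: adds pre-analyzed detection to any delegate, so no
    // per-type subclass (int, date, location, ...) is needed.
    public static IndexAnalysis preAnalyzed(IndexAnalysis delegate) {
        return input -> input.startsWith(MARKER)
                ? Arrays.asList(input.substring(MARKER.length()).split("\\|"))
                : delegate.tokens(input);
    }

    public static void main(String[] args) {
        // Two toy delegates standing in for a text field and an int field.
        IndexAnalysis text = in ->
                Arrays.asList(in.toLowerCase(Locale.ROOT).trim().split("\\s+"));
        IndexAnalysis ints = in ->
                Collections.singletonList(String.valueOf(Integer.parseInt(in.trim())));

        System.out.println(preAnalyzed(text).tokens("Hello Solr"));       // [hello, solr]
        System.out.println(preAnalyzed(ints).tokens(" 42 "));             // [42]
        System.out.println(preAnalyzed(ints).tokens("#preanalyzed#4|2")); // [4, 2]
    }
}
```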

Thoughts?
                
> Pre-analyzed field type
> -----------------------
>
>                 Key: SOLR-1535
>                 URL: https://issues.apache.org/jira/browse/SOLR-1535
>             Project: Solr
>          Issue Type: New Feature
>    Affects Versions: 1.5
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>             Fix For: 4.0-ALPHA
>
>         Attachments: preanalyzed.patch, preanalyzed.patch, SOLR-1535.patch, 
> SOLR-1535.patch, SOLR-1535.patch
>
>
> PreAnalyzedFieldType provides functionality to index (and optionally store) 
> content that was already processed and split into tokens by some external 
> processing chain. This implementation defines a serialization format for 
> sending tokens with any currently supported Attributes (e.g. type, posIncr, 
> payload, ...). This data is deserialized into a regular TokenStream that is 
> returned by Field.tokenStreamValue() and thus added to the index as index 
> terms, plus an optional stored part that is returned by Field.stringValue() 
> and is then added as the stored value of the field.
> This field type is useful for integrating Solr with existing text-processing 
> pipelines, such as third-party NLP systems.

