[ 
https://issues.apache.org/jira/browse/SOLR-12518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuki Yano updated SOLR-12518:
-----------------------------
    Attachment: SOLR-12518.patch

> PreAnalyzedField fails to index documents without tokens
> --------------------------------------------------------
>
>                 Key: SOLR-12518
>                 URL: https://issues.apache.org/jira/browse/SOLR-12518
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: update
>            Reporter: Yuki Yano
>            Priority: Minor
>         Attachments: SOLR-12518.patch
>
>
> h1. Overview
> {{PreAnalyzedField}} fails to index documents whose pre-analyzed value 
> contains no tokens, such as the following:
> {code:java}
> {
>   "v": "1",
>   "str": "foo",
>   "tokens": []
> }
> {code}
> h1. Details
> {{PreAnalyzedField}} consumes field values that have been analyzed in 
> advance. A pre-analyzed value has the following format:
> {code:java}
> {
>   "v":"1",
>   "str":"test",
>   "tokens": [
>     {"t":"one","s":123,"e":128,"i":22,"p":"DQ4KDQsODg8=","y":"word"},
>     {"t":"two","s":5,"e":8,"i":1,"y":"word"},
>     {"t":"three","s":20,"e":22,"i":1,"y":"foobar"}
>   ]
> }
> {code}
> As [the 
> documentation|https://lucene.apache.org/solr/guide/7_3/working-with-external-files-and-processes.html#WorkingwithExternalFilesandProcesses]
>  mentions, {{"str"}} and {{"tokens"}} are optional, i.e., each key may have 
> an empty value or be omitted entirely. However, when {{"tokens"}} is empty 
> or not defined, {{PreAnalyzedField}} throws an IOException and fails to 
> index the document.
> This error stems from the behavior of {{Field#tokenStream}}. This method 
> creates a {{TokenStream}} in the following steps (NOTE: assuming 
> {{indexed=true}}):
>  * If the field has a {{tokenStream}} value, it returns that value.
>  * Otherwise, it creates a {{tokenStream}} by parsing the stored value.
> If the pre-analyzed value has no tokens, the second step is executed. 
> Unfortunately, {{PreAnalyzedField}} always returns {{PreAnalyzedAnalyzer}} 
> as the index analyzer, and the stored value (i.e., the value of {{"str"}}) 
> is not in the pre-analyzed format, so this step fails with a pre-analyzed 
> format error (i.e., an IOException).
> h1. How to reproduce
> 1. Download the latest Solr package and set up a Solr server according to 
> the [Solr Tutorial|http://lucene.apache.org/solr/guide/7_3/solr-tutorial.html].
> 2. Add the following fieldType and field to the schema.
> {code:xml}
>     <fieldType name="preanalyzed-with-analyzer" class="solr.PreAnalyzedField">
>       <analyzer>
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>       </analyzer>
>     </fieldType>
>     <field name="pre_with_analyzer" type="preanalyzed-with-analyzer" 
> indexed="true" stored="true" multiValued="false"/>
> {code}
> 3. Index the following documents. Solr throws an IOException for the second and third.
> {code:java}
> // This is OK
> {"id": 1, "pre_with_analyzer": "{\'v\':\'1\',\'str\':\'document one\',\'tokens\':[{\'t\':\'one\'},{\'t\':\'two\'},{\'t\':\'three\',\'i\':100}]}"}
> // Solr throws IOException
> {"id": 2, "pre_with_analyzer": "{\'v\':\'1\',\'str\':\'document two\', \'tokens\':[]}"}
> // Solr throws IOException
> {"id": 3, "pre_with_analyzer": "{\'v\':\'1\',\'str\':\'document three\'}"}
> {code}
> h1. How to fix
> Because there is nothing to analyze when {{"tokens"}} is empty or not set, 
> we can avoid this error by setting an {{EmptyTokenStream}} as the 
> {{tokenStream}} in that case, as in the following code:
> {code:java}
> parse.hasTokenStream() ? parse : new EmptyTokenStream()
> {code}
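
The failure and the proposed fix can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not Solr's or Lucene's actual code: the class {{PreAnalyzedSketch}}, its {{Parsed}} type, and the methods {{analyzeAsPreAnalyzed}}, {{indexWithoutFix}}, and {{indexWithFix}} are hypothetical names, token streams are modeled as plain lists of strings, and the format check is deliberately crude. It only mirrors the two steps of {{Field#tokenStream}} described in the issue.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.List;

public class PreAnalyzedSketch {

    /** Hypothetical stand-in for a parsed pre-analyzed value ("str" plus "tokens"). */
    static final class Parsed {
        final String str;
        final List<String> tokens; // empty when "tokens" is [] or absent

        Parsed(String str, List<String> tokens) {
            this.str = str;
            this.tokens = tokens;
        }

        boolean hasTokenStream() {
            return !tokens.isEmpty();
        }
    }

    /** Stand-in for PreAnalyzedAnalyzer: rejects input that is not in pre-analyzed format. */
    static List<String> analyzeAsPreAnalyzed(String value) throws IOException {
        // Assumption: a leading '{' serves as a crude format check in this model.
        if (!value.trim().startsWith("{")) {
            throw new IOException("Not in pre-analyzed format: " + value);
        }
        return Collections.emptyList();
    }

    /** Models the two steps of Field#tokenStream described in the issue. */
    static List<String> tokenStream(List<String> explicitStream, String storedValue)
            throws IOException {
        if (explicitStream != null) {
            return explicitStream; // step 1: the field already has a token stream
        }
        return analyzeAsPreAnalyzed(storedValue); // step 2: parse the stored value
    }

    /** Behavior before the fix: no stream is set when there are no tokens. */
    static List<String> indexWithoutFix(Parsed parse) throws IOException {
        List<String> stream = parse.hasTokenStream() ? parse.tokens : null;
        return tokenStream(stream, parse.str);
    }

    /** Behavior with the fix: an empty stream is set when there are no tokens. */
    static List<String> indexWithFix(Parsed parse) throws IOException {
        List<String> stream =
                parse.hasTokenStream() ? parse.tokens : Collections.<String>emptyList();
        return tokenStream(stream, parse.str);
    }

    public static void main(String[] args) throws IOException {
        Parsed noTokens = new Parsed("document two", Collections.<String>emptyList());
        System.out.println("with fix: " + indexWithFix(noTokens));
        try {
            indexWithoutFix(noTokens);
        } catch (IOException e) {
            System.out.println("without fix: " + e.getMessage());
        }
    }
}
```

In this model, a value with no tokens falls through to the second step only when no stream is set. Supplying an empty stream keeps the stored string from ever reaching the pre-analyzed parser, which is the idea behind the one-line change above.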



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
