Yuki Yano created SOLR-12518:
--------------------------------
Summary: PreAnalyzedField fails to index documents without tokens
Key: SOLR-12518
URL: https://issues.apache.org/jira/browse/SOLR-12518
Project: Solr
Issue Type: Bug
Security Level: Public (Default Security Level. Issues are Public)
Components: update
Reporter: Yuki Yano
Attachments: SOLR-12518.patch
h1. Overview
{{PreAnalyzedField}} fails to index documents whose pre-analyzed value contains no
tokens, such as the following:
{code:java}
{
"v": "1",
"str": "foo",
"tokens": []
}
{code}
h1. Details
{{PreAnalyzedField}} consumes field values that have been pre-analyzed in
advance. The format of a pre-analyzed value is as follows:
{code:java}
{
"v":"1",
"str":"test",
"tokens": [
{"t":"one","s":123,"e":128,"i":22,"p":"DQ4KDQsODg8=","y":"word"},
{"t":"two","s":5,"e":8,"i":1,"y":"word"},
{"t":"three","s":20,"e":22,"i":1,"y":"foobar"}
]
}
{code}
As [the
document|https://lucene.apache.org/solr/guide/7_3/working-with-external-files-and-processes.html#WorkingwithExternalFilesandProcesses]
mentions, {{"str"}} and {{"tokens"}} are optional, i.e., both an empty value
and a missing key are allowed. However, when {{"tokens"}} is empty or not defined,
{{PreAnalyzedField}} throws an IOException and fails to index the document.
This error is caused by the behavior of {{Field#tokenStream}}. This method
tries to create a {{TokenStream}} using the following steps (NOTE: assume
{{indexed=true}}):
* If the field has a {{tokenStream}} value, return it.
* Otherwise, create a {{tokenStream}} by parsing the stored value.
If the pre-analyzed value doesn't have tokens, the second step is executed.
Unfortunately, since {{PreAnalyzedField}} always returns a
{{PreAnalyzedAnalyzer}} as the index analyzer, and the stored value (i.e., the
value of {{"str"}}) is not in the pre-analyzed format, this step fails with a
pre-analyzed format error (i.e., an IOException).
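The fallback above can be illustrated with a simplified, self-contained sketch. The method and error message here are illustrative stand-ins, not the actual Lucene {{Field#tokenStream}} or Solr parser code:

```java
import java.io.IOException;

public class TokenStreamFallbackSketch {

    // Hypothetical stand-in for a parser that only accepts the
    // pre-analyzed JSON format ({"v":"1", ...}); a plain stored
    // string like "document two" is rejected.
    public static void parsePreAnalyzed(String value) throws IOException {
        if (!value.trim().startsWith("{")) {
            throw new IOException("not a pre-analyzed JSON value");
        }
    }

    // Simplified analog of Field#tokenStream for an indexed field:
    // step 1: use the token stream attached during parsing, if any;
    // step 2: otherwise re-analyze the stored value with the index
    //         analyzer, which for PreAnalyzedField is the
    //         pre-analyzed parser again.
    public static String tokenStream(boolean hasTokenStream, String storedValue) {
        if (hasTokenStream) {
            return "pre-analyzed stream";       // step 1
        }
        try {
            parsePreAnalyzed(storedValue);      // step 2 re-parses "str"
            return "re-analyzed stream";
        } catch (IOException e) {
            return "IOException: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Tokens present: step 1 succeeds, the stored value is never re-parsed.
        System.out.println(tokenStream(true, "document one"));
        // Tokens empty or absent: step 2 re-parses "document two",
        // which is not pre-analyzed JSON, so parsing fails.
        System.out.println(tokenStream(false, "document two"));
    }
}
```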
h1. How to reproduce
1. Download the latest Solr package and set up a Solr server according to the [Solr
Tutorial|http://lucene.apache.org/solr/guide/7_3/solr-tutorial.html].
2. Add the following fieldType and field to the schema.
{code:xml}
<fieldType name="preanalyzed-with-analyzer" class="solr.PreAnalyzedField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>
<field name="pre_with_analyzer" type="preanalyzed-with-analyzer"
indexed="true" stored="true" multiValued="false"/>
{code}
3. Index the following documents; Solr throws an IOException for documents 2 and 3.
{code:java}
// This is OK
{"id": 1, "pre_with_analyzer": "{\'v\':\'1\',\'str\':\'document
one\',\'tokens\':[{\'t\':\'one\'},{\'t\':\'two\'},{\'t\':\'three\',\'i\':100}]}"}
// Solr throws IOException
{"id": 2, "pre_with_analyzer": "{\'v\':\'1\',\'str\':\'document two\',
\'tokens\':[]}"}
// Solr throws IOException
{"id": 3, "pre_with_analyzer": "{\'v\':\'1\',\'str\':\'document three\'}"}
{code}
h1. How to fix
Since there is no need to re-analyze when {{"tokens"}} is empty or not set, we
can avoid this error by setting an {{EmptyTokenStream}} as the {{tokenStream}}
instead, like the following code:
{code:java}
parse.hasTokenStream() ? parse : new EmptyTokenStream()
{code}
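To show the effect of that one-liner, here is a simplified, self-contained sketch; {{ParseResult}} and {{chooseTokenStream}} are hypothetical stand-ins for the real parse result and {{PreAnalyzedField}} internals, not code from the attached patch:

```java
import java.util.Collections;
import java.util.List;

public class EmptyTokenStreamSketch {

    // Minimal stand-in for the result of parsing a pre-analyzed value.
    public static class ParseResult {
        final List<String> tokens;
        public ParseResult(List<String> tokens) { this.tokens = tokens; }
        public boolean hasTokenStream() { return !tokens.isEmpty(); }
    }

    // Mirrors the proposed fix: when the parse produced no tokens,
    // attach an empty stream so Field#tokenStream never falls back
    // to re-analyzing the stored "str" value.
    public static String chooseTokenStream(ParseResult parse) {
        return parse.hasTokenStream()
                ? "token stream from \"tokens\""
                : "EmptyTokenStream (emits no tokens, indexes nothing)";
    }

    public static void main(String[] args) {
        // {"tokens": [...]}: keep the pre-analyzed stream.
        System.out.println(chooseTokenStream(
                new ParseResult(List.of("one", "two", "three"))));
        // {"tokens": []} or no "tokens" key: empty stream, no IOException.
        System.out.println(chooseTokenStream(
                new ParseResult(Collections.emptyList())));
    }
}
```

With this change, documents 2 and 3 from the reproduction above index successfully, with nothing added to the index for the field beyond the stored value.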