Kingston Duffie created SOLR-6097:
-------------------------------------

             Summary: Posting JSON with < > results in lost information
                 Key: SOLR-6097
                 URL: https://issues.apache.org/jira/browse/SOLR-6097
             Project: Solr
          Issue Type: Bug
    Affects Versions: 4.7.2
            Reporter: Kingston Duffie


Post the following JSON to add a document:

{
    "add" :
        {
            "commitWithin" : 5000,
            "doc" :
                {
                    "id" : "12345",
                    "body" : "a < b > c"
                }
        }
}
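
For anyone trying to reproduce this, the post above could be sketched in Python roughly as follows. The endpoint URL (host, port, and core name) and the helper names build_add_request and send are assumptions for illustration and are not part of the report; only the payload itself comes from the issue.

```python
import json
import urllib.request

# Assumed endpoint: host, port, and core name are placeholders, not from the report.
SOLR_UPDATE_URL = "http://localhost:8983/solr/collection1/update"


def build_add_request(doc, commit_within_ms=5000, url=SOLR_UPDATE_URL):
    """Build (but do not send) a JSON 'add' request like the one in the report."""
    payload = {"add": {"commitWithin": commit_within_ms, "doc": doc}}
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request to a running Solr instance; returns (status, body text)."""
    with urllib.request.urlopen(request) as resp:
        return resp.status, resp.read().decode("utf-8")


# The document from the report; '<' and '>' are sent unescaped.
request = build_add_request({"id": "12345", "body": "a < b > c"})
print(request.data.decode("utf-8"))
```

Calling send(request) against a live instance would perform the actual post; it is kept separate here so that the payload can be inspected without a server.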

The body field is configured in the schema as:

   <field name="body" type="text_hive" indexed="true" stored="true"
          required="false" multiValued="false"/>

and

    <fieldType name="text_hive" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="1"
                catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"
                preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
                maxGramSize="15" side="front"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="1"
                catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"
                preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>


The problem is this: after submitting this post, if you go to the Solr admin
console and find this document, the stored body is missing everything between
the less-than and greater-than symbols, i.e., it reads "a c".

If you encode the body (i.e., "a &lt; b &gt; c"), it shows up with the < and >
symbols intact. That is, it appears that Solr is stripping out HTML tags even
though we are not asking it to.
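
As an illustration of the encoding workaround described above, the body could be entity-encoded on the client before posting. This is only a sketch of the workaround, using Python's standard html module, and is not something a client should have to do; the variable names are illustrative.

```python
import html

raw_body = "a < b > c"

# Encode '<', '>' and '&' as HTML entities; quote=False leaves quotation marks alone.
encoded_body = html.escape(raw_body, quote=False)
print(encoded_body)  # a &lt; b &gt; c

# Decoding the entities gives back the original text.
assert html.unescape(encoded_body) == raw_body
```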

Note that indexing is affected as well as storage: we originally found the
issue because searching for "b" would not match this document.
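
The search that exposed the problem could be checked with something like the following sketch. The select URL (host, port, and core name) and the helper names build_query_url and matched_ids are assumptions for illustration; the response layout used is the usual "response"/"docs" structure of Solr's JSON response writer.

```python
import urllib.parse

# Assumed endpoint: host, port, and core name are placeholders, not from the report.
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"


def build_query_url(field, term, base=SOLR_SELECT_URL):
    """Build a select URL that searches one field for one term, asking for JSON."""
    params = {"q": f"{field}:{term}", "wt": "json"}
    return base + "?" + urllib.parse.urlencode(params)


def matched_ids(response):
    """Extract document ids from an already parsed Solr JSON response."""
    return [doc["id"] for doc in response["response"]["docs"]]


url = build_query_url("body", "b")
print(url)
```

If the document had been indexed as posted, fetching that URL and passing the parsed JSON to matched_ids would be expected to return a list containing "12345"; with the behaviour described here it would not.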

I'm willing to believe that I'm doing something wrong, but I can't see anywhere 
in any spec that suggests that strings inside JSON need to be 
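
For reference, the JSON grammar itself (RFC 8259) only requires the quotation mark, the backslash, and control characters to be escaped inside strings, so "<" and ">" are ordinary characters. A small sketch using Python's standard json module shows the value surviving a serialize/parse round trip unchanged:

```python
import json

value = "a < b > c"

# Serialize to a JSON string literal; '<' and '>' are emitted as-is.
serialized = json.dumps(value)
print(serialized)  # "a < b > c"

# Parsing the literal gives back exactly the original value.
assert json.loads(serialized) == value
```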



--
This message was sent by Atlassian JIRA
(v6.2#6252)

