Kingston Duffie created SOLR-6097:
-------------------------------------
Summary: Posting JSON with < > results in lost information
Key: SOLR-6097
URL: https://issues.apache.org/jira/browse/SOLR-6097
Project: Solr
Issue Type: Bug
Affects Versions: 4.7.2
Reporter: Kingston Duffie
Post the following JSON to add a document:
{
  "add" :
  {
    "commitWithin" : 5000,
    "doc" :
    {
      "id" : "12345",
      "body" : "a < b > c"
    }
  }
}
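For anyone trying to reproduce this, here is a minimal Python sketch that builds the same update command and can post it. The endpoint URL, the core name "collection1", and the helper names build_add_payload and post_update are assumptions for illustration and are not part of the original report.

```python
import json
import urllib.request

# Hypothetical endpoint: adjust host, port, and core name for your install.
SOLR_UPDATE_URL = "http://localhost:8983/solr/collection1/update"


def build_add_payload(doc_id, body, commit_within=5000):
    """Return the JSON update command from the report as a string."""
    command = {
        "add": {
            "commitWithin": commit_within,
            "doc": {"id": doc_id, "body": body},
        }
    }
    return json.dumps(command)


def post_update(payload, url=SOLR_UPDATE_URL):
    """POST the payload to Solr and return the raw response body."""
    request = urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


payload = build_add_payload("12345", "a < b > c")
# The serialized command still carries the raw < and > characters.
print(payload)
```

Calling post_update(payload) against a running instance should send the document exactly as described in the report; the call is left out of the sketch so it can run without a server.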
The body field is configured in the schema as:
<field name="body" type="text_hive" indexed="true" stored="true"
       required="false" multiValued="false"/>
and
<fieldType name="text_hive" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
            maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
The problem is this: After submitting this post, if you go to the SOLR console
and find this document, the stored body will be missing the contents between
the less-than and greater-than symbols -- i.e., "a c".
If you encode the body (i.e., "a &lt; b &gt; c"), it will show up with < and >
symbols. That is, it appears that SOLR is stripping out HTML tags even though
we are not asking it to.
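As a rough illustration of the two forms discussed above, the following Python sketch shows that a JSON round trip leaves the raw string untouched, and how the entity-encoded variant can be produced on the client side. Using html.escape here is an assumption about how one might encode the body; the report does not say which tool was used.

```python
import html
import json

raw_body = "a < b > c"

# JSON serialization and parsing preserve < and > as ordinary characters.
round_tripped = json.loads(json.dumps(raw_body))
print(round_tripped == raw_body)  # True

# Entity-encode the body before posting (quote=False leaves quotes alone).
encoded_body = html.escape(raw_body, quote=False)
print(encoded_body)  # a &lt; b &gt; c
```

Encoding on the client changes what is sent to the server, so it is only a workaround and not a statement about what the correct server behavior should be.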
Note that indexing is affected as well as storage: we originally found the
issue because searching for "b" would not match this document.
I'm willing to believe that I'm doing something wrong, but I can't see anywhere
in any spec that suggests that strings inside JSON need to be