Kingston Duffie created SOLR-6097:
-------------------------------------

             Summary: Posting JSON with < > results in lost information
                 Key: SOLR-6097
                 URL: https://issues.apache.org/jira/browse/SOLR-6097
             Project: Solr
          Issue Type: Bug
    Affects Versions: 4.7.2
            Reporter: Kingston Duffie

Post the following JSON to add a document:

{
  "add" : {
    "commitWithin" : 5000,
    "doc" : {
      "id" : "12345",
      "body" : "a < b > c"
    }
  }
}

The body field is configured in the schema as:

<field name="body" type="text_hive" indexed="true" stored="true" required="false" multiValued="false"/>

and

<fieldType name="text_hive" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

The problem is this: after submitting this post, if you go to the Solr console and find this document, the stored body will be missing the contents between the less-than and greater-than symbols -- i.e., "a c". If you encode the body (i.e., "a &lt; b &gt; c"), it will show up with < and > symbols. That is, it appears that Solr is stripping out HTML tags even though we are not asking it to.

Note that it is not only the storage but also the indexing that is affected (we originally found the issue because searching for "b" would not match this document).

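For anyone who wants to reproduce this outside the console, the steps above can be sketched in Python. This is only an illustration, not the client code used for the report: the update URL (host, port, core name) and the helper names build_add_command, encode_command and post_command are assumptions. It also shows that a standard JSON encoder sends < and > as literal characters, so the text is intact when it leaves the client.

```python
import json
import urllib.request

# Hypothetical endpoint: host, port and core name are assumptions, not from the report.
SOLR_UPDATE_URL = "http://localhost:8983/solr/collection1/update"


def build_add_command(doc_id, body, commit_within_ms=5000):
    """Build the JSON update command shown in the report."""
    return {
        "add": {
            "commitWithin": commit_within_ms,
            "doc": {"id": doc_id, "body": body},
        }
    }


def encode_command(command):
    """Serialize the command to UTF-8 encoded JSON bytes."""
    return json.dumps(command).encode("utf-8")


def post_command(payload, url=SOLR_UPDATE_URL):
    """POST the payload to Solr. Needs a running server, so it is not called here."""
    request = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


payload = encode_command(build_add_command("12345", "a < b > c"))

# The angle brackets travel as literal characters inside the JSON string.
print(payload.decode("utf-8"))

# Decoding what was sent gives back the original text unchanged.
print(json.loads(payload)["add"]["doc"]["body"])
```

To compare against a live instance, call post_command(payload) and then fetch document 12345; if nothing on the server side strips markup, the stored body should equal the original string.
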
I'm willing to believe that I'm doing something wrong, but I can't see anywhere in any spec that suggests that strings inside JSON need to be