[jira] [Commented] (SOLR-6097) Posting JSON with < > results in lost information

Stefan Matheis (steffkes) (JIRA) Tue, 20 May 2014 12:10:47 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-6097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003845#comment-14003845
 ]


Stefan Matheis (steffkes) commented on SOLR-6097:
-------------------------------------------------

You didn't link those issues, but since I saw SOLR-6098 after that one and 
already commented on it, I guess they are related?

> Posting JSON with < > results in lost information
> -------------------------------------------------
>
>                 Key: SOLR-6097
>                 URL: https://issues.apache.org/jira/browse/SOLR-6097
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 4.7.2
>            Reporter: Kingston Duffie
>
> Post the following JSON to add a document:
> {
>     "add" :
>         {
>             "commitWithin" : 5000,
>             "doc" :
>                 {
>                     "id" : "12345",
>                     "body" : "a < b > c"
>                 }
>         }
> }
> The body field is configured in the schema as:
>     <field name="body" type="text_hive" indexed="true" stored="true"
>            required="false" multiValued="false"/>
> and
>     <fieldType name="text_hive" class="solr.TextField"
>                positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.WordDelimiterFilterFactory"
>                 generateWordParts="1" generateNumberParts="1" catenateWords="1"
>                 catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"
>                 preserveOriginal="1"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
>                 maxGramSize="15" side="front"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.WordDelimiterFilterFactory"
>                 generateWordParts="1" generateNumberParts="1" catenateWords="1"
>                 catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"
>                 preserveOriginal="1"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>     </fieldType>
> The problem is this:  After submitting this post, if you go to the SOLR 
> console and find this document, the stored body will be missing the contents 
> between the less-than and greater-than symbols -- i.e., "a c".  
> If you encode the body (i.e., "a &lt; b &gt; c"), it will show up with the
> < and > symbols.  That is, it appears that SOLR is stripping out HTML tags
> even though we are not asking it to.
> Note that not only storage but also indexing is affected (we originally 
> found the issue because searching for "b" would not match this 
> document).
> I'm willing to believe that I'm doing something wrong, but I can't see 
> anywhere in any spec that suggests that strings inside JSON need to be 
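
For anyone trying to reproduce this, here is a rough Python sketch of the request described in the issue. It only builds the payload and the request object and does not send anything. The endpoint URL (host, port, core name) and the helper names are assumptions for illustration and do not come from the report. It also shows that plain JSON serialization leaves "<" and ">" untouched, and what the entity-encoded variant of the body from the report looks like.

```python
import html
import json
from urllib.request import Request

# Hypothetical endpoint: host, port, and core name are assumptions,
# not taken from the issue report.
SOLR_UPDATE_URL = "http://localhost:8983/solr/collection1/update"


def build_add_payload(doc_id, body, commit_within_ms=5000):
    """Serialize an "add" update message shaped like the one in the report."""
    message = {
        "add": {
            "commitWithin": commit_within_ms,
            "doc": {"id": doc_id, "body": body},
        }
    }
    return json.dumps(message)


def build_update_request(payload, url=SOLR_UPDATE_URL):
    """Build (but do not send) a JSON POST request for the update handler."""
    return Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


raw_body = "a < b > c"
payload = build_add_payload("12345", raw_body)

# JSON serialization does not alter "<" or ">", so the value reaches
# the wire exactly as typed.
round_tripped = json.loads(payload)["add"]["doc"]["body"]

# The entity-encoded variant mentioned in the report.
escaped_body = html.escape(raw_body, quote=False)

request = build_update_request(payload)

print(round_tripped)
print(escaped_body)
```

Passing the request to urllib.request.urlopen against a real instance would be the next step. If the stored value still comes back without the "< b >" part, the loss is happening somewhere after JSON parsing, not in the client-side serialization.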



