XmlCharFilter
-------------

                 Key: SOLR-2597
                 URL: https://issues.apache.org/jira/browse/SOLR-2597
             Project: Solr
          Issue Type: Improvement
          Components: Schema and Analysis
    Affects Versions: 4.0
            Reporter: Mike Sokolov


This CharFilter processes incoming XML using the Woodstox parser, stripping all 
non-text content and recording offsets, just as HTMLStripCharFilter does, but 
respecting XML conventions such as entities defined in a DTD.  XmlCharFilter 
can also exclude (or include) the content of specific named elements.
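
To illustrate the stripping and element exclusion described above, here is a
minimal sketch using only the standard StAX API (javax.xml.stream).  It is not
the code from the patch: the class name XmlTextSketch, the method extractText
and the "excluded" parameter are hypothetical, and the sketch does no offset
tracking.  It runs against whichever StAX implementation is on the classpath,
which is not necessarily Woodstox.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration only; not the XmlCharFilter from the patch.
final class XmlTextSketch {

    // Returns the text content of xml, skipping everything inside elements
    // whose local name appears in excluded.
    static String extractText(String xml, Set<String> excluded) {
        StringBuilder text = new StringBuilder();
        int excludedDepth = 0; // > 0 while inside an excluded element
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        // Count nesting so children of an excluded element stay excluded.
                        if (excludedDepth > 0 || excluded.contains(reader.getLocalName())) {
                            excludedDepth++;
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        if (excludedDepth > 0) {
                            excludedDepth--;
                        }
                        break;
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.CDATA:
                    case XMLStreamConstants.SPACE:
                        if (excludedDepth == 0) {
                            text.append(reader.getText());
                        }
                        break;
                    default:
                        break; // comments, processing instructions, etc. are dropped
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String xml = "<doc><title>Hi</title><note>skip</note> a &lt; b</doc>";
        System.out.println(extractText(xml, Set.of("note"))); // prints "Hi a < b"
    }
}
```

The parser resolves "&lt;" to "<" before the text reaches the caller, which is
why a filter that must report offsets into the original input needs more
information from the parser than the text alone.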

In order to compute character offsets properly when mixed line termination 
styles are present (\r, \r\n), or when XML character entities (&lt;, &quot;, 
&amp;) are present, we require a newer version of Woodstox (4.1.1) than is 
currently in solr/lib.  Earlier versions of the parser could not report 
these entity events, so we couldn't tell the difference between "<" and "&lt;" 
in the input, and the offsets could be wrong.  The upgraded version is included 
in the patch.
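
To make the offset problem concrete, here is a small self-contained sketch of
the bookkeeping involved.  It is an assumption-laden model, not the patch: it
uses plain string matching in place of Woodstox, and the names OffsetSketch,
text() and correctOffset() are hypothetical.  Each time a multi-character
source sequence ("&lt;", "&quot;", "&amp;" or "\r\n") collapses to a single
output character, the sketch records how many source characters have been
dropped so far, so that an offset in the stripped text can be mapped back to
the original input.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of offset correction; not the XmlCharFilter from the patch.
final class OffsetSketch {

    // Source sequences that collapse to a single output character.
    private static final Map<String, Character> REPLACEMENTS = Map.of(
            "&lt;", '<',
            "&quot;", '"',
            "&amp;", '&',
            "\r\n", '\n');

    private final StringBuilder text = new StringBuilder();

    // Output offset -> cumulative count of source characters dropped before it.
    private final TreeMap<Integer, Integer> corrections = new TreeMap<>();

    OffsetSketch(String source) {
        int dropped = 0;
        int i = 0;
        while (i < source.length()) {
            Map.Entry<String, Character> match = null;
            for (Map.Entry<String, Character> e : REPLACEMENTS.entrySet()) {
                if (source.startsWith(e.getKey(), i)) {
                    match = e;
                    break;
                }
            }
            if (match == null) {
                text.append(source.charAt(i));
                i++;
            } else {
                text.append(match.getValue());
                i += match.getKey().length();
                dropped += match.getKey().length() - 1;
                // Every output offset from here on is shifted by 'dropped'.
                corrections.put(text.length(), dropped);
            }
        }
    }

    String text() {
        return text.toString();
    }

    // Maps an offset in the output text back to an offset in the source.
    int correctOffset(int outputOffset) {
        Map.Entry<Integer, Integer> e = corrections.floorEntry(outputOffset);
        return outputOffset + (e == null ? 0 : e.getValue());
    }

    public static void main(String[] args) {
        OffsetSketch s = new OffsetSketch("a &lt; b");
        System.out.println(s.text());           // prints "a < b"
        System.out.println(s.correctOffset(4)); // prints 7, the index of 'b' in the source
    }
}
```

If the parser hands back "<" without saying whether the input contained "<" or
"&lt;", the correction for that position cannot be recorded, and every later
offset is off by three.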
