Re: [jira] [Created] (SOLR-2597) XmlCharFilter

Koji Sekiguchi Wed, 15 Jun 2011 06:29:55 -0700

Did you mean Xml*Strip*CharFilter?

koji
--
http://www.rondhuit.com/en/


(11/06/15 22:12), Mike Sokolov (JIRA) wrote:

XmlCharFilter
-------------

                  Key: SOLR-2597
                  URL: https://issues.apache.org/jira/browse/SOLR-2597
              Project: Solr
           Issue Type: Improvement
           Components: Schema and Analysis
     Affects Versions: 4.0
             Reporter: Mike Sokolov


This CharFilter processes incoming XML using the Woodstox parser, stripping all 
non-text content and remembering offsets, just like HTMLCharFilter, but 
respecting XML conventions like XML entities defined in a DTD.  XmlCharFilter 
also provides the ability to exclude (and include) the content of certain named 
elements.
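
As a rough illustration of the stripping and element exclusion described above, 
here is a minimal sketch written against the standard StAX API 
(javax.xml.stream), which Woodstox implements.  It is not the code from the 
patch: the class name XmlTextSketch and the method extractText are hypothetical, 
and the sketch does no offset tracking at all.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, for illustration only (not the XmlCharFilter patch).
public class XmlTextSketch {

    /** Returns the text content of xml, skipping everything inside excluded elements. */
    public static String extractText(String xml, Set<String> excluded) {
        StringBuilder out = new StringBuilder();
        int excludedDepth = 0; // > 0 while the parser is inside an excluded element
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        // Entering an excluded element, or nesting deeper inside one.
                        if (excludedDepth > 0 || excluded.contains(reader.getLocalName())) {
                            excludedDepth++;
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        if (excludedDepth > 0) {
                            excludedDepth--;
                        }
                        break;
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.CDATA:
                        // Keep text only when no excluded ancestor is open.
                        if (excludedDepth == 0) {
                            out.append(reader.getText());
                        }
                        break;
                    default:
                        break; // comments, processing instructions, etc. are dropped
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("malformed XML", e);
        }
        return out.toString();
    }
}
```

An include list could be handled the same way by inverting the depth test, 
appending text only while an included element is open.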

In order to compute character offsets properly when mixed line termination 
styles are present (\r, \r\n), or when XML character entities (&lt;, &quot;, 
&amp;) are present, we require a newer version of Woodstox (4.1.1) than is 
currently in solr/lib.  Earlier versions of the parser could not report these 
entity events, so we couldn't tell the difference between "<" and "&lt;", and 
the offsets could be wrong.  The upgraded version is in the patch.
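
To make the offset problem concrete, here is a small self-contained sketch that 
decodes the five predefined XML entities and normalizes \r and \r\n to \n, 
recording for each output character the offset where it started in the raw 
input.  It does not use Woodstox and is not how the patch works; the names 
EntityOffsetSketch, Decoded and decode are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of why entity and line-ending events matter for offsets.
public class EntityOffsetSketch {

    // The five entities predefined by the XML specification.
    private static final String[][] ENTITIES = {
        {"&lt;", "<"}, {"&gt;", ">"}, {"&amp;", "&"}, {"&quot;", "\""}, {"&apos;", "'"}
    };

    /** Decoded text plus, for each decoded character, its start offset in the raw input. */
    public static final class Decoded {
        public final String text;
        public final int[] rawOffsets;

        Decoded(String text, int[] rawOffsets) {
            this.text = text;
            this.rawOffsets = rawOffsets;
        }
    }

    public static Decoded decode(String raw) {
        StringBuilder text = new StringBuilder();
        List<Integer> offsets = new ArrayList<>();
        int i = 0;
        while (i < raw.length()) {
            int consumed = 1;
            String emitted = String.valueOf(raw.charAt(i));
            if (raw.startsWith("\r\n", i)) {
                consumed = 2;      // two raw characters become one "\n"
                emitted = "\n";
            } else if (raw.charAt(i) == '\r') {
                emitted = "\n";    // a lone "\r" is also normalized to "\n"
            } else {
                for (String[] entity : ENTITIES) {
                    if (raw.startsWith(entity[0], i)) {
                        consumed = entity[0].length(); // e.g. 4 raw characters for "&lt;"
                        emitted = entity[1];
                        break;
                    }
                }
            }
            offsets.add(i);        // the decoded character starts at raw offset i
            text.append(emitted);
            i += consumed;
        }
        int[] rawOffsets = new int[offsets.size()];
        for (int k = 0; k < rawOffsets.length; k++) {
            rawOffsets[k] = offsets.get(k);
        }
        return new Decoded(text.toString(), rawOffsets);
    }
}
```

For the raw input "a &lt; b\r\nc" the decoded text is seven characters long, 
but the final "c" sits at raw offset 10.  A filter that cannot see where "&lt;" 
or "\r\n" was collapsed has no way to recover that difference.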

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



