[ https://issues.apache.org/jira/browse/SOLR-2597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050189#comment-13050189 ]
Robert Muir commented on SOLR-2597:
-----------------------------------

Just one comment: taking a look at the patch, it currently won't compile because the analysis module has no dependencies and thus no woodstox or whatever. (But thanks for trying to integrate it here!!!)

One step would be, rather than have this thing static, can we just have the ctors take a general XMLInputFactory instead, e.g.

{noformat}
public XmlCharFilter (CharStream reader, XMLInputFactory inputFactory) {
{noformat}

The corresponding Solr CharFilterFactory could then configure it with all the woodstox-specific parameters.

But this still wouldn't solve the issue that all of Lucene and modules are on Java 5 (and it looks like this uses Java 6-specific APIs).

I don't think it makes sense to block the patch for these issues, so one workaround would be to just add it to Solr only. If/when we ever move to Java 6 in Lucene, we could then move it into the analysis module.

Another option would be if the XML policeman knows some workaround (sorry, not my thing).

> XmlCharFilter
> -------------
>
>                 Key: SOLR-2597
>                 URL: https://issues.apache.org/jira/browse/SOLR-2597
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 4.0
>            Reporter: Mike Sokolov
>         Attachments: SOLR-2597.patch
>
>
> This CharFilter processes incoming XML using the Woodstox parser, stripping
> all non-text content and remembering offsets, just like HTMLCharFilter, but
> respecting XML conventions like XML entities defined in a DTD. XmlCharFilter
> also provides the ability to exclude (and include) the content of certain
> named elements.
> In order to compute character offsets properly when mixed line termination
> styles are present (\r, \r\n), or when XML character entities (&lt;, &quot;,
> &amp;) are present, we require a newer version of Woodstox (4.1.1) than is
> currently in solr/lib.
> The earlier versions of the parser could not report these entity events, so
> we couldn't tell the difference between "&lt;" and "<" and the offsets could
> be wrong. The upgraded version is in the patch.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
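
For illustration only, here is a rough sketch of the constructor shape suggested in the comment: the filter receives a general XMLInputFactory from its caller, and a separate factory method does all the parser configuration. This is not the code from SOLR-2597.patch. The names XmlTextSketch and configuredFactory are hypothetical, java.io.Reader stands in for Lucene's CharStream so the example needs nothing beyond the JDK, and no offset correction is attempted. It uses whatever StAX implementation the JDK provides rather than Woodstox, and javax.xml.stream is bundled with the JDK only from Java 6 onward.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical stand-in for XmlCharFilter; the caller supplies the factory. */
class XmlTextSketch {
    private final Reader input;
    private final XMLInputFactory inputFactory;

    // Same shape as the proposed ctor, with java.io.Reader as an
    // assumed replacement for Lucene's CharStream.
    XmlTextSketch(Reader input, XMLInputFactory inputFactory) {
        this.input = input;
        this.inputFactory = inputFactory;
    }

    /** Returns the character data with all markup stripped (no offsets tracked). */
    String text() {
        StringBuilder out = new StringBuilder();
        try {
            XMLStreamReader xml = inputFactory.createXMLStreamReader(input);
            try {
                while (xml.hasNext()) {
                    int event = xml.next();
                    if (event == XMLStreamConstants.CHARACTERS
                            || event == XMLStreamConstants.CDATA) {
                        out.append(xml.getText());
                    }
                }
            } finally {
                xml.close();
            }
        } catch (XMLStreamException e) {
            // Rethrown unchecked to keep the sketch's API small.
            throw new IllegalStateException("malformed XML input", e);
        }
        return out.toString();
    }

    /** Plays the role of the Solr-side factory: parser settings live here only. */
    static XMLInputFactory configuredFactory() {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Standard StAX property; implementation-specific settings would go here too.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        return factory;
    }

    public static void main(String[] args) {
        Reader in = new StringReader("<doc><title>a &lt; b</title><p>ok</p></doc>");
        System.out.println(new XmlTextSketch(in, configuredFactory()).text());
    }
}
```

With this split, the filter class depends only on the javax.xml.stream interfaces, and any parser-specific setup stays with whoever builds the factory.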