[jira] [Commented] (SOLR-2597) XmlCharFilter

Robert Muir (JIRA) Wed, 15 Jun 2011 19:36:57 -0700

    [ https://issues.apache.org/jira/browse/SOLR-2597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050189#comment-13050189 ]


Robert Muir commented on SOLR-2597:
-----------------------------------

Just one comment: looking at the patch, it currently won't compile, because 
the analysis module has no external dependencies, so Woodstox is not 
available to it.
(But thanks for trying to integrate it here!)

One step would be: rather than making this static, could the constructors 
take a general XMLInputFactory instead? E.g.
{noformat}
public XmlCharFilter(CharStream reader, XMLInputFactory inputFactory) {
{noformat}

The corresponding Solr CharFilterFactory could then configure it with all the 
Woodstox-specific parameters.
But this still wouldn't solve the issue that all of Lucene and its modules are 
on Java 5 (and it looks like this uses Java 6-specific APIs).
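
A minimal sketch of that constructor shape, for illustration only. It is not 
the code from the patch: XmlTextSketch and text() are hypothetical names, a 
plain java.io.Reader stands in for Lucene's CharStream, and offset tracking 
is left out entirely. It uses only the JDK's javax.xml.stream (StAX) package, 
which ships with the JDK from Java 6 onward, so it does not get around the 
Java 5 constraint mentioned above.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical stand-in for XmlCharFilter; not the class from the patch. */
class XmlTextSketch {
    private final Reader input;
    private final XMLInputFactory inputFactory;

    // The factory is injected instead of being held in a static field, so the
    // caller (for example a Solr CharFilterFactory) decides which StAX
    // implementation to use and how to configure it.
    XmlTextSketch(Reader input, XMLInputFactory inputFactory) {
        this.input = input;
        this.inputFactory = inputFactory;
    }

    /** Returns the concatenated character data with all markup dropped. */
    String text() {
        StringBuilder sb = new StringBuilder();
        try {
            XMLStreamReader xml = inputFactory.createXMLStreamReader(input);
            while (xml.hasNext()) {
                int event = xml.next();
                if (event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA) {
                    sb.append(xml.getText());
                }
            }
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not parse XML input", e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Caller-side configuration. Any StAX factory can be passed in here;
        // implementation-specific properties would be set at this point too.
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);

        String xml = "<doc><t>a &lt; b</t><p>c</p></doc>";
        System.out.println(new XmlTextSketch(new StringReader(xml), factory).text());
    }
}
```

With this shape the filter itself depends only on the XMLInputFactory 
interface, and the choice of parser stays with whoever constructs it.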

I don't think it makes sense to block the patch on these issues, so one 
workaround would be to add it to Solr only.
If/when Lucene moves to Java 6, we could then move it into the analysis 
module.
Another option would be if the XML policeman knows some workaround (sorry, not 
my thing).


> XmlCharFilter
> -------------
>
>                 Key: SOLR-2597
>                 URL: https://issues.apache.org/jira/browse/SOLR-2597
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 4.0
>            Reporter: Mike Sokolov
>         Attachments: SOLR-2597.patch
>
>
> This CharFilter processes incoming XML using the Woodstox parser, stripping 
> all non-text content and remembering offsets, just like HTMLCharFilter, but 
> respecting XML conventions like XML entities defined in a DTD.  XmlCharFilter 
> also provides the ability to exclude (and include) the content of certain 
> named elements.
> In order to compute character offsets properly when mixed line termination 
> styles are present (\r, \r\n), or when XML character entities (&lt;, &quot;, 
> &amp;) are present, we require a newer version of Woodstox (4.1.1) than is 
> currently in solr/lib.  The earlier versions of the parser could not report 
> these entity events, so we couldn't tell the difference between "<" and 
> "&lt;" and the offsets could be wrong.  The upgraded version is in the patch.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
