[jira] [Commented] (SOLR-2597) XmlCharFilter

Hoss Man (JIRA) Wed, 15 Jun 2011 19:12:55 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-2597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050184#comment-13050184
 ]


Hoss Man commented on SOLR-2597:
--------------------------------

Mike: thanks for the patch!

as Koji mentioned on the mailing list, might want to consider naming this 
XmlStripCharFilter ... that was my first opinion, but reading the docs the 
"include" and "exclude" options definitely make it a bit more generic, so i'm 
leaning towards the opinion that XmlCharFilter is better.

(there's an argument to be made that we should have an XmlStripCharFilter that 
only removes pi/comments/whitespace and resolves entities, and then a distinct 
XmlTagCharFilter that does the include/exclude -- but i'm guessing that would 
be less efficient since this makes it possible to do in one pass, and anyone 
who wants include/exclude at the "tag" level is almost certainly going to want 
the stripping/entities as well)

skimming the patch i'm +1 except for the "new Random" in the test case ... if 
you take a look at the existing test cases you'll see how you can hook into the 
solr test framework to get random values that are consistent with a global seed 
-- that way if a test fails, it will report which seed was used and people can 
reproduce it using system properties.
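
to illustrate the idea behind a global seed, here is a minimal standalone sketch. it is not the solr test framework and does not use its API: the class name, the "sketch.tests.seed" property and both helper methods are made up for this example. the real test should get its randomness from the framework, as the existing test cases do, rather than copy this.

```java
import java.util.Random;

/** Sketch of seed-reproducible randomness; NOT the Solr test framework itself. */
public class SeededRandomSketch {
    // Hypothetical property name, used only for this illustration.
    public static final String SEED_PROPERTY = "sketch.tests.seed";

    /** Returns the seed from the system property if set, otherwise a fresh one. */
    public static long resolveSeed() {
        String configured = System.getProperty(SEED_PROPERTY);
        return configured != null ? Long.parseLong(configured) : System.nanoTime();
    }

    /** Builds random lowercase test input from a known seed, so a failure can be replayed. */
    public static String randomText(long seed, int length) {
        Random random = new Random(seed);
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append((char) ('a' + random.nextInt(26)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        long seed = resolveSeed();
        // Report the seed up front so a failing run can be repeated with
        // -Dsketch.tests.seed=<seed> on the command line.
        System.out.println("seed=" + seed);
        System.out.println(randomText(seed, 16));
    }
}
```

the point is that the same seed always yields the same input, so whoever sees a failure can rerun it exactly. a bare "new Random()" in the test throws that information away.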

would also be nice to have a test of the Factory (using a schema.xml 
declaration) but that's not nearly as important.

and of course: would be great if "the xml policeman" uwe could review.

bq. I tried to include the upgraded Woodstox jars, but I don't think I figured 
how to put binaries in the patch actually.

it's not possible, so don't worry about it.  the important thing is noting in a 
comment (like you did) exactly what new/upgraded jars are needed.


> XmlCharFilter
> -------------
>
>                 Key: SOLR-2597
>                 URL: https://issues.apache.org/jira/browse/SOLR-2597
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 4.0
>            Reporter: Mike Sokolov
>         Attachments: SOLR-2597.patch
>
>
> This CharFilter processes incoming XML using the Woodstox parser, stripping 
> all non-text content and remembering offsets, just like HTMLStripCharFilter, but 
> respecting XML conventions like XML entities defined in a DTD.  XmlCharFilter 
> also provides the ability to exclude (and include) the content of certain 
> named elements.
> In order to compute character offsets properly when mixed line termination 
> styles are present (\r, \r\n), or when XML character entities (&lt;, &quot;, 
> &amp;) are present, we require a newer version of Woodstox (4.1.1) than is 
> currently in solr/lib.  The earlier versions of the parser could not report 
> these entity events, so we couldn't tell the difference between "<" and 
> "&lt;" and the offsets could be wrong.  The upgraded version is in the patch.
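
To make the offset problem in the description concrete, here is a minimal standalone sketch of offset correction when entities are resolved. It is not the code from the patch and does not use Woodstox or the Lucene CharFilter classes: the class name, its methods, and the hard-coded handling of three entities are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of offset correction when entities shrink the text; NOT the patch's code. */
public class EntityOffsetSketch {
    private final StringBuilder output = new StringBuilder();
    // Each entry: {output offset where the correction starts, cumulative characters removed}
    private final List<int[]> corrections = new ArrayList<>();

    public EntityOffsetSketch(String input) {
        int removed = 0;
        int i = 0;
        while (i < input.length()) {
            String replacement = null;
            int consumed = 0;
            if (input.startsWith("&lt;", i)) { replacement = "<"; consumed = 4; }
            else if (input.startsWith("&amp;", i)) { replacement = "&"; consumed = 5; }
            else if (input.startsWith("&quot;", i)) { replacement = "\""; consumed = 6; }
            if (replacement != null) {
                output.append(replacement);
                removed += consumed - replacement.length();
                // Everything after the replacement sits `removed` characters later in the input.
                corrections.add(new int[] {output.length(), removed});
                i += consumed;
            } else {
                output.append(input.charAt(i));
                i++;
            }
        }
    }

    /** The text with the three known entities resolved. */
    public String text() {
        return output.toString();
    }

    /** Maps an offset in the resolved text back to an offset in the original input. */
    public int correctOffset(int outputOffset) {
        int delta = 0;
        for (int[] c : corrections) {
            if (c[0] <= outputOffset) {
                delta = c[1];
            } else {
                break;
            }
        }
        return outputOffset + delta;
    }
}
```

For the input `a &lt; b &amp; c`, `text()` returns `a < b & c`, and `correctOffset(4)` (the position of `b` in the resolved text) returns 7, its position in the original. If the parser reported a literal `<` and `&lt;` the same way, the recorded correction would be missing and every later offset would be off by three.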

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
