[
https://issues.apache.org/jira/browse/LUCENE-5734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019355#comment-14019355
]
Steve Rowe commented on LUCENE-5734:
------------------------------------
bq. FYI it behaves as I expect if after hello is an XML entity such as in this
example: {{hello&nbsp;}}
I should point out that HTMLStripCharFilter only accepts the named character
entities defined in [the HTML 4.0
spec|http://www.w3.org/TR/REC-html40/sgml/entities.html], which happens to
include the predefined XML entities ({{&gt;}}, {{&lt;}},
{{&quot;}}, {{&apos;}}, and {{&amp;}}). {{&nbsp;}} is
specifically *not* an XML entity, at least not as understood by
HTMLStripCharFilter. I mention this only to point out that HTMLStripCharFilter
doesn't parse XML for entity declarations, and will not honor them if they
appear.
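
To illustrate the distinction, here is a minimal sketch of decoding against a fixed entity table. It is a toy model, not HTMLStripCharFilter's actual code: the class name {{EntityTableSketch}}, the {{decode}} method, and the six-entry table are all hypothetical, and leaving unknown references untouched is an assumption of the sketch. The point is only that a name resolves because it is in the built-in table, never because the document declares it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntityTableSketch {
    // Hypothetical fixed table: a tiny subset standing in for the full built-in list.
    static final Map<String, String> NAMED = new HashMap<String, String>();
    static {
        NAMED.put("nbsp", "\u00A0");
        NAMED.put("gt", ">");
        NAMED.put("lt", "<");
        NAMED.put("quot", "\"");
        NAMED.put("apos", "'");
        NAMED.put("amp", "&");
    }

    // Matches named references such as &nbsp; (numeric references are out of scope here).
    static final Pattern REF = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    // Replaces references found in the table; anything else is copied through unchanged.
    // Entity declarations in the input are never read, so they cannot extend the table.
    static String decode(String input) {
        Matcher m = REF.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String known = NAMED.get(m.group(1));
            String replacement = (known != null) ? known : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // In the table, so it is decoded to U+00A0.
        System.out.println(decode("hello&nbsp;"));
        // Declared in the DOCTYPE but not in the table, so &foo; is left as-is.
        System.out.println(decode("<!DOCTYPE d [<!ENTITY foo \"bar\">]><d>&foo;</d>"));
    }
}
```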
> HTMLStripCharFilter end offset should be left of closing tags
> -------------------------------------------------------------
>
> Key: LUCENE-5734
> URL: https://issues.apache.org/jira/browse/LUCENE-5734
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/analysis
> Reporter: David Smiley
> Priority: Minor
>
> Consider this simple input:
> {noformat}
> <em>hello</em>
> {noformat}
> to be analyzed by HTMLStripCharFilter and WhitespaceTokenizer.
> You get back one token for "hello". Good. The start offset of this token is
> at the position of 'h' -- good. But the end offset, surprisingly, is one past
> the adjacent </em>. I argue that it should be one past the last character of
> the token (immediately following 'o').
> FYI it behaves as I expect if after hello is an XML entity such as in this
> example: {noformat}hello&nbsp;{noformat} The end offset immediately follows
> the 'o'.
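
For concreteness, the two candidate end offsets in the {{<em>hello</em>}} example can be worked out with a toy model. This is only a sketch under stated assumptions: {{OffsetSketch}} and its methods are hypothetical names, the tag stripping is deliberately naive, and none of it reflects how HTMLStripCharFilter computes offset corrections internally. With the input indexed from 0, 'h' is at 4, the position right after 'o' is 9, and the position after {{</em>}} is 14.

```java
import java.util.ArrayList;
import java.util.List;

class OffsetSketch {
    // Toy tag stripper (hypothetical, not Lucene's implementation): returns, for each
    // character kept in the stripped output, its index in the original input.
    static int[] keptIndexes(String input) {
        List<Integer> kept = new ArrayList<Integer>();
        boolean inTag = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                kept.add(i);
            }
        }
        int[] result = new int[kept.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = kept.get(i);
        }
        return result;
    }

    // End offset placed immediately after the token's last character.
    // tokenEnd is the exclusive end of the token in the stripped text.
    static int endOffsetLeft(int[] kept, int tokenEnd) {
        return kept[tokenEnd - 1] + 1;
    }

    // End offset placed after any markup that was stripped right behind the token.
    static int endOffsetRight(int[] kept, int tokenEnd, int inputLength) {
        return tokenEnd < kept.length ? kept[tokenEnd] : inputLength;
    }

    public static void main(String[] args) {
        String input = "<em>hello</em>";
        int[] kept = keptIndexes(input);
        System.out.println("start offset:            " + kept[0]);
        System.out.println("end, left of </em>:      " + endOffsetLeft(kept, 5));
        System.out.println("end, right of </em>:     " + endOffsetRight(kept, 5, input.length()));
    }
}
```

In this model the "left" variant corresponds to the behavior the issue asks for, and the "right" variant to the behavior it reports.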