[ https://issues.apache.org/jira/browse/LUCENE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shai Erera closed LUCENE-259.
-----------------------------
Resolution: Won't Fix
Closing after very long inactivity. The HtmlParser in the demo has many
problems in general, and I don't think we intend to have a fully working
HtmlParser in our code; it was intended for demo purposes only.
> HTML Parser doesn't decode character references in attributes
> -------------------------------------------------------------
>
> Key: LUCENE-259
> URL: https://issues.apache.org/jira/browse/LUCENE-259
> Project: Lucene - Java
> Issue Type: Bug
> Components: Examples
> Affects Versions: 1.4
> Environment: Operating System: All
> Platform: All
> Reporter: Dave Sparks
> Priority: Minor
>
> The HTML Parser includes the values of certain attributes in the summary, the
> metaTags and the output stream. Character references in the attribute values
> are not decoded. Specifically:
> 1. The value of the alt= attribute of an <img ...> tag is included in the
> summary and the output stream. This value is case-significant, and may
> include character references. The character references are not decoded.
> 2. The value of the content= attribute of a <meta ...> tag is included in the
> metaTags if the tag also has a name= or http-equiv= attribute. This value is
> case-significant, and may include character references. The character
> references are not decoded, and the value is downcased (since the fix to bug
> #27423).
> I've patched our version of the parser to decode the character references, by
> adding a decodeAll method to Entities to parse a String for character
> references and return a String where the references have been replaced by the
> corresponding characters (or the original String, if no change is needed).
> This method is called to decode alt= attributes and content= attributes. I've
> removed the .toLowerCase() on the content= value. I'm not really happy with
> this fix, as it seems to me to be wrong to parse a value which was previously
> parsed as a single token; there ought to be a way to get it right the first
> time.
> I've left the name= and http-equiv= values alone. It's not entirely clear (to
> me) whether character references are allowed, and it would be perverse to use
> them here. I also appreciate the convenience of having a single combined
> namespace, with downcased names, even though this is technically wrong.
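
The reporter's patch is not attached to this message, so the following is only
a minimal sketch of what a decodeAll method along the lines described above
might look like. The class name EntityDecoder, the helper decodeOne, and the
small named-entity table are assumptions made for illustration; they are not
the demo's actual Entities class. The sketch decodes numeric and a few named
character references and returns the original String instance when nothing
needs to change.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the demo's Entities class; not Lucene code.
public class EntityDecoder {
    // Deliberately tiny table for illustration; HTML defines many more names.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("nbsp", "\u00A0");
        NAMED.put("eacute", "\u00E9");
    }

    /** Returns s with character references decoded, or s itself if unchanged. */
    public static String decodeAll(String s) {
        int amp = s.indexOf('&');
        if (amp < 0) {
            return s; // fast path: nothing that could be a reference
        }
        StringBuilder out = new StringBuilder(s.length());
        int pos = 0;           // start of the text not yet copied to out
        boolean changed = false;
        while (amp >= 0) {
            int semi = s.indexOf(';', amp + 1);
            String decoded = (semi < 0) ? null : decodeOne(s.substring(amp + 1, semi));
            if (decoded == null) {
                // Not a recognizable reference: keep the '&' literally.
                out.append(s, pos, amp + 1);
                pos = amp + 1;
            } else {
                out.append(s, pos, amp).append(decoded);
                pos = semi + 1;
                changed = true;
            }
            amp = s.indexOf('&', pos);
        }
        if (!changed) {
            return s; // preserve the original instance, as the report describes
        }
        return out.append(s, pos, s.length()).toString();
    }

    // Decodes the text between '&' and ';', or returns null if it is unknown.
    private static String decodeOne(String body) {
        if (body.startsWith("#")) {
            try {
                boolean hex = body.length() > 1
                        && (body.charAt(1) == 'x' || body.charAt(1) == 'X');
                int cp = hex ? Integer.parseInt(body.substring(2), 16)
                             : Integer.parseInt(body.substring(1), 10);
                return Character.isValidCodePoint(cp)
                        ? new String(Character.toChars(cp)) : null;
            } catch (NumberFormatException e) {
                return null; // e.g. "&#;" or "&#xZZ;"
            }
        }
        return NAMED.get(body);
    }
}
```

Under these assumptions, a parser would call decodeAll on the alt= and
content= values without lowercasing them, and could still lowercase the name=
and http-equiv= keys to keep the single combined namespace mentioned above.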