http://issues.apache.org/bugzilla/show_bug.cgi?id=30621

HTML Parser doesn't decode character references in attributes

           Summary: HTML Parser doesn't decode character references in
                    attributes
           Product: Lucene
           Version: 1.4
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: Normal
          Priority: Other
         Component: Examples
        AssignedTo: [EMAIL PROTECTED]
        ReportedBy: [EMAIL PROTECTED]

The HTML Parser includes the values of certain attributes in the summary, the metaTags and the output stream. Character references in the attribute values are not decoded. Specifically:

1. The value of the alt= attribute of an <img ...> tag is included in the summary and the output stream. This value is case-significant, and may include character references. The character references are not decoded.

2. The value of the content= attribute of a <meta ...> tag is included in the metaTags if the tag also has a name= or http-equiv= attribute. This value is case-significant, and may include character references. The character references are not decoded, and the value is downcased (since the fix to bug #27423).

I've patched our version of the parser to decode the character references, by adding a decodeAll method to Entities to parse a String for character references and return a String where the references have been replaced by the corresponding characters (or the original String, if no change is needed). This method is called to decode alt= attributes and content= attributes. I've removed the .toLowerCase() on the content= value.

I'm not really happy with this fix, as it seems to me to be wrong to parse a value which was previously parsed as a single token; there ought to be a way to get it right the first time.

I've left the name= and http-equiv= values alone. It's not entirely clear (to me) whether character references are allowed, and it would be perverse to use them here.
I also appreciate the convenience of having a single combined namespace, with downcased names, even though this is technically wrong.
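
For illustration only, here is a minimal sketch of what a decodeAll-style method could look like. It is not the patch described above: the class name EntityDecoder, the helper decodeOne and the small NAMED table are hypothetical stand-ins, and a real version would presumably use the full entity table already held by Entities. It assumes every reference is terminated by a semicolon, and it returns the original String when nothing was decoded.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone class; the report adds decodeAll to Entities instead.
final class EntityDecoder {
    // Small illustrative table (assumption): a real parser needs the full HTML entity set.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("nbsp", "\u00a0");
        NAMED.put("eacute", "\u00e9");
    }

    /** Returns s with character references decoded, or s itself if nothing changed. */
    static String decodeAll(String s) {
        int amp = s.indexOf('&');
        if (amp < 0) {
            return s; // fast path: no ampersand, so no references to decode
        }
        StringBuilder out = new StringBuilder(s.length());
        int pos = 0;
        boolean changed = false;
        while (amp >= 0) {
            out.append(s, pos, amp); // copy the text before the ampersand
            int semi = s.indexOf(';', amp + 1);
            String decoded = (semi > amp + 1) ? decodeOne(s.substring(amp + 1, semi)) : null;
            if (decoded != null) {
                out.append(decoded);
                pos = semi + 1;
                changed = true;
            } else {
                out.append('&'); // not a recognised reference: keep the ampersand as is
                pos = amp + 1;
            }
            amp = s.indexOf('&', pos);
        }
        out.append(s, pos, s.length());
        return changed ? out.toString() : s;
    }

    /** Decodes the text between '&' and ';', or returns null if it is not a known reference. */
    private static String decodeOne(String body) {
        if (body.startsWith("#")) {
            try {
                boolean hex = body.length() > 1
                        && (body.charAt(1) == 'x' || body.charAt(1) == 'X');
                int cp = hex ? Integer.parseInt(body.substring(2), 16)
                             : Integer.parseInt(body.substring(1), 10);
                if (cp <= 0 || !Character.isValidCodePoint(cp)) {
                    return null;
                }
                return new String(Character.toChars(cp));
            } catch (NumberFormatException e) {
                return null; // malformed numeric reference such as "&#;" or "&#xZZ;"
            }
        }
        return NAMED.get(body); // named references are case-sensitive
    }
}
```

Under these assumptions, the parser would pass the raw alt= and content= values through decodeAll before adding them to the summary or the metaTags, and would not lowercase the content= value afterwards. Like the patch, the sketch re-parses a value that was already read as a single token, so it does not address the reservation raised above.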