[ http://issues.apache.org/jira/browse/LUCENE-259?page=all ]
Daniel Naber updated LUCENE-259:
--------------------------------
Bugzilla Id: (was: 30621)
Assign To: (was: Lucene Developers)
Priority: Minor (was: Major)
Decrease priority because this affects the demo only.
> HTML Parser doesn't decode character references in attributes
> -------------------------------------------------------------
>
> Key: LUCENE-259
> URL: http://issues.apache.org/jira/browse/LUCENE-259
> Project: Lucene - Java
> Type: Bug
> Components: Examples
> Versions: 1.4
> Environment: Operating System: All
> Platform: All
> Reporter: Dave Sparks
> Priority: Minor
>
> The HTML Parser includes the values of certain attributes in the summary, the
> metaTags and the output stream. Character references in the attribute values
> are not decoded. Specifically:
> 1. The value of the alt= attribute of an <img ...> tag is included in the
> summary and the output stream. This value is case-significant, and may include
> character references. The character references are not decoded.
> 2. The value of the content= attribute of a <meta ...> tag is included in the
> metaTags if the tag also has a name= or http-equiv= attribute. This value is
> case-significant, and may include character references. The character
> references are not decoded, and the value is downcased (since the fix to bug
> #27423).
> I've patched our version of the parser to decode the character references, by
> adding a decodeAll method to Entities to parse a String for character references
> and return a String where the references have been replaced by the corresponding
> characters (or the original String, if no change is needed). This method is
> called to decode alt= attributes and content= attributes. I've removed the
> .toLowerCase() on the content= value. I'm not really happy with this fix, as it
> seems to me to be wrong to parse a value which was previously parsed as a single
> token; there ought to be a way to get it right the first time.
> I've left the name= and http-equiv= values alone. It's not entirely clear (to
> me) whether character references are allowed, and it would be perverse to use
> them here. I also appreciate the convenience of having a single combined
> namespace, with downcased names, even though this is technically wrong.
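For readers who want a concrete picture of the decodeAll idea described in the
report, here is a minimal sketch. It is not the reporter's patch and not the
Lucene demo source: the class name EntityDecoder, the helper decodeOne and the
tiny entity table are all assumptions made for illustration, and it uses newer
Java syntax than Lucene 1.4 targeted. It decodes named, decimal and hexadecimal
references, leaves anything it does not recognise untouched, and returns the
original String instance when nothing was replaced.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the demo's Entities class; not the Lucene source. */
final class EntityDecoder {
    // Small illustrative table (an assumption); a real table would be much larger.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("nbsp", "\u00A0");
        NAMED.put("eacute", "\u00E9");
    }

    /** Replaces character references in s; returns s itself if nothing changed. */
    static String decodeAll(String s) {
        int amp = s.indexOf('&');
        if (amp < 0) {
            return s; // fast path: no '&', so no references
        }
        StringBuilder out = new StringBuilder(s.length());
        int pos = 0;
        boolean changed = false;
        while (amp >= 0) {
            int semi = s.indexOf(';', amp + 1);
            String decoded = (semi < 0) ? null : decodeOne(s.substring(amp + 1, semi));
            if (decoded == null) {
                // Not a recognised reference: copy the '&' through unchanged.
                out.append(s, pos, amp + 1);
                pos = amp + 1;
            } else {
                out.append(s, pos, amp).append(decoded);
                pos = semi + 1;
                changed = true;
            }
            amp = s.indexOf('&', pos);
        }
        if (!changed) {
            return s;
        }
        return out.append(s, pos, s.length()).toString();
    }

    /** Decodes the text between '&' and ';', or returns null if it is not a reference. */
    private static String decodeOne(String body) {
        if (body.isEmpty()) {
            return null;
        }
        if (body.charAt(0) != '#') {
            return NAMED.get(body); // named reference, e.g. "amp"
        }
        try {
            boolean hex = body.length() > 1
                    && (body.charAt(1) == 'x' || body.charAt(1) == 'X');
            int cp = hex
                    ? Integer.parseInt(body.substring(2), 16)
                    : Integer.parseInt(body.substring(1));
            if (!Character.isValidCodePoint(cp)) {
                return null;
            }
            return new String(Character.toChars(cp));
        } catch (NumberFormatException e) {
            return null; // "&#;" or "&#xyz;" and similar
        }
    }
}
```

With this sketch, EntityDecoder.decodeAll("Caf&eacute; &amp; Bar") yields
"Café & Bar" with its case intact, which is the behaviour the report asks for
on alt= and content= values. The name= and http-equiv= values would still be
downcased by the caller, as the report suggests leaving them alone.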