[jira] Closed: (LUCENE-259) HTML Parser doesn't decode character references in attributes

Shai Erera (JIRA) Thu, 27 Jan 2011 02:20:12 -0800

     [ https://issues.apache.org/jira/browse/LUCENE-259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Shai Erera closed LUCENE-259.
-----------------------------

    Resolution: Won't Fix

Closing after very long inactivity. The HtmlParser in the demo has many
problems in general, and I don't think we intend to have a fully working
HtmlParser in our code; it was intended for demo purposes only.

> HTML Parser doesn't decode character references in attributes
> -------------------------------------------------------------
>
>                 Key: LUCENE-259
>                 URL: https://issues.apache.org/jira/browse/LUCENE-259
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Examples
>    Affects Versions: 1.4
>         Environment: Operating System: All
> Platform: All
>            Reporter: Dave Sparks
>            Priority: Minor
>
> The HTML Parser includes the values of certain attributes in the summary, the
> metaTags and the output stream.  Character references in the attribute values
> are not decoded.  Specifically:
> 1. The value of the alt= attribute of an <img ...> tag is included in the
> summary and the output stream.  This value is case-significant, and may
> include character references.  The character references are not decoded.
> 2. The value of the content= attribute of a <meta ...> tag is included in the
> metaTags if the tag also has a name= or http-equiv= attribute.  This value is
> case-significant, and may include character references.  The character
> references are not decoded, and the value is downcased (since the fix to bug
> #27423).
>
> I've patched our version of the parser to decode the character references, by
> adding a decodeAll method to Entities to parse a String for character
> references and return a String where the references have been replaced by the
> corresponding characters (or the original String, if no change is needed).
> This method is called to decode alt= attributes and content= attributes.  I've
> removed the .toLowerCase() on the content= value.  I'm not really happy with
> this fix, as it seems to me to be wrong to parse a value which was previously
> parsed as a single token; there ought to be a way to get it right the first
> time.
>
> I've left the name= and http-equiv= values alone.  It's not entirely clear (to
> me) whether character references are allowed, and it would be perverse to use
> them here.  I also appreciate the convenience of having a single combined
> namespace, with downcased names, even though this is technically wrong.
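
For illustration only, a decodeAll along the lines the reporter describes might
look roughly like the sketch below.  The reporter's actual patch is not part of
this message, so everything here is an assumption: the class name
EntityDecoder, the small named-entity table, and the handling of malformed
references are all hypothetical, and the demo's real Entities class may differ.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the demo's Entities class; not the reporter's patch.
final class EntityDecoder {
    // Tiny illustrative table; a real one would cover the full HTML entity set.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("nbsp", "\u00A0");
    }

    /** Returns s with character references decoded, or s itself if nothing changed. */
    static String decodeAll(String s) {
        int amp = s.indexOf('&');
        if (amp < 0) return s;                 // fast path: no '&' at all
        StringBuilder out = new StringBuilder(s.length());
        int pos = 0;                           // start of text not yet copied
        boolean changed = false;
        while (amp >= 0) {
            int semi = s.indexOf(';', amp + 1);
            if (semi < 0) break;               // no terminator left in the string
            String decoded = decodeOne(s.substring(amp + 1, semi));
            if (decoded != null) {
                out.append(s, pos, amp).append(decoded);
                pos = semi + 1;
                changed = true;
                amp = s.indexOf('&', pos);
            } else {
                amp = s.indexOf('&', amp + 1); // not a reference; keep it verbatim
            }
        }
        if (!changed) return s;
        return out.append(s, pos, s.length()).toString();
    }

    // Decodes the text between '&' and ';', or returns null if it is unrecognized.
    private static String decodeOne(String ref) {
        if (ref.isEmpty()) return null;
        if (ref.charAt(0) != '#') return NAMED.get(ref);
        try {
            boolean hex = ref.length() > 1
                    && (ref.charAt(1) == 'x' || ref.charAt(1) == 'X');
            int cp = hex ? Integer.parseInt(ref.substring(2), 16)
                         : Integer.parseInt(ref.substring(1), 10);
            if (!Character.isValidCodePoint(cp)) return null;
            return new String(Character.toChars(cp));
        } catch (NumberFormatException e) {
            return null;                       // e.g. "&#;" or "&#xZZ;"
        }
    }
}
```

A caller would apply this to the raw alt= or content= value without lowercasing
it first, for example EntityDecoder.decodeAll("Caf&#233; &amp; Bar").  Unknown
or malformed references are passed through unchanged in this sketch.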

