[jira] Updated: (LUCENE-259) HTML Parser doesn't decode character references in attributes

Daniel Naber (JIRA) Thu, 15 Jun 2006 15:09:08 -0700

     [ http://issues.apache.org/jira/browse/LUCENE-259?page=all ]


Daniel Naber updated LUCENE-259:
--------------------------------

    Bugzilla Id:   (was: 30621)
      Assign To:     (was: Lucene Developers)
       Priority: Minor  (was: Major)

Decreased the priority because this affects only the demo.


> HTML Parser doesn't decode character references in attributes
> -------------------------------------------------------------
>
>          Key: LUCENE-259
>          URL: http://issues.apache.org/jira/browse/LUCENE-259
>      Project: Lucene - Java
>         Type: Bug

>   Components: Examples
>     Versions: 1.4
>  Environment: Operating System: All
> Platform: All
>     Reporter: Dave Sparks
>     Priority: Minor

>
> The HTML Parser includes the values of certain attributes in the summary, the
> metaTags and the output stream.  Character references in the attribute values
> are not decoded.  Specifically:
> 1. The value of the alt= attribute of an <img ...> tag is included in the
> summary and the output stream.  This value is case-significant, and may
> include character references.  The character references are not decoded.
> 2. The value of the content= attribute of a <meta ...> tag is included in the
> metaTags if the tag also has a name= or http-equiv= attribute.  This value is
> case-significant, and may include character references.  The character
> references are not decoded, and the value is downcased (since the fix to bug
> #27423).
>
> I've patched our version of the parser to decode the character references, by
> adding a decodeAll method to Entities to parse a String for character
> references and return a String where the references have been replaced by the
> corresponding characters (or the original String, if no change is needed).
> This method is called to decode alt= attributes and content= attributes.  I've
> removed the .toLowerCase() on the content= value.  I'm not really happy with
> this fix, as it seems to me to be wrong to parse a value which was previously
> parsed as a single token; there ought to be a way to get it right the first
> time.
>
> I've left the name= and http-equiv= values alone.  It's not entirely clear (to
> me) whether character references are allowed, and it would be perverse to use
> them here.  I also appreciate the convenience of having a single combined
> namespace, with downcased names, even though this is technically wrong.
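
For illustration only, here is a rough sketch of what a decodeAll-style helper
like the one described above might look like. It is not the reporter's patch
and not the actual Entities class from the Lucene demo: the class name
EntityDecoder, the small entity table, and the choice to leave unrecognized
references untouched are all assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper; NOT the Lucene demo's Entities class. */
final class EntityDecoder {
    // Illustrative subset of named entities (assumption: a real table is much larger).
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("nbsp", "\u00A0");
        NAMED.put("eacute", "\u00E9");
    }

    /**
     * Replaces character references (&name;, &#NNN;, &#xHHH;) with the
     * corresponding characters. Returns the original String instance
     * when nothing was decoded.
     */
    static String decodeAll(String s) {
        int amp = s.indexOf('&');
        if (amp < 0) {
            return s; // no '&' at all: nothing to decode
        }
        StringBuilder out = new StringBuilder(s.length());
        int pos = 0;
        boolean changed = false;
        while (amp >= 0) {
            int semi = s.indexOf(';', amp + 1);
            String repl = (semi > amp + 1) ? decodeOne(s.substring(amp + 1, semi)) : null;
            if (repl == null) {
                // Not a recognized reference: copy through the '&' unchanged.
                out.append(s, pos, amp + 1);
                pos = amp + 1;
            } else {
                out.append(s, pos, amp).append(repl);
                pos = semi + 1;
                changed = true;
            }
            amp = s.indexOf('&', pos);
        }
        if (!changed) {
            return s;
        }
        return out.append(s, pos, s.length()).toString();
    }

    /** Decodes the text between '&' and ';', or returns null if unrecognized. */
    private static String decodeOne(String name) {
        if (name.charAt(0) != '#') {
            return NAMED.get(name);
        }
        try {
            boolean hex = name.length() > 1
                    && (name.charAt(1) == 'x' || name.charAt(1) == 'X');
            int cp = hex
                    ? Integer.parseInt(name.substring(2), 16)
                    : Integer.parseInt(name.substring(1));
            return Character.isValidCodePoint(cp)
                    ? new String(Character.toChars(cp))
                    : null;
        } catch (NumberFormatException e) {
            return null; // malformed numeric reference
        }
    }
}
```

Under these assumptions, a caller would apply EntityDecoder.decodeAll to the
alt= and content= values before adding them to the summary or metaTags, and
would skip the .toLowerCase() call on content= so that case is preserved.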

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

