[ https://issues.apache.org/jira/browse/TIKA-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison updated TIKA-2100:
------------------------------
Fix Version/s: 2.0.0, 1.19
> Html Parser does not keep the html tag attributes
> -------------------------------------------------
>
> Key: TIKA-2100
> URL: https://issues.apache.org/jira/browse/TIKA-2100
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Reporter: Gerard Bouchar
> Priority: Major
> Fix For: 1.19, 2.0.0
>
>
> When parsing a very simple HTML document such as
> <!DOCTYPE html>
> <html lang="en">
> <head>
> <title>Page Title</title>
> </head>
> <body>
> <h1 align="left">My First Heading</h1>
> <p>My first paragraph.</p>
> </body>
> </html>
> you won't be able to access the html tag's attributes (here lang="en") in the
> ContentHandler:
> * In the method startElement(String ns, String localName, String name,
> Attributes atts), atts is empty.
> * Moreover, it seems that the html tag's attributes are not passed through the
> HtmlMapper.mapSafeAttribute method either.
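
For reference, the behaviour the report expects can be sketched at the plain SAX level. This is only an illustration under stated assumptions: it uses the JDK's built-in XML parser rather than Tika's HtmlParser, the sample page is fed in without its DOCTYPE so that it is well-formed XML, and the class and method names (HtmlAttrDemo, htmlTagAttributes) are hypothetical. It shows a handler whose startElement callback receives the html tag's attributes, which is what the reporter says does not happen when the same handler is driven by Tika.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; it does not use Tika at all.
class HtmlAttrDemo {

    // Collects the attributes seen on the <html> element in startElement.
    static Map<String, String> htmlTagAttributes(String xhtml) {
        Map<String, String> found = new LinkedHashMap<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String ns, String localName, String name,
                                     Attributes atts) {
                // The JDK parser is not namespace-aware by default,
                // so the qualified name is the one to compare against.
                if ("html".equals(name)) {
                    for (int i = 0; i < atts.getLength(); i++) {
                        found.put(atts.getQName(i), atts.getValue(i));
                    }
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            // Wrap checked parser exceptions to keep the sketch short.
            throw new IllegalStateException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        // The page from the report, minus the DOCTYPE line.
        String page = "<html lang=\"en\"><head><title>Page Title</title></head>"
                + "<body><h1 align=\"left\">My First Heading</h1>"
                + "<p>My first paragraph.</p></body></html>";
        System.out.println(htmlTagAttributes(page));
    }
}
```

With a parser that forwards the attributes, the returned map contains the lang entry. According to the report, the equivalent check against Tika 1.13 finds atts empty for the html element.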