Gerard Bouchar created TIKA-2100:
------------------------------------
Summary: Html Parser does not keep the html tag attributes
Key: TIKA-2100
URL: https://issues.apache.org/jira/browse/TIKA-2100
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.13
Reporter: Gerard Bouchar
When parsing a very simple HTML document such as
<!DOCTYPE html>
<html lang="en">
<head>
<title>Page Title</title>
</head>
<body>
<h1 align="left">My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>
you won't be able to access the html tag's attributes (here lang="en") in the
ContentHandler:
* In the method startElement(String ns, String localName, String name,
Attributes atts), atts is empty for the html element.
* Moreover, it seems that the html tag's attributes are not passed through the
HtmlMapper.mapSafeAttribute method either.
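
A minimal sketch of a handler that can be used to check this behaviour. The class name HtmlAttributeProbe and the helper probe() are hypothetical and not part of Tika. The handler relies only on the standard org.xml.sax API, so the same instance could presumably be passed to HtmlParser.parse(stream, handler, new Metadata(), new ParseContext()) to reproduce the empty atts described above. The probe() helper below does not use Tika at all: it feeds the same document to the JDK's own SAX parser, purely to show what the handler records when the attributes are delivered.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: records the attributes seen on the html element. */
public class HtmlAttributeProbe extends DefaultHandler {

    private final Map<String, String> htmlAttributes = new LinkedHashMap<>();

    @Override
    public void startElement(String ns, String localName, String name, Attributes atts) {
        // Fall back to the qualified name when the parser is not namespace-aware.
        String element = (localName == null || localName.isEmpty()) ? name : localName;
        if ("html".equalsIgnoreCase(element)) {
            for (int i = 0; i < atts.getLength(); i++) {
                htmlAttributes.put(atts.getQName(i), atts.getValue(i));
            }
        }
    }

    public Map<String, String> getHtmlAttributes() {
        return htmlAttributes;
    }

    /**
     * Assumption: runs the handler with the JDK SAX parser (not Tika) on a
     * well-formed document, to show the attributes one would expect to see.
     */
    public static Map<String, String> probe(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            HtmlAttributeProbe handler = new HtmlAttributeProbe();
            byte[] bytes = xhtml.getBytes(StandardCharsets.UTF_8);
            factory.newSAXParser().parse(new ByteArrayInputStream(bytes), handler);
            return handler.getHtmlAttributes();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample document", e);
        }
    }
}
```

With the sample document above, probe() records lang="en" for the html element. Per this report, the same handler driven by the Tika 1.13 HtmlParser would be left with an empty map.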
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)