Gerard Bouchar created TIKA-2100:
------------------------------------

             Summary: Html Parser does not keep the html tag attributes
                 Key: TIKA-2100
                 URL: https://issues.apache.org/jira/browse/TIKA-2100
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.13
            Reporter: Gerard Bouchar


When parsing a very simple HTML document like this:
<!DOCTYPE html>
<html lang="en">
<head>
<title>Page Title</title>
</head>
<body>

<h1 align="left">My First Heading</h1>
<p>My first paragraph.</p>

</body>
</html> 

you won't be able to access the html tag's attributes (here lang="en") in the
ContentHandler:
* In the method startElement(String ns, String localName, String name,
  Attributes atts), atts is empty.
* Moreover, it seems that the html tag's attributes are not passed through the
  HtmlMapper.mapSafeAttribute method either.
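
To make the expectation concrete, here is a minimal sketch of a ContentHandler that records whatever attributes arrive with the html element. The class name HtmlAttributeProbe and its collect helper are hypothetical and not part of Tika. So that the sketch runs without Tika on the classpath, it is driven here by the JDK's own SAX parser, which works only because the sample page happens to be well-formed XML. It shows the behaviour the report expects, not Tika's actual output.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical probe: remembers the attributes seen on the <html> element.
public class HtmlAttributeProbe extends DefaultHandler {
    private final Map<String, String> htmlAttributes = new LinkedHashMap<>();

    @Override
    public void startElement(String ns, String localName, String name, Attributes atts) {
        // Only the root html element is of interest for this report.
        if ("html".equalsIgnoreCase(localName)) {
            for (int i = 0; i < atts.getLength(); i++) {
                htmlAttributes.put(atts.getLocalName(i), atts.getValue(i));
            }
        }
    }

    public Map<String, String> getHtmlAttributes() {
        return htmlAttributes;
    }

    // Assumption: the markup is well-formed XML, so the JDK SAX parser can stand in for Tika.
    public static Map<String, String> collect(String markup) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed so that localName is populated
        HtmlAttributeProbe probe = new HtmlAttributeProbe();
        byte[] bytes = markup.getBytes(StandardCharsets.UTF_8);
        factory.newSAXParser().parse(new ByteArrayInputStream(bytes), probe);
        return probe.getHtmlAttributes();
    }

    public static void main(String[] args) throws Exception {
        String page = "<!DOCTYPE html>\n"
                + "<html lang=\"en\">\n"
                + "<head>\n<title>Page Title</title>\n</head>\n"
                + "<body>\n"
                + "<h1 align=\"left\">My First Heading</h1>\n"
                + "<p>My first paragraph.</p>\n"
                + "</body>\n</html>\n";
        // With a plain SAX parser the html element's lang attribute is delivered.
        System.out.println(collect(page));
    }
}
```

To check the reported behaviour against Tika itself, the same probe could be passed as the handler to new HtmlParser().parse(stream, probe, new Metadata(), new ParseContext()). According to this report, getHtmlAttributes() would then come back empty on 1.13.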



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
