[ https://issues.apache.org/jira/browse/TIKA-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16490957#comment-16490957 ]
ASF GitHub Bot commented on TIKA-2100:
--------------------------------------
tballison commented on a change in pull request #238: TIKA-2100 extract content
language from html lang attribute
URL: https://github.com/apache/tika/pull/238#discussion_r190945058
##########
File path: tika-core/src/main/java/org/apache/tika/sax/XHTMLContentHandler.java
##########
@@ -138,7 +138,12 @@ private void lazyStartHead() throws SAXException {
// Call directly, so we don't go through our startElement(), which will
// ignore these elements.
- super.startElement(XHTML, "html", "html", EMPTY_ATTRIBUTES);
+ AttributesImpl htmlAttrs = new AttributesImpl();
Review comment:
Sorry... I'm narrowly focused on my own typical use case: extract text
without structure and put the lang in Metadata.
To confirm, do you need the 'lang=' attribute included in the XHTML output
from Tika? Are there other common HTML attributes that we should also include?
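
For reference, the emitting side of the change under review can be sketched
with the JDK SAX classes alone. This is only an illustration under stated
assumptions, not the code in the pull request: the class HtmlStartSketch and
the helpers htmlAttributes and startHtml are hypothetical names, and the
"lang" value is assumed to be supplied by the caller.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

public class HtmlStartSketch {
    // XHTML namespace URI used for the emitted elements.
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Hypothetical helper: builds the attribute list for the root html element.
    // Returns an empty list when no language is known.
    static AttributesImpl htmlAttributes(String lang) {
        AttributesImpl attrs = new AttributesImpl();
        if (lang != null && !lang.isEmpty()) {
            // Arguments: namespace URI, local name, qualified name, type, value.
            attrs.addAttribute("", "lang", "lang", "CDATA", lang);
        }
        return attrs;
    }

    // Hypothetical helper: emits the html start element with the attributes
    // attached, instead of passing an empty attribute list.
    static void startHtml(ContentHandler handler, String lang) throws SAXException {
        handler.startElement(XHTML, "html", "html", htmlAttributes(lang));
    }
}
```

Building the list in one place would also make it straightforward to add
other root-level attributes later, if they turn out to be needed.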
> Html Parser does not keep the html tag attributes
> -------------------------------------------------
>
> Key: TIKA-2100
> URL: https://issues.apache.org/jira/browse/TIKA-2100
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Reporter: Gerard Bouchar
> Priority: Major
>
> Parsing a very simple HTML document such as
> <!DOCTYPE html>
> <html lang="en">
> <head>
> <title>Page Title</title>
> </head>
> <body>
> <h1 align="left">My First Heading</h1>
> <p>My first paragraph.</p>
> </body>
> </html>
> you won't be able to access the html tag's attributes (here lang="en") in the
> ContentHandler:
> * In the method startElement(String ns, String localName, String name,
> Attributes atts), atts is empty.
> * Moreover, it seems that the html tag's attributes are not passed through the
> HtmlMapper.mapSafeAttribute method either.
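
The behaviour the reporter expects from the handler can be sketched as
follows. This is a minimal illustration under stated assumptions: it feeds
well-formed XHTML to the JDK's own SAX parser rather than to Tika's HTML
parser, and the class HtmlAttrCapture and its capture helper are hypothetical
names. It shows only what a ContentHandler sees when the root element's
attributes are passed through.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class HtmlAttrCapture extends DefaultHandler {
    // Attributes seen on the root html element, in document order.
    private final Map<String, String> htmlAttrs = new LinkedHashMap<>();

    @Override
    public void startElement(String ns, String localName, String name, Attributes atts) {
        // The default JDK parser is not namespace-aware, so localName may be
        // empty; match on the qualified name instead.
        if ("html".equalsIgnoreCase(name)) {
            for (int i = 0; i < atts.getLength(); i++) {
                htmlAttrs.put(atts.getQName(i), atts.getValue(i));
            }
        }
    }

    public Map<String, String> getHtmlAttrs() {
        return htmlAttrs;
    }

    // Hypothetical helper: parses well-formed XHTML with the JDK SAX parser
    // (not Tika) and returns the attributes found on the html element.
    public static Map<String, String> capture(String xhtml) throws Exception {
        HtmlAttrCapture handler = new HtmlAttrCapture();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getHtmlAttrs();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html lang=\"en\"><head><title>Page Title</title></head>"
                + "<body><h1 align=\"left\">My First Heading</h1>"
                + "<p>My first paragraph.</p></body></html>";
        System.out.println(capture(xhtml)); // prints {lang=en}
    }
}
```

With the behaviour described in this issue, the same handler passed to Tika's
HTML parser would report an empty map for the html element.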