[
https://issues.apache.org/jira/browse/TIKA-457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886061#action_12886061
]
Ken Krugler commented on TIKA-457:
----------------------------------
It's TagSoup that's generating the "interesting" output. Straight from a
TagSoup parser (without Tika), the above gives you:
{code}
<?xml version="1.0" encoding="UTF-8"?>
<html><head><title> my title </title></head><body/><frameset rows="20,*"><frame
frameborder="1" scrolling="auto" src="top.html"/><frameset cols="20,*"><frame
frameborder="1" scrolling="auto" src="left.html"/><frame frameborder="1"
scrolling="auto" src="invalid.html"/><frame frameborder="1" scrolling="auto"
src="right.html"/></frameset></frameset></html>
{code}
According to the XHTML 1.0 "frameset" DTD and the HTML 4.01 "frameset" DTD, the
<frameset> element takes the place of <body> as a child of <html> and should NOT
be inside of a body tag, which is why you're seeing the odd output.
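To make the event order concrete, here is a minimal sketch that feeds the serialized output above to the JDK's own SAX parser and records the start/end events. It does not run TagSoup itself; it only replays the already well-formed XML, and the class and method names (EventOrderSketch, events) are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class EventOrderSketch {
    // The TagSoup output quoted above, with line breaks removed.
    static final String TAGSOUP_OUTPUT =
        "<html><head><title> my title </title></head><body/>"
        + "<frameset rows=\"20,*\">"
        + "<frame frameborder=\"1\" scrolling=\"auto\" src=\"top.html\"/>"
        + "<frameset cols=\"20,*\">"
        + "<frame frameborder=\"1\" scrolling=\"auto\" src=\"left.html\"/>"
        + "<frame frameborder=\"1\" scrolling=\"auto\" src=\"invalid.html\"/>"
        + "<frame frameborder=\"1\" scrolling=\"auto\" src=\"right.html\"/>"
        + "</frameset></frameset></html>";

    // Parses the XML and returns the element events in document order,
    // e.g. "start:body", "end:body".
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();
        DefaultHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                log.add("start:" + qName);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                log.add("end:" + qName);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), recorder);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        List<String> log = events(TAGSOUP_OUTPUT);
        // True when <body> is closed before the first <frameset> is opened.
        System.out.println(log.indexOf("end:body") < log.indexOf("start:frameset"));
    }
}
```

For this input the check prints true: the empty <body/> is opened and closed before the first <frameset> starts.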
I believe the issue here is that based on TagSoup's state machine architecture,
the <body> tag has been emitted by the time you get to the <frameset>. TagSoup
could hang onto the <body> tag until it sees something other than a <frameset>,
but that feels pretty extreme.
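To see why this ordering matters for a handler that counts body depth, as the issue describes, here is a rough sketch. BodyOnlyHandler is a hypothetical, much simplified stand-in and not Tika's actual handler code; the replay helper and the "+name"/"-name" event notation are also invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class BodyLevelSketch {
    // Hypothetical handler: keeps only elements that start while inside <body>.
    static class BodyOnlyHandler extends DefaultHandler {
        int bodyLevel = 0;
        final List<String> kept = new ArrayList<>();

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if ("body".equals(qName)) {
                bodyLevel++;
            } else if (bodyLevel > 0) {
                kept.add(qName);
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if ("body".equals(qName)) {
                bodyLevel--;
            }
        }
    }

    // Replays events written as "+name" (start) or "-name" (end)
    // and returns the element names the handler kept.
    static List<String> replay(String... events) {
        BodyOnlyHandler handler = new BodyOnlyHandler();
        Attributes none = new AttributesImpl();
        for (String event : events) {
            String name = event.substring(1);
            if (event.charAt(0) == '+') {
                handler.startElement("", name, name, none);
            } else {
                handler.endElement("", name, name);
            }
        }
        return handler.kept;
    }

    public static void main(String[] args) {
        // Order seen from TagSoup: body is closed before frameset starts.
        System.out.println(replay("+html", "+body", "-body",
            "+frameset", "+frame", "-frame", "-frameset", "-html"));
        // Order the reporter expected: frameset nested inside body.
        System.out.println(replay("+html", "+body",
            "+frameset", "+frame", "-frame", "-frameset", "-body", "-html"));
    }
}
```

With the first ordering nothing is kept, because the counter is back at 0 before any frame element arrives; with the second, both frameset and frame are kept.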
Side note: the HTML is slightly broken, in that <frame src="invalid.html"/>
is followed by </frame>, even though the element has already been terminated by
the "/>" sequence. I don't know whether that was intentional.
Also, strictly speaking, <frame> is an empty element in HTML 4.01, so it can't
take the closing tags used in the example. Each one should be written as
<frame src="blah"> without a </frame>.
> HTMLParser gets an early </body> event
> --------------------------------------
>
> Key: TIKA-457
> URL: https://issues.apache.org/jira/browse/TIKA-457
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Julien Nioche
>
> I am using the IdentityMapper in the HTMLparser with this simple document:
> {code}
> <html><head><title> my title </title>
> </head>
> <body>
> <frameset rows=\"20,*\">
> <frame src=\"top.html\">
> </frame>
> <frameset cols=\"20,*\">
> <frame src=\"left.html\">
> </frame>
> <frame src=\"invalid.html\"/>
> </frame>
> <frame src=\"right.html\">
> </frame>
> </frameset>
> </frameset>
> </body></html>
> {code}
> Strangely the HTMLHandler is getting a call to endElement on the body
> *BEFORE* we reach frameset. As a result the variable bodylevel is
> decremented back to 0 and the remaining entities are ignored due to the logic
> implemented in HTMLHandler.
> Any idea?