[ 
https://issues.apache.org/jira/browse/TIKA-457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886061#action_12886061
 ] 

Ken Krugler commented on TIKA-457:
----------------------------------

It's TagSoup that's generating the "interesting" output. Straight from a 
TagSoup parser (without Tika), the above gives you:

{code}
<?xml version="1.0" encoding="UTF-8"?>
<html><head><title> my title </title></head><body/><frameset rows="20,*"><frame 
frameborder="1" scrolling="auto" src="top.html"/><frameset cols="20,*"><frame 
frameborder="1" scrolling="auto" src="left.html"/><frame frameborder="1" 
scrolling="auto" src="invalid.html"/><frame frameborder="1" scrolling="auto" 
src="right.html"/></frameset></frameset></html>
{code}

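According to the XHTML 1.0 "frameset" DTD and the HTML 4.01 "frameset" DTD, the 
<frameset> element should NOT be inside of a body tag. That is why you're 
seeing the odd output: TagSoup closes the <body> immediately (the empty <body/> 
above) and emits the <frameset> after it, as a sibling.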

I believe the issue is that, because of TagSoup's state machine architecture, 
the <body> tag has already been emitted by the time the parser reaches the 
<frameset>. TagSoup could hold back the <body> tag until it sees something 
other than a <frameset>, but that feels pretty extreme.
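
To make the effect on a body-counting handler concrete, here is a minimal 
sketch using only the JDK's SAX parser. It replays the TagSoup output quoted 
above (XML declaration dropped, attributes trimmed to src) through a handler 
that only records <frame> elements while a body-level counter is above zero. 
BodyLevelDemo and BodyOnlyHandler are made-up names for illustration; this is 
an assumption about the general shape of the logic described in the issue, not 
Tika's actual HTMLHandler code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class BodyLevelDemo {

    // The TagSoup output from the comment above, trimmed for brevity.
    public static final String TAGSOUP_OUTPUT =
        "<html><head><title> my title </title></head><body/>"
        + "<frameset rows=\"20,*\"><frame src=\"top.html\"/>"
        + "<frameset cols=\"20,*\"><frame src=\"left.html\"/>"
        + "<frame src=\"invalid.html\"/><frame src=\"right.html\"/>"
        + "</frameset></frameset></html>";

    /** Hypothetical handler: keeps a <frame> only while inside <body>. */
    public static class BodyOnlyHandler extends DefaultHandler {
        public int bodyLevel = 0;
        public final List<String> seen = new ArrayList<>();
        public final List<String> dropped = new ArrayList<>();

        @Override
        public void startElement(String uri, String local, String qName,
                                 Attributes atts) {
            if (qName.equals("body")) {
                bodyLevel++;
                return;
            }
            if (qName.equals("frame")) {
                // Frames that arrive after </body> land in "dropped".
                (bodyLevel > 0 ? seen : dropped).add(atts.getValue("src"));
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (qName.equals("body")) {
                bodyLevel--;
            }
        }
    }

    /** Parses the given XML with the JDK SAX parser and returns the handler. */
    public static BodyOnlyHandler run(String xml) throws Exception {
        BodyOnlyHandler handler = new BodyOnlyHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        BodyOnlyHandler h = run(TAGSOUP_OUTPUT);
        System.out.println("seen inside body: " + h.seen);
        System.out.println("dropped outside body: " + h.dropped);
    }
}
```

Because <body/> opens and closes before the first <frameset>, the counter is 
already back at 0 when the frames arrive, so all four end up in the dropped 
list and none are seen inside the body.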

Side note: the HTML is slightly broken, in that <frame src=\"invalid.html\"/> 
is followed by </frame> even though the "/>" sequence has already closed the 
element. I don't know whether that was intentional.

Also, strictly speaking, HTML 4.01 declares <frame> as an EMPTY element, so it 
cannot have an end tag, which is what the <frame ...></frame> pairs in the 
example use. They should be <frame src="blah"> without a </frame>.



> HTMLParser gets an early </body> event
> --------------------------------------
>
>                 Key: TIKA-457
>                 URL: https://issues.apache.org/jira/browse/TIKA-457
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Julien Nioche
>
> I am using the IdentityMapper in the HTMLparser with this simple document:
> {code}
> <html><head><title> my title </title>
> </head>
> <body>
> <frameset rows=\"20,*\"> 
> <frame src=\"top.html\">
> </frame>
> <frameset cols=\"20,*\">
> <frame src=\"left.html\">
> </frame>
> <frame src=\"invalid.html\"/>
> </frame>
> <frame src=\"right.html\">
> </frame>
> </frameset>
> </frameset>
> </body></html>
> {code}
> Strangely the HTMLHandler is getting a call to endElement on the body 
> *BEFORE*  we reach frameset. As a result the variable bodylevel is 
> decremented back to 0 and the remaining entities are ignored due to the logic 
> implemented in HTMLHandler.
> Any idea?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
