[ 
https://issues.apache.org/jira/browse/TIKA-985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049338#comment-15049338
 ] 

Tim Allison commented on TIKA-985:
----------------------------------

This is what we get from out-of-the-box JSoup for the test file included in 
TIKA-980.

{noformat}
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.html.HtmlParser" />
<meta name="dc:title" content="Tika microdata test" />
<meta name="Content-Encoding" content="UTF-8" />
<meta name="Content-Type-Hint" content="text/html; charset=utf-8" />
<meta name="Content-Type" content="application/xhtml+xml; charset=UTF-8" />
<title>Tika microdata test</title>
</head>
<body>
        <ul>
                        <li><a href="http://tika.apache.org/">Apache Tika</a></li>

                        <li>Microdata</li>

          </ul>


        
                    <h2>ApacheCon Europe 2012\</h2>

                    Details of the annual ApacheCon meetings held in Europe and 
the United States, with registration information and an archive of previous 
meetings.

                    
                        Sinsheim, Germany
                

                    <a href="http://apachecon.eu/">apachecon.eu</a>

                    2012-11-05
                    a few days
                    2012-11-08

                    
                    17.50
            


        
                    <h2>ApacheCon North America 2013</h2>

                    Details of the annual ApacheCon meetings held in Europe and 
the United States, with registration information and an archive of previous 
meetings.

                    
                        Portland, Oregon
                

                    <a href="http://na.apachecon.com/">na.apachecon.com</a>

                    2013-02-24
                    a few days
                    2013-03-02

                    
                    17.50
            


</body></html>
{noformat}

This is very close to what we're getting with our old TagSoup parser:
{noformat}
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.html.HtmlParser" />
<meta name="dc:title" content="Tika microdata test" />
<meta name="Content-Encoding" content="UTF-8" />
<meta name="Content-Type-Hint" content="text/html; charset=utf-8" />
<meta name="Content-Type" content="application/xhtml+xml; charset=UTF-8" />
<title>Tika microdata test</title>
</head>
<body>
        <ul>    <li><a shape="rect" href="http://tika.apache.org/">Apache Tika</a></li>
        <li>Microdata</li>
</ul>


        
                    <h2>ApacheCon Europe 2012\</h2>

                    Details of the annual ApacheCon meetings held in Europe and 
the United States, with registration information and an archive of previous 
meetings.

                    
                        Sinsheim, Germany
                

                    <a shape="rect" href="http://apachecon.eu/">apachecon.eu</a>

                    2012-11-05
                    a few days
                    2012-11-08

                    
17.50
            

        
                    <h2>ApacheCon North America 2013</h2>

                    Details of the annual ApacheCon meetings held in Europe and 
the United States, with registration information and an archive of previous 
meetings.

                    
                        Portland, Oregon
                

                    <a shape="rect" href="http://na.apachecon.com/">na.apachecon.com</a>

                    2013-02-24
                    a few days
                    2013-03-02

                    
17.50
            
</body></html>
{noformat}

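One difference is that we lose the shape attribute with JSoup: the shape="rect" that appears on every <a> element in the TagSoup output is absent from the JSoup output. The remaining differences are in whitespace only.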
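
As a rough sanity check of the "very close" claim, the two dumps can be compared after stripping the shape="rect" attribute and collapsing whitespace. The sketch below is only an illustration: the class name OutputCompare and its methods are hypothetical (not part of Tika, JSoup, or TagSoup), and the regex-based normalization is an assumption that is good enough for these dumps but is not a general XHTML comparison.

```java
import java.util.regex.Pattern;

// Hypothetical helper for comparing two XHTML dumps; not part of Tika.
public class OutputCompare {
    // The shape="rect" attribute together with the whitespace before it.
    private static final Pattern SHAPE_ATTR = Pattern.compile("\\s+shape=\"rect\"");
    // Any run of whitespace, including line breaks.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");
    // Whitespace that sits only between two tags.
    private static final Pattern GAP_BETWEEN_TAGS = Pattern.compile(">\\s+<");

    // Drop shape="rect", collapse whitespace runs, and remove inter-tag gaps.
    public static String normalize(String xhtml) {
        String s = SHAPE_ATTR.matcher(xhtml).replaceAll("");
        s = WHITESPACE.matcher(s).replaceAll(" ");
        s = GAP_BETWEEN_TAGS.matcher(s).replaceAll("><");
        return s.trim();
    }

    // True when both dumps are identical after normalization.
    public static boolean equivalent(String a, String b) {
        return normalize(a).equals(normalize(b));
    }

    public static void main(String[] args) {
        // Fragments taken from the two outputs above.
        String jsoup = "<ul>\n   <li><a href=\"http://tika.apache.org/\">Apache Tika</a></li>\n\n"
                + "  <li>Microdata</li>\n\n </ul>";
        String tagsoup = "<ul>    <li><a shape=\"rect\" href=\"http://tika.apache.org/\">"
                + "Apache Tika</a></li>\n <li>Microdata</li>\n</ul>";
        System.out.println(equivalent(jsoup, tagsoup));
    }
}
```

If a comparison like this reports a mismatch on the full outputs, diffing the two normalized strings should point at whatever differs beyond whitespace and the shape attribute.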

> Support for HTML5 elements
> --------------------------
>
>                 Key: TIKA-985
>                 URL: https://issues.apache.org/jira/browse/TIKA-985
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.2
>            Reporter: Markus Jelsma
>             Fix For: 1.12
>
>         Attachments: TIKA-985-1.3-1.patch, TIKA-985-1.3-2.patch, 
> TIKA-985-1.3-3.patch, TIKA-985-1.5.patch
>
>
> TagSoup's schema.tssl does not include some HTML5 elements (e.g. article, 
> section). This prevents some custom ContentHandlers from reading expected 
> elements and/or attributes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
