[ https://issues.apache.org/jira/browse/TIKA-985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049338#comment-15049338 ]
Tim Allison commented on TIKA-985:
----------------------------------
This is what we get from out-of-the-box JSoup for the test file included in
TIKA-980.
{noformat}
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.html.HtmlParser" />
<meta name="dc:title" content="Tika microdata test" />
<meta name="Content-Encoding" content="UTF-8" />
<meta name="Content-Type-Hint" content="text/html; charset=utf-8" />
<meta name="Content-Type" content="application/xhtml+xml; charset=UTF-8" />
<title>Tika microdata test</title>
</head>
<body>
<ul>
<li><a href="http://tika.apache.org/">Apache
Tika</a></li>
<li>Microdata</li>
</ul>
<h2>ApacheCon Europe 2012\</h2>
Details of the annual ApacheCon meetings held in Europe and
the United States, with registration information and an archive of previous
meetings.
Sinsheim, Germany
<a href="http://apachecon.eu/">apachecon.eu</a>
2012-11-05
a few days
2012-11-08
17.50
<h2>ApacheCon North America 2013</h2>
Details of the annual ApacheCon meetings held in Europe and
the United States, with registration information and an archive of previous
meetings.
Portland, Oregon
<a href="http://na.apachecon.com/">na.apachecon.com</a>
2013-02-24
a few days
2013-03-02
17.50
</body></html>
{noformat}
This is very close to what we're getting with our old TagSoup:
{noformat}
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.html.HtmlParser" />
<meta name="dc:title" content="Tika microdata test" />
<meta name="Content-Encoding" content="UTF-8" />
<meta name="Content-Type-Hint" content="text/html; charset=utf-8" />
<meta name="Content-Type" content="application/xhtml+xml; charset=UTF-8" />
<title>Tika microdata test</title>
</head>
<body>
<ul> <li><a shape="rect" href="http://tika.apache.org/">Apache
Tika</a></li>
<li>Microdata</li>
</ul>
<h2>ApacheCon Europe 2012\</h2>
Details of the annual ApacheCon meetings held in Europe and
the United States, with registration information and an archive of previous
meetings.
Sinsheim, Germany
<a shape="rect" href="http://apachecon.eu/">apachecon.eu</a>
2012-11-05
a few days
2012-11-08
17.50
<h2>ApacheCon North America 2013</h2>
Details of the annual ApacheCon meetings held in Europe and
the United States, with registration information and an archive of previous
meetings.
Portland, Oregon
<a shape="rect"
href="http://na.apachecon.com/">na.apachecon.com</a>
2013-02-24
a few days
2013-03-02
17.50
</body></html>
{noformat}
One difference is that we lose the shape attribute with JSoup: the TagSoup output has shape="rect" on every <a> element, and the JSoup output does not.
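
For anyone who wants to check the two outputs mechanically rather than by eye, here is a rough sketch. It is not part of Tika or the attached patches; the helper names (StartTagCollector, collect_start_tags, attribute_differences) are made up for illustration, and it uses Python's standard html.parser only so that it runs without JSoup or TagSoup on the classpath. It walks the start tags of both outputs in order and reports attribute names present on one side only. The two inputs are short excerpts copied from the outputs above.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Records every start tag with its attributes, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


def collect_start_tags(markup):
    """Parse markup and return a list of (tag, attribute dict) pairs."""
    collector = StartTagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


def attribute_differences(left, right):
    """Compare attribute names tag by tag.

    Returns a list of (tag, names only in left, names only in right)
    for every start tag whose attribute names differ. Attribute values
    are not compared. Assumes both inputs contain the same sequence of
    tags and raises ValueError otherwise.
    """
    left_tags = collect_start_tags(left)
    right_tags = collect_start_tags(right)
    if [t for t, _ in left_tags] != [t for t, _ in right_tags]:
        raise ValueError("the two documents do not have the same tag sequence")

    differences = []
    for (tag, left_attrs), (_, right_attrs) in zip(left_tags, right_tags):
        only_left = sorted(set(left_attrs) - set(right_attrs))
        only_right = sorted(set(right_attrs) - set(left_attrs))
        if only_left or only_right:
            differences.append((tag, only_left, only_right))
    return differences


# Excerpts copied from the JSoup and TagSoup outputs quoted above.
JSOUP_EXCERPT = (
    '<ul>\n'
    '<li><a href="http://tika.apache.org/">Apache\nTika</a></li>\n'
    '<li>Microdata</li>\n'
    '</ul>'
)
TAGSOUP_EXCERPT = (
    '<ul> <li><a shape="rect" href="http://tika.apache.org/">Apache\nTika</a></li>\n'
    '<li>Microdata</li>\n'
    '</ul>'
)

if __name__ == "__main__":
    # Prints [('a', [], ['shape'])]: only the TagSoup side carries "shape".
    print(attribute_differences(JSOUP_EXCERPT, TAGSOUP_EXCERPT))
```

Running the same comparison on the full outputs should flag the other two <a> elements in the same way, assuming nothing else differs.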
> Support for HTML5 elements
> --------------------------
>
> Key: TIKA-985
> URL: https://issues.apache.org/jira/browse/TIKA-985
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.2
> Reporter: Markus Jelsma
> Fix For: 1.12
>
> Attachments: TIKA-985-1.3-1.patch, TIKA-985-1.3-2.patch,
> TIKA-985-1.3-3.patch, TIKA-985-1.5.patch
>
>
> TagSoup's schema.tssl does not include some HTML5 elements (e.g. article,
> section). This prevents some custom ContentHandlers from reading expected
> elements and/or attributes.