[ https://issues.apache.org/jira/browse/TIKA-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16564045#comment-16564045 ]
Tim Allison edited comment on TIKA-2700 at 7/31/18 5:39 PM:
------------------------------------------------------------
See TIKA-1599 for a discussion of switching to JSoup. The big drawback (at
least last I looked) is that JSoup only supported DOM parsing and didn't have
a SAX-like interface. But, right, a currently supported and fixable library is
better than relying on an unsupported one, even though TagSoup is pretty awesome.
Recommendations for a better HTML parser in Java with an Apache License 2.0
compatible license?
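
For illustration, one way around a DOM-only parser is to walk the finished tree and replay it as SAX events, since Tika parsers write to a SAX ContentHandler. The sketch below is only that idea in miniature: it uses the JDK's org.w3c.dom tree as a stand-in for a JSoup document (JSoup itself is not used here), and the class and method names (DomToSaxSketch, emit, trace) are hypothetical. It also shows the cost: the whole tree is built before the first event fires.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: replays an already-built DOM tree as SAX events.
final class DomToSaxSketch {

    // Depth-first walk: start event, children, end event.
    static void emit(Node node, ContentHandler handler) throws SAXException {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE: {
                handler.startDocument();
                emitChildren(node, handler);
                handler.endDocument();
                break;
            }
            case Node.ELEMENT_NODE: {
                String name = node.getNodeName();
                AttributesImpl atts = new AttributesImpl();
                NamedNodeMap map = node.getAttributes();
                for (int i = 0; i < map.getLength(); i++) {
                    Node a = map.item(i);
                    atts.addAttribute("", a.getNodeName(), a.getNodeName(),
                            "CDATA", a.getNodeValue());
                }
                handler.startElement("", name, name, atts);
                emitChildren(node, handler);
                handler.endElement("", name, name);
                break;
            }
            case Node.TEXT_NODE:
            case Node.CDATA_SECTION_NODE: {
                char[] chars = node.getNodeValue().toCharArray();
                handler.characters(chars, 0, chars.length);
                break;
            }
            default:
                break; // comments, processing instructions, etc. are skipped in this sketch
        }
    }

    static void emitChildren(Node parent, ContentHandler handler) throws SAXException {
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            emit(c, handler);
        }
    }

    // Demo only: parses well-formed XHTML with the JDK parser and records the events.
    static String trace(String xhtml) {
        StringBuilder out = new StringBuilder();
        try {
            Node doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            emit(doc, new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName, Attributes a) {
                    out.append('<').append(qName).append('>');
                }

                @Override
                public void endElement(String uri, String local, String qName) {
                    out.append("</").append(qName).append('>');
                }

                @Override
                public void characters(char[] ch, int start, int length) {
                    out.append(ch, start, length);
                }
            });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }
}
```

A real replacement would walk the HTML parser's own node types and would still need to deal with namespaces and malformed markup, which this sketch leaves out.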
was (Author: [email protected]):
See TIKA-1599 for a discussion of switching to JSoup. The big drawback (at
least last I looked) is that JSoup only supported DOM parsing and didn't have
a SAX-like interface.
Recommendations for better HTML parsing in Java with an ASF 2.0 friendly
license?
> The HTML parser should parse the contents of the title tag as raw text, not
> HTML
> --------------------------------------------------------------------------------
>
> Key: TIKA-2700
> URL: https://issues.apache.org/jira/browse/TIKA-2700
> Project: Tika
> Issue Type: Bug
> Reporter: Gerard Bouchar
> Priority: Major
> Attachments: title.html
>
>
> The current HTML parser in Tika fails to extract the correct document title
> when the title contains at least one unescaped '<' character.
>
> For instance, in the following HTML document:
> {code:html}
> <html><title>title with a <b>tag</b> in it</title><body></body></html>
> {code}
> the extracted title is
> {code}
> title with a
> {code}
> Browsers, however, follow the [HTML parsing
> specification|https://www.w3.org/TR/2011/WD-html5-20110113/tokenization.html#parsing-main-inhead],
> and display this title as
> {code}
> title with a <b>tag</b> in it
> {code}
> (with a literal _<b>_)
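
The behaviour requested in the issue can be sketched roughly as follows: once the title start tag is seen, everything up to the next title end tag is taken as text, and any other '<' is kept literally. This is a minimal illustration under simplifying assumptions, not Tika's or TagSoup's code. The class and method names are hypothetical, character references such as &amp; are not decoded, and comments, scripts, and attribute values containing '>' are ignored.

```java
import java.util.Optional;

// Hypothetical helper: reads the content of <title> as raw text, not as markup.
final class RawTitleSketch {

    // Case-insensitive indexOf that avoids toLowerCase() (which can change string length).
    static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    // A tag name must be followed by whitespace, '/' or '>', so "<titles>" does not count.
    static boolean endsTagName(String s, int pos) {
        if (pos >= s.length()) {
            return false;
        }
        char c = s.charAt(pos);
        return c == '>' || c == '/' || Character.isWhitespace(c);
    }

    // Finds the next occurrence of the given tag prefix ("<title" or "</title").
    static int findTag(String s, String prefix, int from) {
        int at = indexOfIgnoreCase(s, prefix, from);
        while (at >= 0 && !endsTagName(s, at + prefix.length())) {
            at = indexOfIgnoreCase(s, prefix, at + 1);
        }
        return at;
    }

    static Optional<String> extractTitle(String html) {
        int open = findTag(html, "<title", 0);
        if (open < 0) {
            return Optional.empty();
        }
        int contentStart = html.indexOf('>', open);
        if (contentStart < 0) {
            return Optional.empty();
        }
        contentStart++;
        // Only a matching end tag closes the title; "<b>" and "</b>" stay in the text.
        int close = findTag(html, "</title", contentStart);
        String raw = close < 0
                ? html.substring(contentStart)
                : html.substring(contentStart, close);
        return Optional.of(raw);
    }
}
```

For the document in the report, extractTitle returns the full text including the literal <b> and </b>, where the reported parser output stops after "title with a ".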
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)