[
https://issues.apache.org/jira/browse/TIKA-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789746#comment-13789746
]
Markus Jelsma commented on TIKA-676:
------------------------------------
Oh, I checked. None of my open issues are directly related to Boilerpipe, only
HTML5 stuff that should be fixed in TagSoup instead. I've submitted additions
to TagSoup's html.tssl, but those are not likely to be incorporated any time soon.
> Boilerpipe fails
> ----------------
>
> Key: TIKA-676
> URL: https://issues.apache.org/jira/browse/TIKA-676
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Gabriele Kahlout
> Priority: Minor
>
> This is apparently a [boilerpipe issue|http://code.google.com/p/boilerpipe/issues/detail?id=24],
> which they fixed in the [Web API edition|http://boilerpipe-web.appspot.com/].
> {code}
> $ curl --fail -L http://thisrecording.com/the-past | java -jar tika-app-0.9.jar -T
> % Total % Received % Xferd Average Speed Time Time Time Current
> Dload Upload Total Spent Left Speed
> 100 65688 0 65688 0 0 17650 0 --:--:-- 0:00:03 --:--:-- 18698
> Exception in thread "main" org.xml.sax.SAXException: SAX input contains nested A elements -- You have probably hit a bug in your HTML parser (e.g., NekoHTML bug #2909310). Please clean the HTML externally and feed it to boilerpipe again
> 100 128k 0 128k 0 0 32019 0 --:--:-- 0:00:04 --:--:-- 33735
> at de.l3s.boilerpipe.sax.CommonTagActions$2.start(CommonTagActions.java:108)
> at de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.startElement(BoilerpipeHTMLContentHandler.java:169)
> at org.apache.tika.parser.html.BoilerpipeContentHandler.startElement(BoilerpipeContentHandler.java:195)
> at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:237)
> at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:279)
> at org.apache.tika.parser.html.HtmlHandler.startElementWithSafeAttributes(HtmlHandler.java:197)
> at org.apache.tika.parser.html.HtmlHandler.startElement(HtmlHandler.java:135)
> at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> at org.apache.tika.parser.html.XHTMLDowngradeHandler.startElement(XHTMLDowngradeHandler.java:61)
> at org.ccil.cowan.tagsoup.Parser.push(Parser.java:794)
> at org.ccil.cowan.tagsoup.Parser.rectify(Parser.java:1061)
> at org.ccil.cowan.tagsoup.Parser.stagc(Parser.java:1016)
> at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:565)
> at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
> at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:198)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
> at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
> at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:288)
> at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:94)
> {code}