[ 
https://issues.apache.org/jira/browse/TIKA-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13789746#comment-13789746
 ] 

Markus Jelsma commented on TIKA-676:
------------------------------------

Oh, I checked. None of my open issues are directly related to Boilerpipe; they 
only concern HTML5 handling that should be fixed in TagSoup instead. I've 
submitted additions to TagSoup's html.tssl, but they're not likely to be 
incorporated any time soon.

> Boilerpipe fails
> ----------------
>
>                 Key: TIKA-676
>                 URL: https://issues.apache.org/jira/browse/TIKA-676
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Gabriele Kahlout
>            Priority: Minor
>
> This is apparently a [boilerpipe issue|http://code.google.com/p/boilerpipe/issues/detail?id=24],
> which they fixed in the [Web API edition|http://boilerpipe-web.appspot.com/].
> {code}
> $ curl --fail -L http://thisrecording.com/the-past | java -jar tika-app-0.9.jar -T
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
> 100 65688    0 65688    0     0  17650      0 --:--:--  0:00:03 --:--:-- 18698
> Exception in thread "main" org.xml.sax.SAXException: SAX input contains nested A elements -- You have probably hit a bug in your HTML parser (e.g., NekoHTML bug #2909310). Please clean the HTML externally and feed it to boilerpipe again
> 100  128k    0  128k    0     0  32019      0 --:--:--  0:00:04 --:--:-- 33735
>       at de.l3s.boilerpipe.sax.CommonTagActions$2.start(CommonTagActions.java:108)
>       at de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.startElement(BoilerpipeHTMLContentHandler.java:169)
>       at org.apache.tika.parser.html.BoilerpipeContentHandler.startElement(BoilerpipeContentHandler.java:195)
>       at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>       at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>       at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>       at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>       at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:237)
>       at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:279)
>       at org.apache.tika.parser.html.HtmlHandler.startElementWithSafeAttributes(HtmlHandler.java:197)
>       at org.apache.tika.parser.html.HtmlHandler.startElement(HtmlHandler.java:135)
>       at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>       at org.apache.tika.parser.html.XHTMLDowngradeHandler.startElement(XHTMLDowngradeHandler.java:61)
>       at org.ccil.cowan.tagsoup.Parser.push(Parser.java:794)
>       at org.ccil.cowan.tagsoup.Parser.rectify(Parser.java:1061)
>       at org.ccil.cowan.tagsoup.Parser.stagc(Parser.java:1016)
>       at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:565)
>       at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
>       at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:198)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
>       at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
>       at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:288)
>       at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:94)
> {code}
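
The exception message above asks for the HTML to be cleaned before it reaches boilerpipe. One way this might be done is a SAX filter that drops the start and end tags of any A element nested inside another A, while letting its text through. The sketch below is only an illustration under stated assumptions: NestedAnchorFilter and maxAnchorDepth are hypothetical names, not part of Tika, TagSoup or boilerpipe; the demo uses the JDK's built-in XML parser on well-formed input rather than TagSoup; and how such a filter would be wired into Tika's handler chain is not shown.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: forwards all SAX events except the tags of nested A elements.
class NestedAnchorFilter extends XMLFilterImpl {
    private int anchorDepth = 0; // how many A elements are currently open upstream

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (isAnchor(localName, qName)) {
            anchorDepth++;
            if (anchorDepth > 1) {
                return; // nested A: drop the tag, its character data still passes through
            }
        }
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (isAnchor(localName, qName)) {
            boolean nested = anchorDepth > 1;
            if (anchorDepth > 0) {
                anchorDepth--;
            }
            if (nested) {
                return; // matching end tag of a dropped nested A
            }
        }
        super.endElement(uri, localName, qName);
    }

    private static boolean isAnchor(String localName, String qName) {
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        return "a".equalsIgnoreCase(name);
    }

    // Demo helper (hypothetical): deepest A nesting seen by the downstream handler.
    static int maxAnchorDepth(String xml, boolean useFilter) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        final int[] depth = {0, 0}; // [current depth, maximum depth]
        DefaultHandler counter = new DefaultHandler() {
            @Override
            public void startElement(String u, String l, String q, Attributes a) {
                if ("a".equalsIgnoreCase(l)) {
                    depth[0]++;
                    depth[1] = Math.max(depth[1], depth[0]);
                }
            }

            @Override
            public void endElement(String u, String l, String q) {
                if ("a".equalsIgnoreCase(l)) {
                    depth[0]--;
                }
            }
        };
        InputSource input = new InputSource(new StringReader(xml));
        if (useFilter) {
            NestedAnchorFilter filter = new NestedAnchorFilter();
            filter.setParent(reader);          // events flow reader -> filter -> counter
            filter.setContentHandler(counter);
            filter.parse(input);
        } else {
            reader.setContentHandler(counter);
            reader.parse(input);
        }
        return depth[1];
    }
}
```

Calling maxAnchorDepth on a fragment with one A inside another, once with and once without the filter, shows the downstream handler seeing a single level of A with the filter in place. Whether dropping the inner tag is acceptable depends on whether the inner link target matters to the caller.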



--
This message was sent by Atlassian JIRA
(v6.1#6144)
