Boilerpipe fails
----------------

                 Key: TIKA-676
                 URL: https://issues.apache.org/jira/browse/TIKA-676
             Project: Tika
          Issue Type: Bug
            Reporter: Gabriele Kahlout
            Priority: Minor
             Fix For: 1.0


This is apparently a boilerpipe issue, which they have already fixed in the [Web API edition|http://boilerpipe-web.appspot.com/].
{code}
$ curl --fail -L http://thisrecording.com/the-past | java -jar tika-app-0.9.jar -T
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 65688    0 65688    0     0  17650      0 --:--:--  0:00:03 --:--:-- 18698
100  128k    0  128k    0     0  32019      0 --:--:--  0:00:04 --:--:-- 33735
Exception in thread "main" org.xml.sax.SAXException: SAX input contains nested A elements -- You have probably hit a bug in your HTML parser (e.g., NekoHTML bug #2909310). Please clean the HTML externally and feed it to boilerpipe again
        at de.l3s.boilerpipe.sax.CommonTagActions$2.start(CommonTagActions.java:108)
        at de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.startElement(BoilerpipeHTMLContentHandler.java:169)
        at org.apache.tika.parser.html.BoilerpipeContentHandler.startElement(BoilerpipeContentHandler.java:195)
        at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
        at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
        at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
        at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
        at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:237)
        at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:279)
        at org.apache.tika.parser.html.HtmlHandler.startElementWithSafeAttributes(HtmlHandler.java:197)
        at org.apache.tika.parser.html.HtmlHandler.startElement(HtmlHandler.java:135)
        at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
        at org.apache.tika.parser.html.XHTMLDowngradeHandler.startElement(XHTMLDowngradeHandler.java:61)
        at org.ccil.cowan.tagsoup.Parser.push(Parser.java:794)
        at org.ccil.cowan.tagsoup.Parser.rectify(Parser.java:1061)
        at org.ccil.cowan.tagsoup.Parser.stagc(Parser.java:1016)
        at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:565)
        at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
        at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:198)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:288)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:94)
{code}
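
For anyone triaging similar pages, the condition named in the exception message (an A element opened while another A is still open) can be checked with a small SAX handler. The following is only a sketch and is not boilerpipe or Tika code: the class and method names (NestedAnchorCheck, AnchorDepthHandler, countNestedAnchors) are made up for illustration, and it assumes well-formed XHTML fed to the JDK's built-in SAX parser rather than TagSoup output.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika or boilerpipe.
public class NestedAnchorCheck {

    // Tracks how many <a> elements are currently open and counts
    // every <a> start tag seen while another <a> is still open.
    static class AnchorDepthHandler extends DefaultHandler {
        private int depth = 0;
        int nestedCount = 0;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            if ("a".equalsIgnoreCase(qName)) {
                if (depth > 0) {
                    nestedCount++;
                }
                depth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("a".equalsIgnoreCase(qName)) {
                depth--;
            }
        }
    }

    // Parses the given well-formed XHTML fragment and returns the number
    // of <a> elements that start inside another <a>.
    public static int countNestedAnchors(String xhtml) throws Exception {
        AnchorDepthHandler handler = new AnchorDepthHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.nestedCount;
    }

    public static void main(String[] args) throws Exception {
        String nested = "<p><a href=\"x\">outer <a href=\"y\">inner</a></a></p>";
        String flat = "<p><a href=\"x\">one</a> <a href=\"y\">two</a></p>";
        System.out.println(countNestedAnchors(nested)); // 1
        System.out.println(countNestedAnchors(flat));   // 0
    }
}
```

A non-zero count on the SAX events for a page would suggest the same kind of input that triggers the exception above.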


        
