[ https://issues.apache.org/jira/browse/TIKA-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392248#comment-14392248 ]
Rupert Westenthaler commented on TIKA-676:
------------------------------------------
FYI: This issue is still present. I just recently hit it with the page
http://www.mico-project.eu/mico_team/adam-dahlgren-lindstrom/
and implemented a simple workaround by wrapping boilerpipe with the following handler:
{code}
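import org.apache.tika.sax.ContentHandlerDecorator;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Flattens nested A elements before they reach boilerpipe, which
 * otherwise throws a SAXException on them (TIKA-676).
 */
public class TIKA676WorkaroundHandler extends ContentHandlerDecorator {

    private final Logger log = LoggerFactory.getLogger(getClass());

    public static final String A = "a";

    /** True while an A element is open in the forwarded event stream. */
    private boolean inLink = false;

    public TIKA676WorkaroundHandler(ContentHandler handler) {
        super(handler == null ? new DefaultHandler() : handler);
    }

    @Override
    public void startElement(String elemUri, String localName, String name,
            Attributes atts) throws SAXException {
        if (A.equalsIgnoreCase(localName)) {
            if (inLink) {
                // Nested link: close the open one first so that the
                // wrapped handler never sees an A inside an A.
                log.warn(" - closing open link before next one is starting!");
                endElement(elemUri, localName, name);
            }
            inLink = true;
        }
        super.startElement(elemUri, localName, name, atts);
    }

    @Override
    public void endElement(String uri, String localName, String name)
            throws SAXException {
        if (A.equalsIgnoreCase(localName)) {
            if (inLink) {
                super.endElement(uri, localName, name);
                inLink = false;
            } else {
                // No link is open (it was already closed early), so drop this end tag.
                log.warn(" - ignoring closing link that was missing before");
            }
        } else {
            super.endElement(uri, localName, name);
        }
    }
}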
{code}
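To illustrate what the handler above does to the SAX event stream, here is a minimal sketch that uses only the JDK's org.xml.sax classes, so it runs without Tika or boilerpipe on the classpath. The names NestedLinkDemo, LinkFlattener and Recorder are hypothetical and exist only for this illustration; LinkFlattener is assumed to mirror the link-tracking logic of the workaround, minus the logging. It feeds a nested link followed by a stray closing tag and records what the wrapped handler would receive.

```java
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class NestedLinkDemo {

    /** Hypothetical stand-in for the workaround: same link logic, JDK classes only. */
    static class LinkFlattener extends DefaultHandler {
        private final ContentHandler delegate;
        private boolean inLink = false;

        LinkFlattener(ContentHandler delegate) {
            this.delegate = delegate;
        }

        @Override
        public void startElement(String uri, String localName, String name,
                Attributes atts) throws SAXException {
            if ("a".equalsIgnoreCase(localName)) {
                if (inLink) {
                    // Close the link that is still open before starting the next one.
                    endElement(uri, localName, name);
                }
                inLink = true;
            }
            delegate.startElement(uri, localName, name, atts);
        }

        @Override
        public void endElement(String uri, String localName, String name)
                throws SAXException {
            if ("a".equalsIgnoreCase(localName)) {
                if (!inLink) {
                    return; // stray end tag: the link was already closed
                }
                inLink = false;
            }
            delegate.endElement(uri, localName, name);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            delegate.characters(ch, start, length);
        }
    }

    /** Records every event it receives as a short string. */
    static class Recorder extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String name, Attributes atts) {
            events.add("<" + localName + ">");
        }

        @Override
        public void endElement(String uri, String localName, String name) {
            events.add("</" + localName + ">");
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            events.add(new String(ch, start, length));
        }
    }

    /** Sends a nested link through the flattener and returns the recorded events. */
    static List<String> run() {
        Recorder recorder = new Recorder();
        LinkFlattener handler = new LinkFlattener(recorder);
        Attributes none = new AttributesImpl();
        try {
            handler.startElement("", "a", "a", none);
            text(handler, "outer");
            handler.startElement("", "a", "a", none); // nested link
            text(handler, "inner");
            handler.endElement("", "a", "a");
            handler.endElement("", "a", "a");         // now a stray end tag
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return recorder.events;
    }

    private static void text(ContentHandler handler, String s) throws SAXException {
        handler.characters(s.toCharArray(), 0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [<a>, outer, </a>, <a>, inner, </a>]
    }
}
```

The wrapped handler only ever sees sibling links, which is the property boilerpipe checks for. The trade-off is that the text of the outer link that follows the inner one would no longer be reported as link text.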
> Boilerpipe fails
> ----------------
>
> Key: TIKA-676
> URL: https://issues.apache.org/jira/browse/TIKA-676
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Gabriele Kahlout
> Priority: Minor
>
> This is apparently a [boilerpipe issue|http://code.google.com/p/boilerpipe/issues/detail?id=24],
> which they fixed in the [Web API edition|http://boilerpipe-web.appspot.com/].
> {code}
> $ curl --fail -L http://thisrecording.com/the-past | java -jar tika-app-0.9.jar -T
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
> 100 65688    0 65688    0     0  17650      0 --:--:--  0:00:03 --:--:-- 18698
> 100  128k    0  128k    0     0  32019      0 --:--:--  0:00:04 --:--:-- 33735
> Exception in thread "main" org.xml.sax.SAXException: SAX input contains nested A elements -- You have probably hit a bug in your HTML parser (e.g., NekoHTML bug #2909310). Please clean the HTML externally and feed it to boilerpipe again
>     at de.l3s.boilerpipe.sax.CommonTagActions$2.start(CommonTagActions.java:108)
>     at de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.startElement(BoilerpipeHTMLContentHandler.java:169)
>     at org.apache.tika.parser.html.BoilerpipeContentHandler.startElement(BoilerpipeContentHandler.java:195)
>     at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>     at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>     at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>     at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>     at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:237)
>     at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:279)
>     at org.apache.tika.parser.html.HtmlHandler.startElementWithSafeAttributes(HtmlHandler.java:197)
>     at org.apache.tika.parser.html.HtmlHandler.startElement(HtmlHandler.java:135)
>     at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>     at org.apache.tika.parser.html.XHTMLDowngradeHandler.startElement(XHTMLDowngradeHandler.java:61)
>     at org.ccil.cowan.tagsoup.Parser.push(Parser.java:794)
>     at org.ccil.cowan.tagsoup.Parser.rectify(Parser.java:1061)
>     at org.ccil.cowan.tagsoup.Parser.stagc(Parser.java:1016)
>     at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:565)
>     at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
>     at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:198)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
>     at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
>     at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:288)
>     at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:94)
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)