[ https://issues.apache.org/jira/browse/TIKA-427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871611#action_12871611 ]
Jukka Zitting commented on TIKA-427:
------------------------------------
The type detection code in Tika gets confused by the <!-- comment --> at the
beginning of the file, so the CSS input is detected as XML.

We should probably make the XML detector look beyond any leading comment(s) to
see what the text that follows really looks like. Alternatively, we could catch
early parse errors in the XMLParser class and fall back to TXTParser in such
cases.
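
As a rough illustration of the first option, the sketch below skips leading whitespace and <!-- ... --> comments and only then checks whether the remaining text starts like XML. This is not Tika's detector code: the class LeadingCommentSniffer and its methods skipLeadingComments and looksLikeXml are hypothetical names invented for this example, and a real detector would work on a bounded byte prefix of the stream rather than on a String.

```java
// Hypothetical helper, not part of Tika: illustrates looking past leading
// XML comments before deciding whether a document looks like XML.
public class LeadingCommentSniffer {

    /** Returns the index of the first character after leading whitespace and comments. */
    public static int skipLeadingComments(String text) {
        int pos = 0;
        while (true) {
            // Skip whitespace before (or between) comments.
            while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
                pos++;
            }
            if (!text.startsWith("<!--", pos)) {
                return pos;
            }
            int end = text.indexOf("-->", pos + 4);
            if (end < 0) {
                // Unterminated comment: nothing left to inspect.
                return text.length();
            }
            pos = end + 3;
        }
    }

    /** Returns true if the text after any leading comments starts like an XML document. */
    public static boolean looksLikeXml(String text) {
        int pos = skipLeadingComments(text);
        if (text.startsWith("<?xml", pos) || text.startsWith("<!DOCTYPE", pos)) {
            return true;
        }
        // An element start tag: '<' followed by a letter, '_' or ':'.
        if (pos + 1 < text.length() && text.charAt(pos) == '<') {
            char c = text.charAt(pos + 1);
            return Character.isLetter(c) || c == '_' || c == ':';
        }
        return false;
    }

    public static void main(String[] args) {
        // A stylesheet that opens with an SGML-style comment, as in this issue.
        System.out.println(looksLikeXml("<!-- site styles -->\nbody { color: red; }"));
        // A genuine XML document that opens with a comment.
        System.out.println(looksLikeXml("<!-- note -->\n<root/>"));
    }
}
```

With a check along these lines, a stylesheet that merely opens with a comment would no longer be routed to the XML parser, while XML documents that begin with comments would still be recognized.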
> Parsing CSS as XML
> ------------------
>
> Key: TIKA-427
> URL: https://issues.apache.org/jira/browse/TIKA-427
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.7
> Reporter: Erik Hetzner
> Priority: Minor
>
> Perhaps related to TIKA-426?
> $ curl -s http://datacenter.cit.nih.gov/interface/styles/nihstyles.css | java -jar tika-app-0.7.jar
> Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-237: Illegal SAXException from org.apache.tika.parser.xml.dcxmlpar...@28bb0d0d
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:142)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:99)
>     at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:155)
>     at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:65)
> Caused by: org.xml.sax.SAXParseException: Content is not allowed in prolog.
>     at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:195)
>     at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:174)
>     at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:388)
>     at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1414)
>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:1039)
>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
>     at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
>     at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
>     at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
>     at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
>     at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
>     at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
>     at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
>     at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
>     at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:86)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:132)
>     ... 3 more