[
https://issues.apache.org/jira/browse/TIKA-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ken Krugler closed TIKA-1101.
-----------------------------
Resolution: Not A Problem
Assignee: Ken Krugler
Hi David,
This isn't a Tika bug: HTML can't be parsed by the XML parser, which
(correctly) complains. The problem is caused by trying to parse an HTML
fragment, which the auto-detect parser hands to the XML parser.
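To illustrate why the XML parser rejects this input, here is a minimal sketch that uses only the JDK's built-in SAX parser (no Tika involved). The class name `NbspFragmentDemo`, the helper `xmlParseError`, and the shortened fragment are hypothetical and chosen for illustration only. XML predefines just five entities (`amp`, `lt`, `gt`, `apos`, `quot`), so an `&nbsp;` reference with no DTD declaring it is a fatal well-formedness error, while a numeric character reference such as `&#160;` is accepted.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class NbspFragmentDemo {

    /**
     * Parses the given text with the JDK's SAX parser.
     * Returns the parser's error message, or null if the text is well-formed XML.
     */
    static String xmlParseError(String text) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        try {
            parser.parse(
                    new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)),
                    new DefaultHandler()); // DefaultHandler rethrows fatal errors
            return null;
        } catch (SAXParseException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical, shortened stand-in for the fragment in the report.
        String withEntity = "<div><a href=\"/x.html\">Home&nbsp;</a></div>";
        // Same fragment, using a numeric character reference instead.
        String withNumericRef = "<div><a href=\"/x.html\">Home&#160;</a></div>";

        System.out.println("entity:      " + xmlParseError(withEntity));
        System.out.println("numeric ref: " + xmlParseError(withNumericRef));
    }
}
```

The first call should report an undeclared-entity error like the one in the stack trace; the second should return null, since numeric references need no declaration.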
You can work around this issue by explicitly using the HTML parser instead of
the auto-detect parser, which tries to guess the right parser to use. (I'm not
sure how to do that in ManifoldCF.)
You could also file an enhancement request to have Tika auto-detect HTML even
without a proper header.
-- Ken
> XML parse error caused by org.xml.sax.SAXParseException;The entity "nbsp" was
> referenced, but not declared
> ----------------------------------------------------------------------------------------------------------
>
> Key: TIKA-1101
> URL: https://issues.apache.org/jira/browse/TIKA-1101
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.2, 1.3
> Environment: I'm using Solr 4.0 final with Tika 1.2 and ManifoldCF
> v1.2 dev on Tomcat 7 (RHL)
> Reporter: David Morana
> Assignee: Ken Krugler
> Fix For: 1.3, 1.2
>
>
> Good afternoon,
> The web page below, when crawled by ManifoldCF, causes severe errors in
> Solr and causes ManifoldCF to abort the current job.
> I verified the error by sending the URL to tika-app 1.2 and 1.3.
> I can't find a fix for this.
> Please advise...
> P.S. Can you also provide a list of all the jars Tika depends on (e.g. poi,
> jempbox)?
> Thanks,
> Here's the HTML:
> {code}
> <div id="leftcol">
> <ul>
> <li><a href="/mission/sec/sec.html"> Security and Information
> Sciences Home ›</a> </li>
> <li><a
> href="/mission/sec/publications/-publications.html">Publications ›</a>
> </li>
> <li><a
> href="/mission/sec/corpora/corpora.html">Corpora ›</a> </li>
> <li><a href="/mission/sec/softwaretools/tools.html">Software
> Tools ›</a></li>
> <li><a href="/mission/sec/CSO/CSO.html"> Systems and
> Operations ›</a>
> <ul>
> <li><a
> href="/mission/sec/publications/-publications.html">Publications
> ›</a></li>
> <li><a
> href="/mission/sec/CSO/biographies/CSObios.html">Biographies ›</a></li>
> </ul>
> </li>
> <li><a href="/mission/sec/CST/CST.html"> Systems and
> Technology ›</a> </li>
> <li><a href="/mission/sec/CSA/CSA.html"> System
> Assessments ›</a> </li>
> <li><a href="/mission/sec/HLT/HLT.html">Human Language
> Technology ›</a>
> <li><a href="/mission/sec/computing/computing.html">Computing and
> Analytics ›</a></li>
> </ul>
> </div>
> {code}
> Here's the error:
> {code}
> Apr 03, 2013 4:23:23 PM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: XML parse error
>     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
>     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1699)
>     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276)
>     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
>     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
>     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
>     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
>     at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:581)
>     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
>     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99)
>     at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:936)
>     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
>     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
>     at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1004)
>     at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)
>     at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1686)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>     at java.lang.Thread.run(Thread.java:722)
> Caused by: org.apache.tika.exception.TikaException: XML parse error
>     at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
>     ... 21 more
> Caused by: org.xml.sax.SAXParseException; lineNumber: 4; columnNumber: 105; The entity "nbsp" was referenced, but not declared.
>     at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:198)
>     at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177)
>     at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:441)
>     at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:368)
>     at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1388)
>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEntityReference(XMLDocumentFragmentScannerImpl.java:1861)
>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2994)
>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:607)
>     at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:116)
>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:489)
>     at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:835)
>     at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
>     at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:123)
>     at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1210)
>     at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:568)
>     at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:302)
>     at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
>     at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:72)
>     ... 25 more
> {code}