[ 
https://issues.apache.org/jira/browse/TIKA-1041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13529958#comment-13529958
 ] 

David Morana commented on TIKA-1041:
------------------------------------

OK, I managed to find the universal charset jar, v1.0.3.
I can get a little further now.
Unfortunately, I'm now getting this Tika parsing error (I'm using Tika v1.2):
null:org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@137f0949
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:215)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1561)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:442)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:263)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:244)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:240)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:161)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:164)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:100)
    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:541)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:383)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:243)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:188)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:166)
    at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:288)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@137f0949
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:209)
    ... 21 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: Illegal offset 8 (String data is of length 8)
    at org.apache.poi.util.StringUtil.getFromUnicodeLE(StringUtil.java:70)
    at org.apache.poi.hdgf.chunks.Chunk.processCommands(Chunk.java:203)
    at org.apache.poi.hdgf.chunks.ChunkFactory.createChunk(ChunkFactory.java:180)
    at org.apache.poi.hdgf.streams.ChunkStream.findChunks(ChunkStream.java:59)
    at org.apache.poi.hdgf.streams.PointerContainingStream.findChildren(PointerContainingStream.java:93)
    at org.apache.poi.hdgf.streams.PointerContainingStream.findChildren(PointerContainingStream.java:100)
    at org.apache.poi.hdgf.streams.PointerContainingStream.findChildren(PointerContainingStream.java:100)
    at org.apache.poi.hdgf.HDGFDiagram.<init>(HDGFDiagram.java:106)
    at org.apache.poi.hdgf.extractor.VisioTextExtractor.<init>(VisioTextExtractor.java:55)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:200)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 24 more
                
> Tika 1.2 universalcharset errors
> --------------------------------
>
>                 Key: TIKA-1041
>                 URL: https://issues.apache.org/jira/browse/TIKA-1041
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.2
>         Environment: I'm running Solr 4.0 with Tika 1.2 on Tomcat 7.0.8 with 
> ManifoldCF v1.1dev 
>            Reporter: David Morana
>             Fix For: 1.2, 1.3
>
>
> This is somewhat confusing and frustrating. I successfully crawled Opentext 
> using all of the above. Then I recrawled, and it aborted almost immediately.
> It choked on images, so I excluded them for now. 
> But now it's choking on txt files! 
> Sometimes I get this error:
> SEVERE: null:java.lang.RuntimeException: java.lang.NoClassDefFoundError: org/mozilla/universalchardet/CharsetListener
> and sometimes I get this one:
> SEVERE: null:java.lang.RuntimeException: java.lang.NoClassDefFoundError: org/apache/tika/parser/txt/UniversalEncodingListener
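
One way to narrow down the two NoClassDefFoundError reports quoted above is to ask the class loader directly whether each class is visible. The following is only a minimal sketch under stated assumptions: the class name ClasspathCheck and its methods are hypothetical helpers (not part of Tika or Solr), and the two class names checked in main are taken from the error messages above, with slashes converted to dots.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Tika or Solr.
public class ClasspathCheck {

    // Returns true if the named class can be located by this class's
    // class loader. The class is loaded but not initialized.
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // LinkageError covers NoClassDefFoundError, e.g. when the class
            // is found but a class it depends on is missing.
            return false;
        }
    }

    // Checks several class names and keeps them in the order given.
    static Map<String, Boolean> check(String... classNames) {
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String name : classNames) {
            result.put(name, isPresent(name));
        }
        return result;
    }

    public static void main(String[] args) {
        // Class names taken from the two errors in the issue description.
        Map<String, Boolean> result = check(
            "org.mozilla.universalchardet.CharsetListener",
            "org.apache.tika.parser.txt.UniversalEncodingListener");
        for (Map.Entry<String, Boolean> e : result.entrySet()) {
            System.out.println(e.getKey() + " -> " + (e.getValue() ? "found" : "MISSING"));
        }
    }
}
```

For the result to be meaningful, the check would need to run with the same jars (and the same class loader) that the Solr webapp uses under Tomcat. Run standalone, it only reflects the classpath given to that JVM.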
