Sergey Smolyakov created TIKA-3214:
--------------------------------------

             Summary: Tika Fails to extract content from MS Word
                 Key: TIKA-3214
                 URL: https://issues.apache.org/jira/browse/TIKA-3214
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.24.1
            Reporter: Sergey Smolyakov
         Attachments: 200MBFile.zip

Extracting content from [^200MBFile.zip] fails with an exception:
TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser
h5. Code for reproducing:
{code:java}
    public static void main(String[] args) throws Exception {
        IOUtils.setByteArrayMaxOverride(Integer.MAX_VALUE);
        FileInputStream fileInputStream = new FileInputStream(new File("200MBFile.doc"));
        String content = extractContent(fileInputStream);
        System.out.println(content);
    }

    public static String extractContent(InputStream stream)
            throws IOException, TikaException, SAXException {
        Parser parser = new AutoDetectParser();
        ContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
        parser.parse(stream, handler, metadata, context);
        return handler.toString();
    }
{code}
h5. Actual result:
{code:java}
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@6b95f8e
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
	at com.apacheTikaService.TikaAnalysis.extractContentUsingParser(TikaAnalysis.java:28)
	at com.apacheTikaService.Servlets.ApacheTikaParserServlet.doPost(ApacheTikaParserServlet.java:22)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
	at org.eclipse.jetty.servlet.ServletHolder$NotAsync.service(ServletHolder.java:1418)
	at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:763)
	at org.eclipse.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1633)
	at org.eclipse.jetty.websocket.server.WebSocketUpgradeFilter.doFilter(WebSocketUpgradeFilter.java:228)
	at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
	at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1609)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:561)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:602)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1612)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1434)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:501)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1582)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1349)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191)
	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
	at org.eclipse.jetty.server.Server.handle(Server.java:516)
	at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:383)
	at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:556)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:375)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:273)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
	at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:135)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:773)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:905)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
	at java.util.ArrayList.elementData(ArrayList.java:422)
	at java.util.ArrayList.get(ArrayList.java:435)
	at org.apache.poi.hwpf.usermodel.Range.binarySearchEnd(Range.java:913)
	at org.apache.poi.hwpf.usermodel.Range.findRange(Range.java:962)
	at org.apache.poi.hwpf.usermodel.Range.initCharacterRuns(Range.java:857)
	at org.apache.poi.hwpf.usermodel.Range.numCharacterRuns(Range.java:303)
	at org.apache.poi.hwpf.model.PicturesTable.getAllPictures(PicturesTable.java:226)
	at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:733)
	at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:723)
	at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:175)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:176)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:132)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
	... 44 more
{code}
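The TikaException in the trace is only a wrapper; the underlying failure is the ArrayIndexOutOfBoundsException listed under "Caused by", raised inside POI's Range.binarySearchEnd. As a minimal sketch of how a caller might surface that root cause when logging, the helper below walks the cause chain. The class and method names (RootCauseDemo, rootCause) are hypothetical and are not part of Tika or POI, and the exception built in main is a stand-in for the real failure, not output from the attached file.

```java
public class RootCauseDemo {

    // Hypothetical helper (not part of Tika or POI): follows getCause() links
    // until the innermost exception is reached.
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        // Bound the walk so a malformed, cyclic cause chain cannot loop forever.
        for (int depth = 0; depth < 100 && current.getCause() != null; depth++) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-in for the failure in this report: a wrapper exception whose
        // cause is an ArrayIndexOutOfBoundsException.
        Exception wrapped = new Exception(
                "Unexpected RuntimeException from OfficeParser",
                new ArrayIndexOutOfBoundsException("-1"));
        System.out.println(rootCause(wrapped));
    }
}
```

In the reproduction code, the same helper could be applied to the TikaException caught around parser.parse(...) to report the POI-level exception directly.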
h5. Expected result:

The content is extracted successfully.
h5. Related bug:

[https://bz.apache.org/bugzilla/show_bug.cgi?id=64853]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
