Sergey Smolyakov created TIKA-3214:
--------------------------------------
Summary: Tika Fails to extract content from MS Word
Key: TIKA-3214
URL: https://issues.apache.org/jira/browse/TIKA-3214
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.24.1
Reporter: Sergey Smolyakov
Attachments: 200MBFile.zip
Extracting content from [^200MBFile.zip] fails with an exception:
TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser
h5. Code for reproducing:
{code:java}
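import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.poi.util.IOUtils;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class Tika3214Repro { // class name is illustrative
    public static void main(String[] args) throws Exception {
        IOUtils.setByteArrayMaxOverride(Integer.MAX_VALUE); // raise POI's byte-array allocation limit
        try (InputStream stream = new FileInputStream("200MBFile.doc")) {
            System.out.println(extractContent(stream));
        }
    }

    public static String extractContent(InputStream stream)
            throws IOException, TikaException, SAXException {
        Parser parser = new AutoDetectParser();
        ContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
        parser.parse(stream, handler, metadata, context);
        return handler.toString();
    }
}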
{code}
h5. Actual result:
{code:java}
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@6b95f8e
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at com.apacheTikaService.TikaAnalysis.extractContentUsingParser(TikaAnalysis.java:28)
    at com.apacheTikaService.Servlets.ApacheTikaParserServlet.doPost(ApacheTikaParserServlet.java:22)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
    at org.eclipse.jetty.servlet.ServletHolder$NotAsync.service(ServletHolder.java:1418)
    at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:763)
    at org.eclipse.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1633)
    at org.eclipse.jetty.websocket.server.WebSocketUpgradeFilter.doFilter(WebSocketUpgradeFilter.java:228)
    at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
    at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1609)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:561)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:602)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1612)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1434)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:501)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1582)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1349)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
    at org.eclipse.jetty.server.Server.handle(Server.java:516)
    at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:383)
    at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:556)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:375)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:273)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
    at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:135)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:773)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:905)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
    at java.util.ArrayList.elementData(ArrayList.java:422)
    at java.util.ArrayList.get(ArrayList.java:435)
    at org.apache.poi.hwpf.usermodel.Range.binarySearchEnd(Range.java:913)
    at org.apache.poi.hwpf.usermodel.Range.findRange(Range.java:962)
    at org.apache.poi.hwpf.usermodel.Range.initCharacterRuns(Range.java:857)
    at org.apache.poi.hwpf.usermodel.Range.numCharacterRuns(Range.java:303)
    at org.apache.poi.hwpf.model.PicturesTable.getAllPictures(PicturesTable.java:226)
    at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:733)
    at org.apache.tika.parser.microsoft.WordExtractor$PicturesSource.<init>(WordExtractor.java:723)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:175)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:176)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:132)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    ... 44 more
{code}
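The TikaException in the trace above only wraps the real failure; the nested "Caused by" entry (an ArrayIndexOutOfBoundsException raised inside POI's Range) is the part that points at the bug. As a rough sketch of how a caller could surface that innermost cause when logging, using only the JDK: the names `RootCauses` and `rootCauseOf` are hypothetical helpers made up for illustration, not Tika or POI APIs, and the exception built in `main` is a stand-in rather than a real parse failure.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

// Hypothetical helper for illustration; not part of Tika or POI.
final class RootCauses {
    private RootCauses() {}

    /** Follows getCause() links to the innermost throwable, stopping if a cycle is detected. */
    static Throwable rootCauseOf(Throwable t) {
        Set<Throwable> seen =
                Collections.newSetFromMap(new IdentityHashMap<Throwable, Boolean>());
        Throwable current = t;
        // seen.add returns false when a throwable is revisited, which ends the walk.
        while (current.getCause() != null && seen.add(current)) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-in for the wrapped failure in this report; no Tika call is made here.
        Exception wrapped = new RuntimeException(
                "Unexpected RuntimeException from parser",
                new ArrayIndexOutOfBoundsException("-1"));
        Throwable root = rootCauseOf(wrapped);
        System.out.println(root.getClass().getName() + ": " + root.getMessage());
    }
}
```

In a real service the same helper could be called in the catch block around `parser.parse(...)`, so the log shows the POI frame rather than only the generic wrapper message.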
h5. Expected result:
The content is extracted successfully, without an exception.
h5. Related bug:
[https://bz.apache.org/bugzilla/show_bug.cgi?id=64853]