Olivier M created SOLR-11142:
--------------------------------
Summary: NotOLE2FileException when adding MSG files with attachments
Key: SOLR-11142
URL: https://issues.apache.org/jira/browse/SOLR-11142
Project: Solr
Issue Type: Bug
Security Level: Public (Default Security Level. Issues are Public)
Components: contrib - Solr Cell (Tika extraction)
Affects Versions: 5.5.1
Environment: Not platform related
Reporter: Olivier M
When adding MSG files that have attachments, we systematically get this error:
{code:java}
ERROR (qtp1013423070-16) [ x:default] o.a.s.s.HttpSolrCall null:org.apache.poi.poifs.filesystem.NotOLE2FileException: Invalid header signature; read 0x0A1A0A0D474E5089, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document
    at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:162)
    at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:112)
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:302)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:111)
    at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
    at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:103)
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:129)
    at org.apache.tika.parser.microsoft.OutlookExtractor.parse(OutlookExtractor.java:238)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:170)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:69)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:155)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2082)
    at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:651)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:458)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:229)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:184)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
    at org.eclipse.jetty.server.Server.handle(Server.java:499)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
    at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
    at java.lang.Thread.run(Thread.java:745)
{code}
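The two signatures in the message can be decoded to see what kind of file was actually handed to the OLE2 reader. The sketch below assumes the reported values are the first eight bytes of the file read as a little-endian long (the expected value is consistent with this, since it yields the well-known OLE2 magic bytes). The class and method names are illustrative only and are not part of POI or Tika.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class HeaderSignature {
    // Expected OLE2 signature, copied from the error message above.
    static final long OLE2_SIGNATURE = 0xE11AB1A1E011CFD0L;

    /** Returns the 8 bytes, in file order, that yield the given little-endian long. */
    static byte[] toFileOrder(long signature) {
        return ByteBuffer.allocate(8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(signature)
                .array();
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Signature that was read from the attachment.
        System.out.println(hex(toFileOrder(0x0A1A0A0D474E5089L)));
        // Signature the OLE2 reader expected.
        System.out.println(hex(toFileOrder(OLE2_SIGNATURE)));
    }
}
```

The bytes that were read, 89 50 4E 47 0D 0A 1A 0A, are the standard PNG file signature, so the attachment in this trace appears to be a PNG image rather than an OLE2 document.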
After inspecting the Solr code, the problem seems to come from the
ExtractingDocumentLoader class, where we have:
{code:java}
context.set(Parser.class, parser);
{code}
In our case the parser is an instance of OfficeParser.
When processing an MSG file, the OutlookExtractor class is used by the
OfficeParser.
To process the attachments of the MSG file, the OutlookExtractor calls the
ParsingEmbeddedDocumentExtractor.
To parse an attachment, the ParsingEmbeddedDocumentExtractor uses the
DelegatingParser.
The DelegatingParser determines which parser to use solely by looking up the
parser set in the context:
{code:java}
protected Parser getDelegateParser(ParseContext context) {
return context.get(Parser.class, EmptyParser.INSTANCE);
}
{code}
So in our case every attachment is processed with the OfficeParser, even if
the attachment is not an MS Office document.
To make this work correctly, an AutoDetectParser should be set in the context
when working with MSG files:
{code:java}
context.set(Parser.class, new AutoDetectParser());
{code}
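The delegation behaviour described above can be illustrated with a small stand-alone model. This is only a sketch: ToyParser, ToyContext, OFFICE_ONLY and AUTO_DETECT are hypothetical stand-ins invented for the illustration, not Tika classes, and file-name suffixes stand in for real content detection. It shows that the parser stored in the context decides how every attachment is handled.

```java
import java.util.HashMap;
import java.util.Map;

public class DelegationSketch {
    /** Hypothetical stand-in for a parser: returns a label or throws on unsupported input. */
    interface ToyParser {
        String parse(String fileName);
    }

    /** Hypothetical type-keyed context with a default value, modelled on the lookup shown above. */
    static class ToyContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value == null ? defaultValue : key.cast(value);
        }
    }

    /** Fallback used when nothing is set in the context; it extracts nothing. */
    static final ToyParser EMPTY = name -> "";

    /** Accepts only OLE2-style files and rejects everything else. */
    static final ToyParser OFFICE_ONLY = name -> {
        if (!name.endsWith(".msg") && !name.endsWith(".doc")) {
            throw new IllegalStateException("not OLE2: " + name);
        }
        return "office:" + name;
    };

    /** Picks a handler per file type (here by suffix, purely for illustration). */
    static final ToyParser AUTO_DETECT = name ->
            name.endsWith(".png") ? "image:" + name : OFFICE_ONLY.parse(name);

    /** Attachments are parsed with whatever parser the context holds. */
    static String parseAttachment(ToyContext context, String attachment) {
        return context.get(ToyParser.class, EMPTY).parse(attachment);
    }

    public static void main(String[] args) {
        ToyContext current = new ToyContext();
        current.set(ToyParser.class, OFFICE_ONLY);
        try {
            parseAttachment(current, "logo.png");
        } catch (IllegalStateException e) {
            System.out.println("failed: " + e.getMessage());
        }

        ToyContext proposed = new ToyContext();
        proposed.set(ToyParser.class, AUTO_DETECT);
        System.out.println(parseAttachment(proposed, "logo.png"));
    }
}
```

With the office-only parser in the context the PNG attachment is rejected, while the auto-detecting parser routes it to a suitable handler, which mirrors the change proposed above.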