[ 
https://issues.apache.org/jira/browse/SOLR-11142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olivier M updated SOLR-11142:
-----------------------------
    Description: 
When adding MSG files that have attachments, we consistently get this error:


{code:java}
ERROR (qtp1013423070-16) [   x:default] o.a.s.s.HttpSolrCall null:org.apache.poi.poifs.filesystem.NotOLE2FileException: Invalid header signature; read 0x0A1A0A0D474E5089, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document
        at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:162)
        at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:112)
        at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:302)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:111)
        at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
        at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:103)
        at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:129)
        at org.apache.tika.parser.microsoft.OutlookExtractor.parse(OutlookExtractor.java:238)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:170)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:69)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:155)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:2082)
        at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:651)
        at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:458)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:229)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:184)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
        at org.eclipse.jetty.server.Server.handle(Server.java:499)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
        at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
        at java.lang.Thread.run(Thread.java:745)
{code}
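
The two signatures in the exception message hint at what the attachment actually was. The sketch below is only an illustration, assuming the header check compares the first eight bytes of the file as a little-endian long (the expected value 0xE11AB1A1E011CFD0 is consistent with that, since the OLE2 magic bytes are D0 CF 11 E0 A1 B1 1A E1). The class and method names are hypothetical and not part of POI or Tika.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class HeaderSignature {
    // Both values are copied from the exception message above.
    static final long READ = 0x0A1A0A0D474E5089L;
    static final long EXPECTED = 0xE11AB1A1E011CFD0L;

    /** Returns the 8 bytes in file order, assuming a little-endian long. */
    static byte[] fileBytes(long signature) {
        return ByteBuffer.allocate(8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(signature)
                .array();
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("read:     " + hex(fileBytes(READ)));
        System.out.println("expected: " + hex(fileBytes(EXPECTED)));
    }
}
```

Under that assumption the bytes that were read are 89 50 4E 47 0D 0A 1A 0A, which is the standard PNG file signature, so the attachment in this case appears to have been a PNG image rather than an Office document.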

After inspecting the Solr code, the problem seems to come from here:

In the ExtractingDocumentLoader class we have:

{code:java}
context.set(Parser.class, parser);
{code}

In our case the parser is an instance of OfficeParser.

When processing an MSG file, the OutlookExtractor class is used by the 
OfficeParser.

To process the attachments of the MSG file, the OutlookExtractor calls the 
ParsingEmbeddedDocumentExtractor.

To parse an attachment, the ParsingEmbeddedDocumentExtractor uses the 
DelegatingParser.

The DelegatingParser determines the parser to use by just looking at the parser 
set in the context.


{code:java}
protected Parser getDelegateParser(ParseContext context) {
    return context.get(Parser.class, EmptyParser.INSTANCE);
}
{code}


So in our case, every attachment is processed with the OfficeParser, even 
if the attachment is not an MS Office document!

To make this work correctly, an AutoDetectParser should be set in the context 
when working with MSG files:

{code:java}
context.set(Parser.class, new AutoDetectParser());
{code}
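
The difference between the two context settings can be shown with a small self-contained model. This is a sketch only: SimpleParser, SimpleContext, OFFICE_ONLY and AUTO_DETECT are hypothetical stand-ins written for this illustration, not Tika classes, and the detection logic is reduced to two magic-byte checks.

```java
import java.util.HashMap;
import java.util.Map;

public class DelegationSketch {
    /** Stand-in for a parser: returns a label, or throws on unsupported input. */
    interface SimpleParser {
        String parse(byte[] data);
    }

    /** Stand-in for a parse context: one value per class key, with a default. */
    static class SimpleContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value != null ? key.cast(value) : defaultValue;
        }
    }

    static boolean startsWith(byte[] data, int... magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }

    static boolean isOle2(byte[] d) {
        return startsWith(d, 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1);
    }

    static boolean isPng(byte[] d) {
        return startsWith(d, 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A);
    }

    /** Accepts only OLE2 input, like the OfficeParser in the stack trace. */
    static final SimpleParser OFFICE_ONLY = data -> {
        if (!isOle2(data)) {
            throw new IllegalStateException("not a valid OLE2 document");
        }
        return "office";
    };

    /** Chooses a handler from the leading bytes, in the spirit of auto-detection. */
    static final SimpleParser AUTO_DETECT = data -> {
        if (isOle2(data)) {
            return OFFICE_ONLY.parse(data);
        }
        return isPng(data) ? "png" : "unknown";
    };

    static final SimpleParser EMPTY = data -> "empty";

    /** Mirrors getDelegateParser: the attachment gets whatever the context holds. */
    static String parseEmbedded(SimpleContext context, byte[] attachment) {
        return context.get(SimpleParser.class, EMPTY).parse(attachment);
    }

    public static void main(String[] args) {
        byte[] png = {(byte) 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A};
        SimpleContext context = new SimpleContext();

        // Context holds the format-specific parser: the PNG attachment fails.
        context.set(SimpleParser.class, OFFICE_ONLY);
        try {
            parseEmbedded(context, png);
        } catch (IllegalStateException e) {
            System.out.println("fixed parser: " + e.getMessage());
        }

        // Context holds the detecting parser: the attachment is routed by type.
        context.set(SimpleParser.class, AUTO_DETECT);
        System.out.println("auto-detect: " + parseEmbedded(context, png));
    }
}
```

In this model the embedded-document step never inspects the attachment itself; it only asks the context for a parser, which is why the value stored in the context decides whether non-Office attachments can be handled.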








> NotOLE2FileException when adding MSG files with attachments
> -----------------------------------------------------------
>
>                 Key: SOLR-11142
>                 URL: https://issues.apache.org/jira/browse/SOLR-11142
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: contrib - Solr Cell (Tika extraction)
>    Affects Versions: 5.5.1
>         Environment: Not platform related
>            Reporter: Olivier M
>              Labels: msg, office, parser, tika
>


