Hi,

Sorry to bother you. I do see the commit, but is there any hint on when it will 
be released?

I assume it will be in a new version of Apache Tika. The current version seems 
to be 1.22, so would this be in 1.23?

 

Kind regards

Hans

 

From: Tim Allison <[email protected]> 
Sent: 10 October 2019 05:05
To: [email protected]
Cc: <[email protected]> <[email protected]>
Subject: Re: [EXTERNAL] Tika Python questions

 

Thank you for this report!  I just bumped the max record length for a blob by 
10x in POI, which should be released fairly soon.

 

r1868211

 

On Wed, Oct 9, 2019 at 10:20 AM <[email protected]> wrote:

Hi,
This is an "old" Excel spreadsheet, .xls, that is causing it. If you would 
like, I can send that as well.

I hope this gives you what you need from the tika-server stacktrace:
INFO  rmeta/text (autodetecting type)
WARN  Ignoring unexpected exception while parsing summary entry 
DocumentSummaryInformation
org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 
1186956, but 1000000 is the maximum for this record type.
If the file is not corrupt, please open an issue on bugzilla to request
increasing the maximum allowable size for this record type.
As a temporary workaround, consider setting a higher override value with 
IOUtils.setByteArrayMaxOverride()
        at org.apache.poi.util.IOUtils.throwRFE(IOUtils.java:568)
        at org.apache.poi.util.IOUtils.checkLength(IOUtils.java:175)
        at org.apache.poi.util.IOUtils.safelyAllocate(IOUtils.java:547)
        at org.apache.poi.hpsf.Blob.read(Blob.java:33)
        at 
org.apache.poi.hpsf.TypedPropertyValue.readValue(TypedPropertyValue.java:166)
        at org.apache.poi.hpsf.VariantSupport.read(VariantSupport.java:176)
        at org.apache.poi.hpsf.Property.<init>(Property.java:179)
        at org.apache.poi.hpsf.Section.<init>(Section.java:241)
        at org.apache.poi.hpsf.PropertySet.init(PropertySet.java:497)
        at org.apache.poi.hpsf.PropertySet.<init>(PropertySet.java:195)
        at 
org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:83)
        at 
org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:74)
        at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:155)
        at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:131)
        at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at 
org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
        at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
        at 
org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:232)
        at 
org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:422)
        at 
org.apache.tika.server.resource.RecursiveMetadataResource.parseMetadata(RecursiveMetadataResource.java:144)
        at 
org.apache.tika.server.resource.RecursiveMetadataResource.getMetadata(RecursiveMetadataResource.java:121)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at 
org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)
        at 
org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)
        at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:201)
        at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:104)
        at 
org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
        at 
org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
        at 
org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
        at 
org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
        at 
org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)
        at 
org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
        at 
org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
        at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
        at 
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
        at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345)
        at 
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:205)
        at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247)
        at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
        at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
        at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
        at org.eclipse.jetty.server.Server.handle(Server.java:505)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370)
        at 
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
        at 
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
        at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
        at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:698)
        at 
org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:804)
        at java.lang.Thread.run(Thread.java:748)

/Kind regards
Hans
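
For anyone triaging a similar log, the two numbers in the POI message above can be pulled out mechanically. This is only a sketch: the regular expression is written against the exact wording quoted in this thread (other POI versions may phrase it differently), and `parse_record_limit` is a made-up helper name, not part of Tika or POI.

```python
import re

# Pattern written against the message wording quoted in this thread.
# Assumption: the wording is stable; adjust it if your POI version differs.
_LIMIT_RE = re.compile(
    r"Tried to allocate an array of length\s+(\d+),\s+"
    r"but\s+(\d+)\s+is the maximum"
)


def parse_record_limit(log_text):
    """Return (requested, maximum) from a RecordFormatException message, or None."""
    match = _LIMIT_RE.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# The message as it appears in the log, including the line wrap.
message = (
    "org.apache.poi.util.RecordFormatException: Tried to allocate an array of length \n"
    "1186956, but 1000000 is the maximum for this record type."
)
requested, maximum = parse_record_limit(message)

# Would a tenfold increase of the reported maximum cover this file?
covered_by_tenfold = requested <= maximum * 10
print(requested, maximum, covered_by_tenfold)
```

With the numbers from this log, the requested length is about 1.19 times the old maximum, so a tenfold increase would leave ample headroom for this particular file.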

-----Original message-----
From: Tim Allison <[email protected]>
Sent: 9 October 2019 14:04
To: Luís Filipe Nassif <[email protected]>
Cc: <[email protected]> <[email protected]>; [email protected]
Subject: Re: [EXTERNAL] Tika Python questions

Yep, that's why we added those limits.

Hans, if you can send the full stacktrace that will allow me to see what record 
type you're running into this with, we may be able to increase it in POI before 
the next release.
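
To illustrate why a per-record limit protects against OutOfMemoryErrors, here is a minimal sketch of the general pattern: check the length a record declares before allocating a buffer for it. This is not POI's code. The names `RecordLengthError` and `safely_allocate` are hypothetical and only loosely mirror the `IOUtils.safelyAllocate` frame visible in the stack trace.

```python
class RecordLengthError(ValueError):
    """Raised when a record declares a length outside the allowed range."""


def safely_allocate(declared_length, max_length):
    """Allocate a zeroed buffer only if the declared length is within bounds.

    A corrupt or hostile file can declare an enormous record length; refusing
    early avoids trying to allocate that much memory.
    """
    if declared_length < 0:
        raise RecordLengthError(f"Negative record length: {declared_length}")
    if declared_length > max_length:
        raise RecordLengthError(
            f"Tried to allocate an array of length {declared_length}, "
            f"but {max_length} is the maximum for this record type."
        )
    return bytearray(declared_length)


# A small record is fine.
buffer = safely_allocate(512, max_length=1_000_000)

# The record from this thread exceeds the old limit and is rejected.
try:
    safely_allocate(1_186_956, max_length=1_000_000)
    rejected = False
except RecordLengthError as exc:
    rejected = True
    print(exc)
```

The trade-off is the one seen in this thread: a limit that is too low rejects legitimate files, which is why the maximum for a given record type sometimes has to be raised.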

On Tue, Oct 8, 2019 at 2:10 PM Luís Filipe Nassif <[email protected]> wrote:
>
> I think it is not related to file size, but to the maximum record size 
> handled by POI. It is a protection against OutOfMemoryErrors. I 
> increased this limit to 10M because I was seeing many of them. I do not 
> know if it is configurable in Tika Server.
>
> Regards,
> Luis
>
> On Tue, Oct 8, 2019 at 17:46, Chris Mattmann <[email protected]> wrote:
>
> > Hi,
> >
> >
> >
> > Thanks for your question. Yes, the same way you set the byte size 
> > property in Tika-App (I think through parser configuration) is how 
> > you would do it for Tika-Server. You would just start the Tika 
> > Server yourself with a custom config file that sets this property, 
> > running it on the default port (making sure any other instances were 
> > killed first). Then Tika-Python will use your own Tika Server with 
> > the custom config.
> >
> >
> >
> > As for catching errors, it will try its best to do that, but it does 
> > not catch all of them. If you find something it doesn’t catch, let 
> > us know and we will work to fix it.
> >
> >
> >
> > Thanks,
> >
> > Chris
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > From: "[email protected]" <[email protected]>
> > Organization: Avident-IT
> > Date: Tuesday, October 8, 2019 at 6:06 AM
> > To: "Mattmann, Chris A (US 1761)" <[email protected]>
> > Subject: [EXTERNAL] Tika Python questions
> >
> >
> >
> > Hi
> >
> > I have had the pleasure of testing the Tika-python library. I am 
> > testing it out in a new application that is being developed for customers.
> >
> > It has very good performance, especially for parsing XLSX and XLS files.
> >
> >
> >
> > However, I have two questions:
> >
> > 1. The Tika-Server handles only files up to a maximum byte size. I get 
> > this error:
> > org.apache.poi.util.RecordFormatException: Tried to allocate an 
> > array of length 1186956, but 1000000 is the maximum for this record type.
> > If the file is not corrupt, please open an issue on bugzilla to request
> > increasing the maximum allowable size for this record type.
> > As a temporary workaround, consider setting a higher override value 
> > with IOUtils.setByteArrayMaxOverride()
> >
> > I have tried the Tika-App Python (jar file), and it does handle 
> > files larger than 1000000.
> >
> > In the Tika documentation it says to set MaxBytes to -1 to override 
> > and handle larger files.
> >
> > Is there any way to handle this via Tika-Python? To set the max file 
> > size to unlimited, as the “Tika-App” handles it?
> >
> >
> > 2. How is it possible to catch errors via the Tika-python library, for 
> > example if files are encrypted, corrupt, etc.?
> >
> >
> >
> >
> > Kind regards
> >
> >
> >
> > HANS MEIJER
> >
> >
> >
> >
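
Chris's suggestion in this thread (start your own Tika Server, then let Tika-Python talk to it) and the question about catching errors can be sketched together as follows. Treat this as a sketch under assumptions, not as the library's recommended usage: it assumes a Tika Server you started yourself is listening on http://localhost:9998, that the tika-python package is installed, and that the `TIKA_CLIENT_ONLY` and `TIKA_SERVER_ENDPOINT` environment variables behave as described in the tika-python README. The helper names `use_own_server`, `parse_or_explain` and `describe_problem` are made up for illustration.

```python
import os

# Assumption: your own Tika Server (with its custom config) listens here.
OWN_SERVER = "http://localhost:9998"


def use_own_server(endpoint=OWN_SERVER):
    """Ask tika-python to act as a REST client only, not to launch its own jar."""
    # As far as I know these variables are read when the tika package is
    # imported, so call this before the first "import tika".
    os.environ["TIKA_CLIENT_ONLY"] = "True"
    os.environ["TIKA_SERVER_ENDPOINT"] = endpoint


def describe_problem(parsed):
    """Inspect the dict returned by tika-python for signs of a failed parse."""
    status = parsed.get("status")
    if status != 200:
        return f"server answered with HTTP status {status}"
    if not parsed.get("content"):
        return "no text extracted (empty, image-only, or unreadable file?)"
    return None


def parse_or_explain(path):
    """Parse one file and return (parsed, problem); problem is None on success."""
    use_own_server()
    from tika import parser  # imported lazily so the environment is set first

    try:
        parsed = parser.from_file(path)
    except Exception as exc:  # connection failures, unparsable replies, etc.
        return None, f"request failed: {exc}"
    return parsed, describe_problem(parsed)
```

A call such as `parse_or_explain("report.xls")` (hypothetical file name) would then return the parsed dict plus a short description of any problem. Note that the exception in this thread was logged by the server as a WARN and ignored, so a status check like this would not reveal it by itself; the server log remains the place to look for such cases.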
