Sorry for the late reply. Once POI is released, we’ll probably roll out 1.23, perhaps in 3-4 weeks?
Fellow devs, WDYT?

On Mon, Oct 14, 2019 at 6:55 AM <[email protected]> wrote:

> Hi,
>
> Sorry to bother you. I do see the commit, but any hints on when it can be
> released?
>
> I assume it will be in a new version of Apache Tika. The current version seems to
> be 1.22, so this would be in 1.23?
>
> Kind regards
>
> Hans
>
> *From:* Tim Allison <[email protected]>
> *Sent:* October 10, 2019 05:05
> *To:* [email protected]
> *Cc:* <[email protected]> <[email protected]>
> *Subject:* Re: [EXTERNAL] Tika Python questions
>
> Thank you for this report! I just bumped the max record length for a blob
> by 10x in POI, which should be released fairly soon.
>
> r1868211
>
> On Wed, Oct 9, 2019 at 10:20 AM <[email protected]> wrote:
>
> Hi,
> This is an "old" Excel spreadsheet, .xls, that is causing it. If you would
> like, I can send that as well.
>
> I hope this gives you what you need from the tika-server stacktrace:
> INFO rmeta/text (autodetecting type)
> WARN Ignoring unexpected exception while parsing summary entry DocumentSummaryInformation
> org.apache.poi.util.RecordFormatException: Tried to allocate an array of
> length 1186956, but 1000000 is the maximum for this record type.
> If the file is not corrupt, please open an issue on bugzilla to request
> increasing the maximum allowable size for this record type.
> As a temporary workaround, consider setting a higher override value with
> IOUtils.setByteArrayMaxOverride()
>     at org.apache.poi.util.IOUtils.throwRFE(IOUtils.java:568)
>     at org.apache.poi.util.IOUtils.checkLength(IOUtils.java:175)
>     at org.apache.poi.util.IOUtils.safelyAllocate(IOUtils.java:547)
>     at org.apache.poi.hpsf.Blob.read(Blob.java:33)
>     at org.apache.poi.hpsf.TypedPropertyValue.readValue(TypedPropertyValue.java:166)
>     at org.apache.poi.hpsf.VariantSupport.read(VariantSupport.java:176)
>     at org.apache.poi.hpsf.Property.<init>(Property.java:179)
>     at org.apache.poi.hpsf.Section.<init>(Section.java:241)
>     at org.apache.poi.hpsf.PropertySet.init(PropertySet.java:497)
>     at org.apache.poi.hpsf.PropertySet.<init>(PropertySet.java:195)
>     at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:83)
>     at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:74)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:155)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:131)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>     at org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:232)
>     at org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:422)
>     at org.apache.tika.server.resource.RecursiveMetadataResource.parseMetadata(RecursiveMetadataResource.java:144)
>     at org.apache.tika.server.resource.RecursiveMetadataResource.getMetadata(RecursiveMetadataResource.java:121)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)
>     at org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)
>     at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:201)
>     at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:104)
>     at org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
>     at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
>     at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
>     at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
>     at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)
>     at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
>     at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
>     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>     at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
>     at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345)
>     at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:205)
>     at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247)
>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
>     at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
>     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>     at org.eclipse.jetty.server.Server.handle(Server.java:505)
>     at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370)
>     at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
>     at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
>     at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
>     at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:698)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:804)
>     at java.lang.Thread.run(Thread.java:748)
> xterm
>
> /Kind regards
> Hans
>
> -----Original Message-----
> From: Tim Allison <[email protected]>
> Sent: October 9, 2019 14:04
> To: Luís Filipe Nassif <[email protected]>
> Cc: <[email protected]> <[email protected]>;
> [email protected]
> Subject: Re: [EXTERNAL] Tika Python questions
>
> Yep, that's why we added those limits.
>
> Hans, if you can send the full stacktrace that will allow me to see what
> record type you're running into this with, we may be able to increase it in
> POI before the next release.
>
> On Tue, Oct 8, 2019 at 2:10 PM Luís Filipe Nassif <[email protected]>
> wrote:
> >
> > I think it is not related to file size, but to the maximum record size
> > handled by POI. It is a protection against OutOfMemoryErrors. I
> > increased this limit to 10M because I was seeing many of them. I do not
> > know if it is configurable in tika server.
> >
> > Regards,
> > Luis
> >
> > On Tue, Oct 8, 2019 at 17:46, Chris Mattmann <[email protected]>
> > wrote:
> >
> > > Hi,
> > >
> > > Thanks for your question. Yes, the same way you set the byte size
> > > property in Tika-App (I think through parser configuration) is how
> > > you would do it for Tika-Server. You would just start the Tika
> > > Server yourself with a custom config file that sets this property and
> > > then start it on the default port (making sure any other ones were
> > > killed first). Then Tika-Python will use your own Tika Server with
> > > custom config.
> > >
> > > As for catching errors, it will try its best to do that, but it does
> > > not catch all of them, and if you find something it doesn’t catch, let
> > > us know and we will work to fix it.
> > >
> > > Thanks,
> > >
> > > Chris
> > >
> > > From: "[email protected]" <[email protected]>
> > > Organization: Avident-IT
> > > Date: Tuesday, October 8, 2019 at 6:06 AM
> > > To: "Mattmann, Chris A (US 1761)" <[email protected]>
> > > Subject: [EXTERNAL] Tika Python questions
> > >
> > > Hi
> > >
> > > I have had the pleasure of testing the Tika-python library. I am
> > > testing it out in a new application that is being developed for customers.
> > >
> > > It has very good performance, especially for parsing XLSX and XLS files.
> > >
> > > However, I have two questions:
> > > The Tika-Server handles only files with a maximum byte size. I get
> > > this error:
> > > org.apache.poi.util.RecordFormatException: Tried to allocate an
> > > array of length 1186956, but 1000000 is the maximum for this record type.
> > > If the file is not corrupt, please open an issue on bugzilla to request
> > > increasing the maximum allowable size for this record type.
> > >
> > > As a temporary workaround, consider setting a higher override value
> > > with IOUtils.setByteArrayMaxOverride()
> > >
> > > I have tried the Tika-App (jar file) from Python and it does handle
> > > files that are larger than 1000000.
> > >
> > > In the Tika documentation it says to set MaxBytes to -1 to override
> > > and handle larger files.
> > >
> > > Is there any way to handle this via Tika-Python? To set max file
> > > size to unlimited as the “Tika-App” handles it?
> > >
> > > How is it possible to catch errors via the Tika-python library, for example
> > > if files are encrypted, corrupt, etc.?
> > >
> > > Kind regards
> > >
> > > HANS MEIJER
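
On the question of catching errors from Tika-Python, here is a minimal sketch of one way to check a parse result for reported problems. It assumes that the result is a dict with a "metadata" entry (as tika-python returns) and that Tika reports exceptions under metadata keys starting with "X-TIKA:EXCEPTION:". Whether a given failure, such as the summary-entry warning in this thread, actually appears there depends on the Tika version, so treat the key prefix as an assumption to verify. The helper names find_parse_problems and record_limit_exceeded are made up for this illustration.

```python
import re

# Assumption: Tika reports parse problems under metadata keys with this prefix
# (for example "X-TIKA:EXCEPTION:warn"). Check the keys your version emits.
EXCEPTION_PREFIX = "X-TIKA:EXCEPTION:"

# Matches the POI record-size message quoted in this thread.
LIMIT_RE = re.compile(
    r"array of\s+length\s+([\d,]+),\s+but\s+([\d,]+)\s+is the maximum"
)


def find_parse_problems(parsed):
    """Return (key, message) pairs for exception entries in a result dict.

    `parsed` is expected to look like a tika-python result:
    {"status": ..., "metadata": {...}, "content": ...}.
    """
    metadata = (parsed or {}).get("metadata") or {}
    problems = []
    for key, value in metadata.items():
        if key.startswith(EXCEPTION_PREFIX):
            # A key may hold one string or a list of strings.
            values = value if isinstance(value, list) else [value]
            problems.extend((key, str(v)) for v in values)
    return problems


def record_limit_exceeded(message):
    """Return (requested, maximum) for POI's record-size error, else None."""
    match = LIMIT_RE.search(message)
    if not match:
        return None
    requested, maximum = (int(g.replace(",", "")) for g in match.groups())
    return requested, maximum


# Usage sketch (needs the `tika` package and a running server, so it is left
# commented out; "report.xls" and the endpoint URL are placeholders):
#
#   from tika import parser
#   parsed = parser.from_file("report.xls", serverEndpoint="http://localhost:9998")
#   for key, message in find_parse_problems(parsed):
#       print(key, record_limit_exceeded(message) or message)
```

Pointing serverEndpoint at a server you started yourself is how the custom-config approach Chris describes would be wired in from the Python side.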
