Sorry for the late reply. Once POI is released, we'll probably roll out
Tika 1.23, likely in 3-4 weeks.

Fellow devs, WDYT?

On Mon, Oct 14, 2019 at 6:55 AM <[email protected]> wrote:

> Hi,
>
> Sorry to bother you. I do see the commit, but are there any hints on when
> it can be released?
>
> I assume it will be a new version of Apache Tika, current version seems to
> be 1.22, so this would be in 1.23?
>
>
>
> Kind regards
>
> Hans
>
>
>
> *From:* Tim Allison <[email protected]>
> *Sent:* October 10, 2019 05:05
> *To:* [email protected]
> *Cc:* <[email protected]> <[email protected]>
> *Subject:* Re: [EXTERNAL] Tika Python questions
>
>
>
> Thank you for this report!  I just bumped the max record length for a blob
> by 10x in POI, which should be released fairly soon.
>
>
>
> r1868211
>
>
>
> On Wed, Oct 9, 2019 at 10:20 AM <[email protected]> wrote:
>
> Hi,
> This is an "old" Excel spreadsheet (.xls) that is causing it. If you would
> like, I can send it as well.
>
> I hope this gives you what you need from the tika-server stacktrace:
> INFO  rmeta/text (autodetecting type)
> WARN  Ignoring unexpected exception while parsing summary entry
> DocumentSummaryInformation
> org.apache.poi.util.RecordFormatException: Tried to allocate an array of
> length 1186956, but 1000000 is the maximum for this record type.
> If the file is not corrupt, please open an issue on bugzilla to request
> increasing the maximum allowable size for this record type.
> As a temporary workaround, consider setting a higher override value with
> IOUtils.setByteArrayMaxOverride()
>         at org.apache.poi.util.IOUtils.throwRFE(IOUtils.java:568)
>         at org.apache.poi.util.IOUtils.checkLength(IOUtils.java:175)
>         at org.apache.poi.util.IOUtils.safelyAllocate(IOUtils.java:547)
>         at org.apache.poi.hpsf.Blob.read(Blob.java:33)
>         at
> org.apache.poi.hpsf.TypedPropertyValue.readValue(TypedPropertyValue.java:166)
>         at org.apache.poi.hpsf.VariantSupport.read(VariantSupport.java:176)
>         at org.apache.poi.hpsf.Property.<init>(Property.java:179)
>         at org.apache.poi.hpsf.Section.<init>(Section.java:241)
>         at org.apache.poi.hpsf.PropertySet.init(PropertySet.java:497)
>         at org.apache.poi.hpsf.PropertySet.<init>(PropertySet.java:195)
>         at
> org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:83)
>         at
> org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:74)
>         at
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:155)
>         at
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:131)
>         at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>         at
> org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
>         at
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>         at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>         at
> org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:232)
>         at
> org.apache.tika.server.resource.TikaResource.parse(TikaResource.java:422)
>         at
> org.apache.tika.server.resource.RecursiveMetadataResource.parseMetadata(RecursiveMetadataResource.java:144)
>         at
> org.apache.tika.server.resource.RecursiveMetadataResource.getMetadata(RecursiveMetadataResource.java:121)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at
> org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:179)
>         at
> org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)
>         at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:201)
>         at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:104)
>         at
> org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:59)
>         at
> org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
>         at
> org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
>         at
> org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
>         at
> org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)
>         at
> org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
>         at
> org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
>         at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>         at
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
>         at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345)
>         at
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:205)
>         at
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247)
>         at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
>         at
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
>         at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>         at org.eclipse.jetty.server.Server.handle(Server.java:505)
>         at
> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370)
>         at
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
>         at
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
>         at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
>         at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
>         at
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:698)
>         at
> org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:804)
>         at java.lang.Thread.run(Thread.java:748)
>
> /Kind regards
> Hans
>
> -----Original Message-----
> From: Tim Allison <[email protected]>
> Sent: October 9, 2019 14:04
> To: Luís Filipe Nassif <[email protected]>
> Cc: <[email protected]> <[email protected]>;
> [email protected]
> Subject: Re: [EXTERNAL] Tika Python questions
>
> Yep, that's why we added those limits.
>
> Hans, if you can send the full stacktrace that will allow me to see what
> record type you're running into this with, we may be able to increase it in
> POI before the next release.
>
> On Tue, Oct 8, 2019 at 2:10 PM Luís Filipe Nassif <[email protected]>
> wrote:
> >
> > I think it is not related to file size, but to the maximum record size
> > handled by POI. It is a protection against OutOfMemoryErrors. I
> > increased this limit to 10M because I was seeing many of them. I do not
> > know if it is configurable in tika server.
> >
> > Regards,
> > Luis
> >
> > On Tue, Oct 8, 2019 at 17:46, Chris Mattmann <[email protected]>
> > wrote:
> >
> > > Hi,
> > >
> > >
> > >
> > > Thanks for your question. Yes, the same way you set the byte size
> > > property in Tika-App (I think through parser configuration) is how
> > > you would do it for Tika-Server. You would just start the Tika
> > > > Server yourself with a custom config file that sets this property and
> > > then start it on the default port (making sure any other ones were
> > > killed first). Then Tika-Python will use your own Tika Server with
> > > custom config.
> > >
> > >
> > >
> > > As for catching errors, it will try its best to do that, but it does
> > > not catch all of them and if you find something it doesn’t catch let
> > > us know and we will work to fix it.
> > >
> > >
> > >
> > > Thanks,
> > >
> > > Chris
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > From: "[email protected]" <[email protected]>
> > > Organization: Avident-IT
> > > Date: Tuesday, October 8, 2019 at 6:06 AM
> > > To: "Mattmann, Chris A (US 1761)" <[email protected]>
> > > Subject: [EXTERNAL] Tika Python questions
> > >
> > >
> > >
> > > Hi
> > >
> > > > I have had the pleasure of testing the Tika-python library. I am
> > > > testing it out in a new application that is being developed for
> > > > customers.
> > >
> > > > It has very good performance, especially for parsing XLSX and XLS
> > > > files.
> > >
> > >
> > >
> > > However, I have two questions:
> > > > The Tika-Server only handles files up to a maximum byte size. I get
> > > > this error:
> > > > org.apache.poi.util.RecordFormatException: Tried to allocate an
> > > > array of length 1186956, but 1000000 is the maximum for this record
> > > > type.
> > > > If the file is not corrupt, please open an issue on bugzilla to request
> > > > increasing the maximum allowable size for this record type.
> > >
> > > As a temporary workaround, consider setting a higher override value
> > > with
> > > IOUtils.setByteArrayMaxOverride()
> > >
> > > I have tried the Tika-App python (jar file) and it does handle the
> > > file size where files are larger than 1000000.
> > >
> > > In the Tika documentation it says to set MaxBytes to -1 to override
> > > and handle larger files.
> > >
> > > > Is there any way to handle this via Tika-Python, i.e. to set the
> > > > maximum file size to unlimited as the “Tika-App” handles it?
> > >
> > >
> > > How is it possible to catch errors via the Tika-python library, like
> > > if files are encrypted, corrupt etc.?
> > >
> > >
> > >
> > >
> > > Kind regards
> > >
> > >
> > >
> > > HANS MEIJER
> > >
> > >
> > >
> > >
>
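Chris's suggestion earlier in the thread (start your own Tika Server with a custom config on the default port, then have the client talk to it) can be sketched from the client side using only the Python standard library. This is a minimal sketch, not Tika-Python's own implementation: the function names `build_rmeta_request` and `parse_file` are hypothetical, and it assumes a server you started yourself is listening on port 9998 and exposes the `rmeta/text` resource seen in the log above.

```python
import json
import urllib.request

# Assumption: a Tika Server you started yourself (with your custom config)
# is listening on the default port on this machine.
TIKA_ENDPOINT = "http://localhost:9998"


def build_rmeta_request(data: bytes, endpoint: str = TIKA_ENDPOINT) -> urllib.request.Request:
    """Build a PUT request for the server's rmeta/text resource (hypothetical helper)."""
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/rmeta/text",
        data=data,
        method="PUT",
        headers={"Accept": "application/json"},
    )


def parse_file(path: str, endpoint: str = TIKA_ENDPOINT) -> list:
    """Send a file to the server and return the decoded JSON list of metadata objects."""
    with open(path, "rb") as handle:
        request = build_rmeta_request(handle.read(), endpoint)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `parse_file("workbook.xls")` (a placeholder file name) only works while such a server is running. Tika-Python itself can be pointed at an existing server as well; its README describes a `TIKA_SERVER_ENDPOINT` environment variable for that, which is worth checking against the version you have installed.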
>
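On the second question in the thread (catching encrypted or corrupt files), one possible approach when talking to the server directly is to turn HTTP error statuses into Python exceptions. The sketch below is an assumption-laden illustration: `TikaParseError`, `describe_status` and `put_bytes` are hypothetical names, and the status meanings (422 for unprocessable input such as encrypted documents, 5xx for parse failures) reflect my reading of the Tika server documentation and should be verified against your server version.

```python
import urllib.error
import urllib.request


class TikaParseError(Exception):
    """Hypothetical exception type for this sketch; not part of Tika-Python."""

    def __init__(self, status: int, reason: str):
        super().__init__(f"Tika server returned {status}: {reason}")
        self.status = status


def describe_status(status: int) -> str:
    """Map an HTTP status to a short description (assumed meanings, verify locally)."""
    if status == 200:
        return "ok"
    if status == 204:
        return "no content"
    if status == 422:
        return "unprocessable (for example unsupported type or encrypted document)"
    if status >= 500:
        return "server-side parse error"
    return "unexpected status"


def put_bytes(url: str, data: bytes) -> bytes:
    """PUT raw bytes to the server; raise TikaParseError on an HTTP error status."""
    request = urllib.request.Request(url, data=data, method="PUT")
    try:
        with urllib.request.urlopen(request) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        raise TikaParseError(err.code, describe_status(err.code)) from err
```

Note that the RecordFormatException quoted above was logged as a WARN and ignored while parsing the summary entry, so a failure of that kind would probably not surface as an error status; only the server log would show it.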
