[ https://issues.apache.org/jira/browse/TIKA-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18010803#comment-18010803 ]
Manish S N commented on TIKA-4459:
----------------------------------

I collected some more OpenDocument and MS Office files from the internet and ran the test again. Results:

run 1:
{code:java}
#$# spooling: true, fileCount: 151, meanTime: 35.35, stdDeviation: 58.17, minTime: 1.0, maxTime: 364.0, medianTime: 11.0
#$# spooling: false, fileCount: 151, meanTime: 38.40, stdDeviation: 68.07, minTime: 1.0, maxTime: 422.0, medianTime: 10.0
{code}
run 2:
{code:java}
#$# spooling: true, fileCount: 151, meanTime: 35.65, stdDeviation: 60.67, minTime: 1.0, maxTime: 411.0, medianTime: 11.0
#$# spooling: false, fileCount: 151, meanTime: 39.63, stdDeviation: 70.73, minTime: 0.0, maxTime: 416.0, medianTime: 11.0
{code}
run 3:
{code:java}
#$# spooling: true, fileCount: 151, meanTime: 35.87, stdDeviation: 58.69, minTime: 1.0, maxTime: 378.0, medianTime: 10.0
#$# spooling: false, fileCount: 151, meanTime: 41.48, stdDeviation: 74.02, minTime: 0.0, maxTime: 440.0, medianTime: 10.0
{code}
The spooling variant has a better mean and maxTime in every run. *_Hence it is inferred that the parser is more efficient with ZipFile than with ZipInputStream._* (It is also the variant that handles errors properly.) So can we change OpenDocumentParser to spool files by default?
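The spooling idea can be sketched with plain java.util.zip. This is an illustrative stand-in, not Tika's internals: the {{spoolToZipFile}} helper and the in-memory zip are hypothetical, and in Tika itself it is calling {{TikaInputStream.getPath()}}/{{getFile()}} that spools the stream to disk so the parser can use random-access {{ZipFile}} instead of {{ZipInputStream}}:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.zip.*;

public class SpoolDemo {

    // Copy an arbitrary InputStream to a temp file so the archive can be
    // read with ZipFile (random access, central directory) instead of
    // ZipInputStream (forward-only streaming). Hypothetical helper name.
    static ZipFile spoolToZipFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("spool-", ".zip");
        tmp.toFile().deleteOnExit();
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return new ZipFile(tmp.toFile());
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny zip in memory to stand in for an ODF container.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry("mimetype"));
            zos.write("application/vnd.oasis.opendocument.text"
                    .getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }

        // Spool, then read the entry back via random access.
        try (ZipFile zf = spoolToZipFile(
                new ByteArrayInputStream(bos.toByteArray()))) {
            ZipEntry e = zf.getEntry("mimetype");
            byte[] body = zf.getInputStream(e).readAllBytes();
            System.out.println(new String(body, StandardCharsets.UTF_8));
        }
    }
}
```

The benefit for this bug is that {{ZipFile}} reads the central directory, so STORED entries with EXT descriptors (which make {{ZipInputStream.getNextEntry}} throw the ZipException quoted below) are not a problem on the spooled path.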
P.S.: As for the SSD write-limit concern, there is [this|https://linustechtips.com/topic/811454-should-i-be-worried-of-ssd-write-limit/] Linus Tech Tips discussion and [this|https://superuser.com/questions/345997/what-happens-when-an-ssd-wears-out] Super User thread; both agree that it is a myth.

> protected ODF encryption detection fail
> ---------------------------------------
>
>                 Key: TIKA-4459
>                 URL: https://issues.apache.org/jira/browse/TIKA-4459
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 3.2.1
>        Environment: Ubuntu 24.04.2 LTS x86_64
>            Reporter: Manish S N
>            Priority: Minor
>              Labels: encryption, odf, open-document-format, protected, regression, zip
>             Fix For: 4.0.0, 3.2.2
>
>         Attachments: protected.odt, testProtected.odp
>
> When passing an InputStream of a protected ODF file to Tika we get a ZipException instead of an EncryptedDocumentException.
> This works and correctly throws EncryptedDocumentException if you create the TikaInputStream with a Path or call TikaInputStream.getPath(), as that spools the stream to a temporary file.
> But when working with InputStreams we get the following ZipException:
>
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.odf.OpenDocumentParser@bae47a0
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:304)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:204)
> 	at org.apache.tika.Tika.parseToString(Tika.java:525)
> 	at org.apache.tika.Tika.parseToString(Tika.java:495)
> 	at org.manish.AttachmentParser.parse(AttachmentParser.java:21)
> 	at org.manish.AttachmentParser.lambda$testParse$1(AttachmentParser.java:72)
> 	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
> 	at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:177)
> 	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
> 	at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
> 	at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
> 	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
> 	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
> 	at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
> 	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
> 	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> 	at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497)
> 	at org.manish.AttachmentParser.testParse(AttachmentParser.java:64)
> 	at org.manish.AttachmentParser.main(AttachmentParser.java:57)
> Caused by: java.util.zip.ZipException: only DEFLATED entries can have EXT descriptor
> 	at java.base/java.util.zip.ZipInputStream.readLOC(ZipInputStream.java:313)
> 	at java.base/java.util.zip.ZipInputStream.getNextEntry(ZipInputStream.java:125)
> 	at org.apache.tika.parser.odf.OpenDocumentParser.handleZipStream(OpenDocumentParser.java:218)
> 	at org.apache.tika.parser.odf.OpenDocumentParser.parse(OpenDocumentParser.java:169)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
> 	... 19 more
>
> (We use Tika to detect encrypted docs)

--
This message was sent by Atlassian Jira
(v8.20.10#820010)