[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files
    [ https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16086436#comment-16086436 ]

Tim Allison commented on TIKA-2428:
-----------------------------------

https://bz.apache.org/bugzilla/show_bug.cgi?id=61295

I suspect quite a few more will come out of the woodwork... See TIKA-2430.

> EMFParser loops forever with corrupted files
> --------------------------------------------
>
>                 Key: TIKA-2428
>                 URL: https://issues.apache.org/jira/browse/TIKA-2428
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.15, 1.16
>            Reporter: Luis Filipe Nassif
>         Attachments: Carved-1285676.emf, Carved-1296288.emf, Carved-912866.emf
>
>
> EMFParser hangs with the attached corrupted EMF files.
> Sorry [~talli...@apache.org]! Just now having time to test against our forensic test corpus...

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files
    [ https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16086318#comment-16086318 ]

Luis Filipe Nassif commented on TIKA-2428:
------------------------------------------

That would be very nice!
[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files
    [ https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16086045#comment-16086045 ]

Tim Allison commented on TIKA-2428:
-----------------------------------

bq. Our algorithm for recovering deleted files often recovers corrupted ones (partially overwritten by OS)

I was toying with the notion of a "smoke-test" level set of tests that would randomly permute some of the bytes within our test files to see if we could trigger this kind of thing. Sounds like you have a use case for us to do this...
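The byte-permuting smoke test floated above could look roughly like the following. This is only a minimal sketch, not anything that exists in Tika or POI: the class name `ByteCorruptor`, the method `corrupt`, and its parameters are hypothetical, and it overwrites randomly chosen positions with random values rather than permuting bytes in any more structured way.

```java
import java.util.Random;

// Hypothetical helper for a "smoke-test" corpus: produces a damaged copy of a
// known-good test file so a parser can be exercised against corrupted input.
final class ByteCorruptor {
    private ByteCorruptor() {}

    /**
     * Returns a copy of original in which up to numBytes randomly chosen
     * positions have been overwritten with random values. The input array is
     * left untouched. A fixed seed makes a failing case reproducible.
     */
    static byte[] corrupt(byte[] original, int numBytes, long seed) {
        byte[] copy = original.clone();
        if (copy.length == 0) {
            return copy; // nothing to corrupt
        }
        Random random = new Random(seed);
        for (int i = 0; i < numBytes; i++) {
            int position = random.nextInt(copy.length);
            copy[position] = (byte) random.nextInt(256);
        }
        return copy;
    }
}
```

The corrupted bytes would then be handed to the parser under test, ideally in a separate thread with a timeout, so that a hang like the one in this issue shows up as a test failure instead of a stuck build.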
[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files
    [ https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085979#comment-16085979 ]

Tim Allison commented on TIKA-2428:
-----------------------------------

Sorry. I misunderstood. Right. That's my belief.
[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files
    [ https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085965#comment-16085965 ]

Luis Filipe Nassif commented on TIKA-2428:
------------------------------------------

bq. If bytes skipped is more than requested, we've hit EOF. If bytes skipped == 0, we need to test with a read, according to guava

Let me clarify my comment. I mean that if a skip of 20,000 bytes is requested on a file with only 10,000 bytes, skip() can return more than 10,000 (as reproduced by your test), but no more than 20,000.
[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files
    [ https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085881#comment-16085881 ]

Tim Allison commented on TIKA-2428:
-----------------------------------

bq. Maybe there is an issue with IOUtils.skipFully()

Yes, completely. We need to defend against FileInputStream's potentially incorrect allegations, and we need to defend against an InputStream returning 0, which can mean either that it hit the end of the InputStream _or_ that it just didn't skip anything for this particular call. So, yes, my implementation of skipFully in POI is at fault here, and I need to fix it.
[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files
    [ https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085877#comment-16085877 ]

Tim Allison commented on TIKA-2428:
-----------------------------------

bq. I don't think the javadocs allow that.

I think the javadocs warn about this for FileInputStream with the following, but I think the implementation is, um, less than ideal and in conflict with the behavior we'd expect from the javadocs for InputStream.

bq. number of bytes skipped may include some number of bytes that were beyond the EOF of the backing file

This test passes for me:
{noformat}
@Test
public void testFalseAllegationFromFileInputStream() throws IOException {
    // Create a temp file that contains exactly one byte.
    File tmp = File.createTempFile("poi", "");
    FileOutputStream fos = new FileOutputStream(tmp);
    for (int i = 0; i < 1; i++) {
        fos.write(2);
    }
    fos.flush();
    fos.close();
    assertEquals(1, tmp.length());

    // skip(2) reports 2 bytes skipped even though the file holds only 1.
    InputStream is = new FileInputStream(tmp);
    assertEquals(2, is.skip(2));
    is.close();
    tmp.delete();
}
{noformat}
[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files
    [ https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085863#comment-16085863 ]

Tim Allison commented on TIKA-2428:
-----------------------------------

https://bz.apache.org/bugzilla/show_bug.cgi?id=61294
[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files
    [ https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085852#comment-16085852 ]

Luis Filipe Nassif commented on TIKA-2428:
------------------------------------------

Strange, I don't think the javadocs allow that. Maybe there is an issue with IOUtils.skipFully() or TikaInputStream.skip()?

Those EMF files are deleted files recovered from one of our test images. Our algorithm for recovering deleted files often recovers corrupted ones (partially overwritten by the OS), so we have a lot of them!
[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files
    [ https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085742#comment-16085742 ]

Tim Allison commented on TIKA-2428:
-----------------------------------

bq. But I understood it can skip more than are remaining in the source, but no more than was requested, right?

In one of the attached files, the first bad loop has {{requested}} == {{got}} (both 4,294,902,047). In the second time in the loop (and every one thereafter), {{requested}} == 4,294,902,047, {{got}} == 4,294,967,296.
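As a purely arithmetic aside on the two values reported above: both are larger than Integer.MAX_VALUE, the second is exactly 2^32, and the two differ by 65,249. The sketch below only restates those numbers; the class and constant names are hypothetical, and the thread does not say where the values originate or how POI stores them, so nothing here should be read as a diagnosis.

```java
// Hypothetical holder for the two values quoted in the comment above.
final class SkipValues {
    private SkipValues() {}

    static final long REQUESTED = 4_294_902_047L; // value of "requested" in the thread
    static final long GOT = 4_294_967_296L;       // value of "got" from the second loop on

    /** True when the value cannot be represented in a signed 32-bit int. */
    static boolean exceedsInt(long value) {
        return value > Integer.MAX_VALUE;
    }

    /** What the value becomes if it is narrowed to a signed 32-bit int. */
    static int narrowed(long value) {
        return (int) value;
    }
}
```

Narrowing `REQUESTED` to an int yields -65249 and narrowing `GOT` yields 0, so any code that tracks these counts in an `int` would be worth a second look; whether that is what happens here is an assumption, not something the thread confirms.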
[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files
    [ https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085730#comment-16085730 ]

Tim Allison commented on TIKA-2428:
-----------------------------------

I wonder why I didn't see this in our common crawl/govdocs1 corpus? When you process EMF, are those literally carved as standalone files or are they part of carved doc/ppt/xls?
[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files
    [ https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085665#comment-16085665 ]

Luis Filipe Nassif commented on TIKA-2428:
------------------------------------------

I just posted the stack trace; you found the cause. But I understood that skip() can skip more bytes than remain in the source, but no more than were requested, right?
[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files
    [ https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085640#comment-16085640 ]

Tim Allison commented on TIKA-2428:
-----------------------------------

Thank you, [~lfcnassif], for reporting this and finding the cause.

From the Javadocs for FileInputStream:
{noformat}
This method may skip more bytes than are remaining in the backing file. This produces no exception and the number of bytes skipped may include some number of bytes that were beyond the EOF of the backing file. Attempting to read from the stream after skipping past the end will result in -1 indicating the end of the file.
{noformat}

From the Javadocs for InputStream:
{noformat}
The skip method may, for a variety of reasons, end up skipping over some smaller number of bytes, possibly 0. This may result from any of a number of conditions; reaching end of file before n bytes have been skipped is only one possibility. The actual number of bytes skipped is returned.
{noformat}

If bytes skipped is more than requested, we've hit EOF. If bytes skipped == 0, we need to test with a read, according to [guava|https://github.com/google/guava/blob/master/guava/src/com/google/common/io/ByteStreams.java#L779].
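The two rules above (distrust an out-of-range skip() result, and confirm a zero result with a read) can be sketched as follows. This is a minimal illustration under stated assumptions, not the fix that went into POI and not Guava's implementation: the class name `SkipSketch` and its `skipFully` signature are hypothetical. It additionally asks skip() for all but the last byte, so that the final position is always confirmed by a real read() even when a stream reports bytes beyond EOF as skipped.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch of a skip loop that always terminates.
final class SkipSketch {
    private SkipSketch() {}

    /**
     * Skips exactly n bytes or throws EOFException if the stream ends first.
     * Every iteration either makes progress or throws, so it cannot spin.
     */
    static void skipFully(InputStream in, long n) throws IOException {
        if (n < 0) {
            throw new IllegalArgumentException("n must be >= 0, got " + n);
        }
        long remaining = n;
        while (remaining > 0) {
            // Leave the last byte for read(), which reports EOF reliably.
            long request = remaining - 1;
            long got = request > 0 ? in.skip(request) : 0;
            if (got > 0 && got <= request) {
                remaining -= got; // plausible result: accept it
                continue;
            }
            // skip() returned 0, a negative value, or more than requested:
            // probe with a single-byte read to tell EOF from "no progress".
            if (in.read() == -1) {
                throw new EOFException(
                        "stream ended with " + remaining + " bytes left to skip");
            }
            remaining--;
        }
    }
}
```

The trade-off is that a stream whose skip() always returns 0 is consumed one byte at a time, which is slow but bounded; a production version would likely read into a scratch buffer instead.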
[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files
    [ https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085128#comment-16085128 ]

Luis Filipe Nassif commented on TIKA-2428:
------------------------------------------

Seems like the issue is at POI level. Threads are stuck at:
{code}
java.lang.Thread.State: RUNNABLE
	at java.io.FileInputStream.skip(Native Method)
	at java.io.BufferedInputStream.skip(Unknown Source)
	- locked <0x000717f30ac0> (a java.io.BufferedInputStream)
	at org.apache.tika.io.ProxyInputStream.skip(ProxyInputStream.java:117)
	at org.apache.tika.io.TikaInputStream.skip(TikaInputStream.java:655)
	at java.io.FilterInputStream.skip(Unknown Source)
	at org.apache.poi.util.IOUtils.skipFully(IOUtils.java:364)
	at org.apache.poi.hemf.record.UnimplementedHemfRecord.init(UnimplementedHemfRecord.java:43)
	at org.apache.poi.hemf.extractor.HemfExtractor$HemfRecordIterator._next(HemfExtractor.java:101)
	at org.apache.poi.hemf.extractor.HemfExtractor$HemfRecordIterator.next(HemfExtractor.java:77)
	at org.apache.poi.hemf.extractor.HemfExtractor$HemfRecordIterator.next(HemfExtractor.java:60)
	at org.apache.tika.parser.microsoft.EMFParser.parse(EMFParser.java:82)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
	at dpf.sp.gpinf.indexer.parsers.IndexerDefaultParser.parse(IndexerDefaultParser.java:150)
	at dpf.sp.gpinf.indexer.io.ParsingReader$ParsingTask.run(ParsingReader.java:263)
	at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
	at java.util.concurrent.FutureTask.run(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Unknown Source)
{code}