[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files

2017-07-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16086436#comment-16086436
 ] 

Tim Allison commented on TIKA-2428:
---

https://bz.apache.org/bugzilla/show_bug.cgi?id=61295  

I suspect quite a few more will come out of the woodwork...

See TIKA-2430.

> EMFParser loops forever with corrupted files
> 
>
> Key: TIKA-2428
> URL: https://issues.apache.org/jira/browse/TIKA-2428
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.15, 1.16
> Reporter: Luis Filipe Nassif
> Attachments: Carved-1285676.emf, Carved-1296288.emf, Carved-912866.emf
>
>
> EMFParser hangs with the attached corrupted EMF files.
> Sorry [~talli...@apache.org]! Just now having time to test against our 
> forensic test corpus...



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files

2017-07-13 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16086318#comment-16086318
 ] 

Luis Filipe Nassif commented on TIKA-2428:
--

That would be very nice!



[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files

2017-07-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16086045#comment-16086045
 ] 

Tim Allison commented on TIKA-2428:
---

bq. Our algorithm for recovering deleted files often recovers corrupted ones 
(partially overwritten by OS)

I was toying with the notion of a "smoke-test" level set of tests that would 
randomly permute some of the bytes within our test files to see if we could 
trigger this kind of thing.  Sounds like you have a use case for us to do 
this...
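
A minimal sketch of what such a smoke test could look like, assuming the parser under test can be wrapped in a plain Runnable. The class name ByteFuzzSketch and the methods corrupt and finishesWithin are hypothetical and not part of Tika or POI. The sketch overwrites a few randomly chosen bytes in a copy of a test file and then checks that the work finishes within a time limit, so a hang is reported instead of blocking the build.

```java
import java.util.Random;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Tika or POI.
final class ByteFuzzSketch {
    private ByteFuzzSketch() {}

    /** Returns a copy of original with up to `mutations` bytes overwritten at random positions. */
    static byte[] corrupt(byte[] original, int mutations, long seed) {
        byte[] copy = original.clone();
        if (copy.length == 0) {
            return copy;
        }
        Random rnd = new Random(seed); // fixed seed keeps a failing case reproducible
        for (int i = 0; i < mutations; i++) {
            copy[rnd.nextInt(copy.length)] = (byte) rnd.nextInt(256);
        }
        return copy;
    }

    /** Returns true if the task ends (normally or with an exception) in time, false if it hangs. */
    static boolean finishesWithin(Runnable task, long timeoutMillis) throws InterruptedException {
        // Daemon thread: a task that never returns must not keep the JVM alive.
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "fuzz-worker");
            t.setDaemon(true);
            return t;
        });
        Future<?> future = pool.submit(task);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (ExecutionException e) {
            return true; // an exception on corrupt input is acceptable; a hang is not
        } catch (TimeoutException e) {
            future.cancel(true);
            return false;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

A test would call corrupt on each file in the test corpus with a handful of seeds, hand the result to the parser inside the Runnable, and fail whenever finishesWithin returns false.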



[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files

2017-07-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085979#comment-16085979
 ] 

Tim Allison commented on TIKA-2428:
---

Sorry.  I misunderstood.  Right.  That's my belief.



[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files

2017-07-13 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085965#comment-16085965
 ] 

Luis Filipe Nassif commented on TIKA-2428:
--

bq. If bytes skipped is more than requested, we've hit EOF. If bytes skipped == 
0, we need to test with a read, according to guava

Let me clarify my comment: I mean that if 20,000 bytes are requested to be 
skipped in a file with 10,000 bytes, skip() can return more than 10,000 
(reproduced by your test), but no more than 20,000.



[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files

2017-07-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085881#comment-16085881
 ] 

Tim Allison commented on TIKA-2428:
---

bq. Maybe there is an issue with IOUtils.skipFully()

Y, completely.  We need to defend against FileInputStream's potentially 
incorrect allegations, and we need to defend against an InputStream returning 
0, which can mean either that it hit the end of the InputStream _or_ it just 
didn't skip anything for this particular call.

So, y, my implementation of skipFully in POI is at fault here, and I need to 
fix it.
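
One way to sidestep both problems is not to trust skip() at all and to consume the bytes with read() instead. The following is only a sketch under that assumption, not the fix that went into POI; the class name ReadSkipSketch and the method skipByReading are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not the POI implementation.
final class ReadSkipSketch {
    private ReadSkipSketch() {}

    /**
     * Discards up to toSkip bytes by reading them into a scratch buffer.
     * Returns the number of bytes actually discarded, which is smaller than
     * toSkip only when the end of the stream was reached first.
     */
    static long skipByReading(InputStream in, long toSkip) throws IOException {
        byte[] scratch = new byte[8192];
        long total = 0;
        while (total < toSkip) {
            int want = (int) Math.min(scratch.length, toSkip - total);
            // With want > 0, read() blocks until it has at least one byte or
            // returns -1 at EOF, so it cannot report "0, try again".
            int got = in.read(scratch, 0, want);
            if (got == -1) {
                break; // genuine end of stream
            }
            total += got;
        }
        return total;
    }
}
```

The trade-off is speed: reading is slower than seeking for large, legitimate skips, but a request for roughly 4 GB on a small corrupted file ends as soon as read() returns -1.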



[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files

2017-07-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085877#comment-16085877
 ] 

Tim Allison commented on TIKA-2428:
---

bq. I don't think the javadocs allow that. 

I think the javadocs warn about this for FileInputStream with the following, 
but I think the implementation is, um, less than ideal and in conflict with the 
behavior we'd expect from the javadocs for InputStream.

bq. number of bytes skipped may include some number of bytes that were beyond 
the EOF of the backing file

This test passes for me:
{noformat}
@Test
public void testFalseAllegationFromFileInputStream() throws IOException {
    // Create a temp file that is exactly one byte long.
    File tmp = File.createTempFile("poi", "");
    FileOutputStream fos = new FileOutputStream(tmp);
    for (int i = 0; i < 1; i++) {
        fos.write(2);
    }
    fos.flush();
    fos.close();
    assertEquals(1, tmp.length());

    // skip() reports 2 bytes skipped although the file holds only 1 byte.
    InputStream is = new FileInputStream(tmp);
    assertEquals(2, is.skip(2));
    is.close();
    tmp.delete();
}
{noformat}



[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files

2017-07-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085863#comment-16085863
 ] 

Tim Allison commented on TIKA-2428:
---

https://bz.apache.org/bugzilla/show_bug.cgi?id=61294



[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files

2017-07-13 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085852#comment-16085852
 ] 

Luis Filipe Nassif commented on TIKA-2428:
--

Strange, I don't think the javadocs allow that. Maybe there is an issue with 
IOUtils.skipFully() or TikaInputStream.skip()?

Those EMF files are deleted files recovered from one of our test images. Our 
algorithm for recovering deleted files often recovers corrupted ones (partially 
overwritten by the OS), so we have a lot of them!



[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files

2017-07-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085742#comment-16085742
 ] 

Tim Allison commented on TIKA-2428:
---

bq. But I understood it can skip more than are remaining in the source, but no 
more than was requested, right?

In one of the attached files, the first bad iteration of the loop has 
{{requested}} == {{got}} (both 4,294,902,047).  On the second iteration (and 
every one thereafter), {{requested}} == 4,294,902,047 and {{got}} == 
4,294,967,296.



[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files

2017-07-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085730#comment-16085730
 ] 

Tim Allison commented on TIKA-2428:
---

I wonder why I didn't see this in our common crawl/govdocs1 corpus?  When you 
process EMF, are those literally carved as standalone files or are they part of 
carved doc/ppt/xls?



[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files

2017-07-13 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085665#comment-16085665
 ] 

Luis Filipe Nassif commented on TIKA-2428:
--

I only posted the stack trace; you found the cause.

But I understood it can skip more than are remaining in the source, but no more 
than was requested, right?



[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files

2017-07-13 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085640#comment-16085640
 ] 

Tim Allison commented on TIKA-2428:
---

Thank you, [~lfcnassif], for reporting this and finding the cause.

From the Javadocs for FileInputStream:

{noformat}
This method may skip more bytes than are remaining in the backing file. This 
produces no exception and the number of bytes skipped may include some number 
of bytes that were beyond the EOF of the backing file. Attempting to read from 
the stream after skipping past the end will result in -1 indicating the end of 
the file.
{noformat}

From the Javadocs for InputStream:
{noformat}
The skip method may, for a variety of reasons, end up skipping over some 
smaller number of bytes, possibly 0. This may result from any of a number of 
conditions; reaching end of file before n bytes have been skipped is only one 
possibility. The actual number of bytes skipped is returned.
{noformat}

If bytes skipped is more than requested, we've hit EOF.  If bytes skipped == 0, 
we need to test with a read, according to 
[guava|https://github.com/google/guava/blob/master/guava/src/com/google/common/io/ByteStreams.java#L779]
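
The two rules above can be sketched roughly as follows. This is an illustration loosely modelled on the idea of probing with a read when skip() returns 0; it is neither Guava's code nor the POI fix, and the class name SkipSketch is hypothetical.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper illustrating the two checks; not Guava's or POI's code.
final class SkipSketch {
    private SkipSketch() {}

    /** Skips exactly n bytes or throws EOFException if the stream ends first. */
    static void skipFully(InputStream in, long n) throws IOException {
        long remaining = n;
        while (remaining > 0) {
            long got = in.skip(remaining);
            if (got < 0) {
                throw new IOException("skip() returned a negative count: " + got);
            }
            if (got > remaining) {
                // More bytes "skipped" than requested: treat it as having hit EOF.
                throw new EOFException("skip() overshot the request; assuming EOF");
            }
            if (got == 0) {
                // 0 is ambiguous: EOF, or simply nothing skipped on this call.
                // A single-byte read tells the two cases apart.
                if (in.read() == -1) {
                    throw new EOFException("reached EOF with " + remaining + " bytes left to skip");
                }
                got = 1; // the probe consumed one byte
            }
            remaining -= got;
        }
    }
}
```

Note that a FileInputStream that reports exactly the requested count while already past EOF would not be caught by these checks; the caller would only notice on the next read, which returns -1.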



[jira] [Commented] (TIKA-2428) EMFParser loops forever with corrupted files

2017-07-12 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085128#comment-16085128
 ] 

Luis Filipe Nassif commented on TIKA-2428:
--

Seems like the issue is at POI level. Threads are stuck at:
{code}
java.lang.Thread.State: RUNNABLE
    at java.io.FileInputStream.skip(Native Method)
    at java.io.BufferedInputStream.skip(Unknown Source)
    - locked <0x000717f30ac0> (a java.io.BufferedInputStream)
    at org.apache.tika.io.ProxyInputStream.skip(ProxyInputStream.java:117)
    at org.apache.tika.io.TikaInputStream.skip(TikaInputStream.java:655)
    at java.io.FilterInputStream.skip(Unknown Source)
    at org.apache.poi.util.IOUtils.skipFully(IOUtils.java:364)
    at org.apache.poi.hemf.record.UnimplementedHemfRecord.init(UnimplementedHemfRecord.java:43)
    at org.apache.poi.hemf.extractor.HemfExtractor$HemfRecordIterator._next(HemfExtractor.java:101)
    at org.apache.poi.hemf.extractor.HemfExtractor$HemfRecordIterator.next(HemfExtractor.java:77)
    at org.apache.poi.hemf.extractor.HemfExtractor$HemfRecordIterator.next(HemfExtractor.java:60)
    at org.apache.tika.parser.microsoft.EMFParser.parse(EMFParser.java:82)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at dpf.sp.gpinf.indexer.parsers.IndexerDefaultParser.parse(IndexerDefaultParser.java:150)
    at dpf.sp.gpinf.indexer.io.ParsingReader$ParsingTask.run(ParsingReader.java:263)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
{code}
