[ 
https://issues.apache.org/jira/browse/TIKA-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12875629#action_12875629
 ] 

Nick Burch commented on TIKA-391:
---------------------------------

At any point in the OLE2 document tree, there should only ever be one document, 
so you should be able to detect its type with 100% reliability.

To know what kind of OLE2 document you have, you do need to open it up at the 
POIFS level and look at the top-level directory entries. Check each name in 
turn until you find one you know about; that tells you what the document is. 
For embedded documents, where the embedded files end up depends on the 
containing document, but ExtractorFactory in POI should give you a good idea 
of how to handle it for most of the formats.
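
As a rough sketch of that name lookup (not POI's or Tika's actual code): the 
class Ole2TypeSketch and its guessType method are made up for illustration, 
and the set of names is assumed to have been read from the POIFS root 
directory by the caller. The entry names listed are the commonly seen ones, 
and the octet-stream fallback is just a placeholder choice.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not part of POI or Tika.
class Ole2TypeSketch {

    // Commonly seen top-level entry names, checked in insertion order.
    private static final Map<String, String> KNOWN_ENTRIES = new LinkedHashMap<>();
    static {
        KNOWN_ENTRIES.put("Workbook", "application/vnd.ms-excel");
        KNOWN_ENTRIES.put("Book", "application/vnd.ms-excel"); // older Excel files
        KNOWN_ENTRIES.put("WordDocument", "application/msword");
        KNOWN_ENTRIES.put("PowerPoint Document", "application/vnd.ms-powerpoint");
    }

    // topLevelNames: entry names from the root directory of the OLE2 file
    // (assumed to be collected elsewhere, e.g. via POIFS).
    static String guessType(Set<String> topLevelNames) {
        for (Map.Entry<String, String> known : KNOWN_ENTRIES.entrySet()) {
            if (topLevelNames.contains(known.getKey())) {
                return known.getValue();
            }
        }
        // Nothing recognised: fall back to a generic type (placeholder choice).
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        Set<String> names = new HashSet<>(Arrays.asList("Workbook", "SomeOtherEntry"));
        System.out.println(guessType(names));
    }
}
```

Because the decision depends only on which names are present, the same file 
always gives the same answer, unlike a match on leading bytes that several 
OLE2 formats share.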

For an OOXML document (e.g. .xlsx, .docx), you need to look at the top-level 
zip file entries. You should again only have one kind of document at any point 
in the tree, so you just need to look at the names until you spot one you know 
about. Embedded documents are in theory fairly simple with OOXML, as they seem 
to be stored as-is in the /embeddings/ subdirectory, but POI really needs a few 
more example files before we can write comprehensive unit tests for this.
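
A minimal sketch of the zip-entry check, using only java.util.zip. The class 
OoxmlTypeSketch and its guessType method are hypothetical, and the xl/, word/ 
and ppt/ prefixes are an assumption based on where the usual producers put 
the main parts, so treat it as an illustration rather than a complete 
detector.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper for illustration; not part of POI or Tika.
class OoxmlTypeSketch {

    // Walks the zip entries in order and returns on the first recognised name.
    // The caller keeps ownership of the stream and is expected to close it.
    static String guessType(InputStream in) throws IOException {
        ZipInputStream zip = new ZipInputStream(in);
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            String name = entry.getName();
            // Assumed conventional top-level directories for each format.
            if (name.startsWith("xl/")) {
                return "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
            }
            if (name.startsWith("word/")) {
                return "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
            }
            if (name.startsWith("ppt/")) {
                return "application/vnd.openxmlformats-officedocument.presentationml.presentation";
            }
        }
        // No known entry found: report it as a plain zip file.
        return "application/zip";
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: OoxmlTypeSketch <file>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.println(guessType(in));
        }
    }
}
```

A stricter check would read [Content_Types].xml instead of relying on 
directory names, since the package format does not strictly require these 
paths.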

> Intermittent errors detecting xls files
> ---------------------------------------
>
>                 Key: TIKA-391
>                 URL: https://issues.apache.org/jira/browse/TIKA-391
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 0.6
>            Reporter: Simon Tyler
>            Assignee: Chris A. Mattmann
>             Fix For: 0.8
>
>         Attachments: MimeTypes.java
>
>
> I am doing some testing of Tika 0.6 and noticed some odd results for the 
> testEXCEL.xls file included in the test suite. 
> 100 calls to the following code:
>  
>             is = new BufferedInputStream(new FileInputStream(filename));
>  
>             Metadata metadata = new Metadata();
>             metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
>  
>             String type = tika.detect(is, metadata);
>  
> Results in different matches as application/msword or 
> application/vnd.ms-excel seemingly at random.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
