[
https://issues.apache.org/jira/browse/TIKA-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12875629#action_12875629
]
Nick Burch commented on TIKA-391:
---------------------------------
At any point in the OLE2 document tree, there should only ever be one document,
so you should be able to detect it with 100% reliability.
To know what kind of OLE2 document you have, you do need to open it up at the
POIFS level and look at the top-level directory entries. Look for each name in
turn until you find one you know about, then you'll know what the document is.
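As a rough sketch of that lookup (not POI's or Tika's actual code), the check
can be written as a walk over a list of known names. The class
Ole2TypeGuesser and its method are hypothetical; the entry names are assumed
to have already been read from the POIFS root directory, and the name list
and the application/octet-stream fallback are illustrative choices only.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of POI or Tika.
class Ole2TypeGuesser {
    // Known top-level entry names, checked in insertion order.
    private static final Map<String, String> KNOWN_ENTRIES = new LinkedHashMap<>();
    static {
        KNOWN_ENTRIES.put("Workbook", "application/vnd.ms-excel");
        KNOWN_ENTRIES.put("Book", "application/vnd.ms-excel"); // older Excel files
        KNOWN_ENTRIES.put("WordDocument", "application/msword");
        KNOWN_ENTRIES.put("PowerPoint Document", "application/vnd.ms-powerpoint");
    }

    // topLevelEntryNames: the names found in the root directory of the OLE2
    // container (assumed to be supplied by the caller).
    static String guess(Collection<String> topLevelEntryNames) {
        for (Map.Entry<String, String> known : KNOWN_ENTRIES.entrySet()) {
            if (topLevelEntryNames.contains(known.getKey())) {
                return known.getValue();
            }
        }
        // Nothing recognised: fall back to a generic type (illustrative choice).
        return "application/octet-stream";
    }
}
```

Because the decision depends only on which names are present, the same file
should give the same answer on every call.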
For embedded documents, where the embedded files end up depends on the containing
document, but ExtractorFactory in POI should give you a good idea about how to
handle it for most of the formats.
For an OOXML document (e.g. .xlsx, .docx), you need to look at the top-level zip
file entries. You should again only have one kind of document at any point in
the tree, so you just need to look at the names until you spot one you know
about. Embedded documents are in theory fairly simple with OOXML, as they seem
to be stored as-is in the /embeddings/ subdirectory, but POI really needs a few
more example files before we can write comprehensive unit tests for this.
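A minimal sketch of the zip-level check, using only java.util.zip, might look
like the following. The class OoxmlTypeGuesser is hypothetical, and mapping
the word/, xl/ and ppt/ directory prefixes to the three base OOXML types is
an assumption made for illustration; it does not distinguish variants such as
macro-enabled files, which share the same directory names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of POI or Tika.
class OoxmlTypeGuesser {
    private static final String PREFIX = "application/vnd.openxmlformats-officedocument.";

    // Scans the zip entries in order and returns a type for the first
    // recognised top-level directory, or null if none is found.
    static String guess(InputStream in) throws IOException {
        ZipInputStream zip = new ZipInputStream(in);
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            String name = entry.getName();
            if (name.startsWith("word/")) {
                return PREFIX + "wordprocessingml.document";
            }
            if (name.startsWith("xl/")) {
                return PREFIX + "spreadsheetml.sheet";
            }
            if (name.startsWith("ppt/")) {
                return PREFIX + "presentationml.presentation";
            }
        }
        return null; // a zip file, but not one of the kinds checked here
    }
}
```

An embedded file such as word/embeddings/sheet.xlsx still starts with the
container's own prefix, so it does not change the answer for the outer
document.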
> Intermittent errors detecting xls files
> ---------------------------------------
>
> Key: TIKA-391
> URL: https://issues.apache.org/jira/browse/TIKA-391
> Project: Tika
> Issue Type: Bug
> Components: mime
> Affects Versions: 0.6
> Reporter: Simon Tyler
> Assignee: Chris A. Mattmann
> Fix For: 0.8
>
> Attachments: MimeTypes.java
>
>
> I am doing some testing of Tika 0.6 and noticed some odd results for the
> testEXCEL.xls file included in the test suite.
> 100 calls to the following code:
>
> InputStream is = new BufferedInputStream(new FileInputStream(filename));
>
> Metadata metadata = new Metadata();
> metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
>
> String type = tika.detect(is, metadata);
>
> Results in different matches as application/msword or
> application/vnd.ms-excel seemingly at random.