[ https://issues.apache.org/jira/browse/TIKA-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167012#comment-15167012 ]
Nick Burch commented on TIKA-1873:
----------------------------------
Interesting stuff! I'd skip most container-based formats though, and
especially OLE2 formats. With OLE2, the only bit you can be sure of is the
header block at the start (512 or 4096 bytes, i.e. one block), which basically
says "I'm OLE2". After that, the blocks can go in any order, so one file could
have the first bit of Word data starting at byte 513, another could have it in
the last 512 bytes of the file, and both are valid!
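
To make the point concrete, here is a minimal sketch of the one check that is
reliable from raw bytes alone: the 8-byte OLE2 signature, plus the block
(sector) size recorded in the header. The class and method names are made up
for illustration and are not part of Tika or POI; the offsets are the ones I
understand the compound file header to use, so treat them as an assumption and
check them against the format specification.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical helper for illustration only; not Tika or POI code.
class Ole2HeaderSketch {
    // The 8-byte signature that opens an OLE2 (compound file) container.
    static final byte[] SIGNATURE = {
        (byte) 0xD0, (byte) 0xCF, (byte) 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, (byte) 0x1A, (byte) 0xE1
    };

    // True if the given bytes start with the OLE2 signature.
    static boolean looksLikeOle2(byte[] prefix) {
        return prefix.length >= SIGNATURE.length
            && Arrays.equals(Arrays.copyOf(prefix, SIGNATURE.length), SIGNATURE);
    }

    // Block size from the header: the sector shift is assumed to be a
    // little-endian 16-bit value at offset 30 (9 means 512, 12 means 4096).
    // Returns -1 if the bytes are not an OLE2 header or the shift is unexpected.
    static int sectorSize(byte[] header) {
        if (!looksLikeOle2(header) || header.length < 32) {
            return -1;
        }
        int shift = ByteBuffer.wrap(header)
            .order(ByteOrder.LITTLE_ENDIAN)
            .getShort(30) & 0xFFFF;
        return (shift == 9 || shift == 12) ? (1 << shift) : -1;
    }
}
```

Anything beyond this, such as telling Word from Excel or Visio, means walking
the container's directory entries, because no fixed byte offset after the
header is guaranteed to hold a particular stream.
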
> Test Cases failed when tika-mimetypes.xml is changed
> ----------------------------------------------------
>
> Key: TIKA-1873
> URL: https://issues.apache.org/jira/browse/TIKA-1873
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Reporter: Antriksh Saxena
> Labels: test
>
> Test cases fail when Tika is built after updating tika-mimetypes.xml.
> The failure logs are as follows:
> {code}
> TestContainerAwareDetector.testTruncatedFiles:395
> expected:<application/x-tika-msoffice> but was:<application/msword>
> TestMimeTypes.testOLE2Detection:138->assertTypeByData:1045
> expected:<application/[x-tika-msoffice]> but was:<application/[msword]>
> TestMimeTypes.testOldExcel:251->assertTypeByData:1045
> expected:<application/[x-tika-msoffice]> but was:<application/[msword]>
> TestMimeTypes.testVisioDetection:305->assertTypeByNameAndData:1071
> expected:<application/[vnd.visio]> but was:<application/[msword]>
> ExcelParserTest.testExcel95:320
> expected:<application/[vnd.ms-excel]> but was:<application/[msword]>
> {code}