Andreas Meier created TIKA-2632:
-----------------------------------

             Summary: Analyze unknown govdocs files
                 Key: TIKA-2632
                 URL: https://issues.apache.org/jira/browse/TIKA-2632
             Project: Tika
          Issue Type: Improvement
            Reporter: Andreas Meier


I recently started analyzing random govdocs1 files that TIKA could not 
recognize properly.

 

This ticket should be used to identify problems with old or proprietary files 
and to extend TIKA step-by-step if needed.

 

Stumbled across the following problems:

 

1. Old PowerPoint files (I expect Version 2.0 or 3.0) are not recognized 
properly:

Found some mysterious files starting with 0xeddead0b and 0x0baddeed

It turned out that someone else already investigated this case a month ago:
http://anjackson.net/2018/03/15/story-of-a-bad-deed/

The files are old PowerPoint files (PowerPoint 3.0 or 2.0).
I think these magic strings should be added to tika-mimetypes.xml, together 
with another PowerPoint MIME type (maybe application/vnd.ms-powerpoint.2 or 
application/vnd.ms-powerpoint.3?).

Example files in govdocs1: 
144/144504.unk
272/272490.unk
430/430427.unk
(several more...)
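
A minimal sketch of the check described in case 1, assuming only the two 
four-byte signatures quoted above. The function name is hypothetical, and no 
MIME type is returned because the ticket leaves that name open. Note that the 
two signatures are byte-reversed copies of each other, which might point to 
the same 32-bit value stored in opposite byte orders; that reading is a guess.

```python
# Signatures quoted in case 1 (first four bytes of the file).
OLD_PPT_MAGICS = (bytes.fromhex("eddead0b"), bytes.fromhex("0baddeed"))


def looks_like_old_powerpoint(data: bytes) -> bool:
    """Return True if data starts with one of the two quoted signatures."""
    return data[:4] in OLD_PPT_MAGICS


# The second signature is the first one with its bytes reversed.
assert OLD_PPT_MAGICS[0][::-1] == OLD_PPT_MAGICS[1]
```

To try it on one of the example files, read the first bytes with 
open(path, "rb").read(4) and pass them to the function.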


2. Proprietary File Format: SigmaPlot Exchange File .jxf:
Magic: 0x8888000c4a5846
Example files in govdocs1:
975/975382.unk
975/975383.unk
 (several more...)
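
A small sketch for case 2, assuming the seven-byte signature quoted above is 
always at offset 0. The function name is hypothetical. Decoding the signature 
also shows that its last three bytes are the ASCII letters of the file 
extension.

```python
# Signature quoted in case 2.
JXF_MAGIC = bytes.fromhex("8888000c4a5846")


def looks_like_jxf(data: bytes) -> bool:
    """Return True if data starts with the quoted SigmaPlot JXF signature."""
    return data.startswith(JXF_MAGIC)


# Bytes 4..6 of the signature are printable ASCII.
print(JXF_MAGIC[4:].decode("ascii"))  # prints "JXF"
```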


3. Bitflip or valid Magic for application/vnd.ms-excel.sheet.2
In one file (376/376222.unk) I found
0x0900040007001000
instead of
0x0900040000001000

I guess a few bits just flipped for some reason (corruption during data 
transfer or something else).
If this variant also shows up in other files, the magic for 
application/vnd.ms-excel.sheet.2 should be adapted.
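
To make the difference in case 3 concrete, here is a rough sketch that 
compares the two eight-byte values quoted above and shows a masked comparison 
that would accept both. The helper names and the mask that ignores byte 4 are 
assumptions for illustration, not a proposal for the actual magic definition.

```python
EXPECTED = bytes.fromhex("0900040000001000")  # magic quoted above
OBSERVED = bytes.fromhex("0900040007001000")  # found in 376/376222.unk


def diff_bytes(a: bytes, b: bytes):
    """Return (offset, a_byte, b_byte, differing_bit_count) per mismatch."""
    return [
        (i, x, y, bin(x ^ y).count("1"))
        for i, (x, y) in enumerate(zip(a, b))
        if x != y
    ]


def matches_with_mask(data: bytes, magic: bytes, mask: bytes) -> bool:
    """Compare only the bits that are set in mask."""
    head = data[: len(magic)]
    if len(head) < len(magic):
        return False
    return all((d & m) == (g & m) for d, g, m in zip(head, magic, mask))


# Hypothetical mask: ignore byte 4, compare everything else exactly.
IGNORE_BYTE_4 = bytes.fromhex("ffffffff00ffffff")

print(diff_bytes(EXPECTED, OBSERVED))  # prints [(4, 0, 7, 3)]
```

The values differ only in the byte at offset 4 (0x00 versus 0x07), and three 
bits of that byte differ, so this is more than a single flipped bit.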


4. Special Header/Wrapper in front of application/vnd.ms-excel.sheet.3
In file 611/611703.unk I found a 128-byte header in front of the Excel 
file, so TIKA could not recognize the file correctly.

After I cut the header, the file could be recognized and converted by TIKA.
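
The manual step in case 4 could be sketched roughly as follows: look for a 
known magic within the first few hundred bytes and drop whatever precedes it. 
The function names and the 512-byte search limit are assumptions. The magic 
is a parameter because the ticket does not quote the signature for 
application/vnd.ms-excel.sheet.3.

```python
def find_magic(data: bytes, magic: bytes, max_offset: int = 512) -> int:
    """Return the first offset <= max_offset where magic occurs, or -1."""
    return data.find(magic, 0, max_offset + len(magic))


def strip_wrapper(data: bytes, magic: bytes, max_offset: int = 512) -> bytes:
    """Drop everything before the magic; return data unchanged if not found."""
    offset = find_magic(data, magic, max_offset)
    return data[offset:] if offset > 0 else data
```

For a file like the one described, find_magic would be expected to return 
128, and the stripped bytes could then be passed to TIKA for detection.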



Let me know if I should open a separate ticket for case 1.


If there is a better place (other than the mailing lists) to publish the 
analysis results, let me know.

 

Regards

 

Andreas



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
