[ https://issues.apache.org/jira/browse/TIKA-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16440788#comment-16440788 ]
Tim Allison commented on TIKA-2632:
-----------------------------------

bq. Turned out that someone else already investigated this case a month ago...

And that someone else is none other than [~anjackson], a good friend of Tika. :)

> Analyze unknown govdocs files
> -----------------------------
>
>                 Key: TIKA-2632
>                 URL: https://issues.apache.org/jira/browse/TIKA-2632
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Andreas Meier
>            Priority: Minor
>
> I recently started analyzing random govdocs1 files that Tika could not
> recognize properly.
>
> This ticket should be used to identify problems with old or proprietary files
> and to extend Tika step by step where needed.
>
> So far I have stumbled across the following file types/files:
>
> 1. Old PowerPoint files (I expect version 2.0 or 3.0) are not recognized
> properly:
> I found some mysterious files starting with 0xeddead0b and 0x0baddeed.
> It turned out that someone else already investigated this case a month ago:
> http://anjackson.net/2018/03/15/story-of-a-bad-deed/
> The files are old PowerPoint (PowerPoint 3.0 or 2.0).
> I think these magic strings should be added to tika-mimetypes.xml, along with
> another PowerPoint MIME type (maybe application/vnd.ms-powerpoint.2 or
> application/vnd.ms-powerpoint.3?).
> Example files in govdocs1:
> 144/144504.unk
> 272/272490.unk
> 430/430427.unk
> (several more...)
>
> 2. Proprietary file format: SigmaPlot Exchange File (.jxf):
> Magic: 0x8888000c4a5846
> Example files in govdocs1:
> 975/975382.unk
> 975/975383.unk
> (several more...)
>
> 3. There are two old Excel file variants that are not recognized at the moment
> (application/vnd.ms-excel.sheet.2):
> 376/376222.unk and 622/62252.unk start with 0x0900040007001000 instead of
> 0x0900040000001000
> 224/224485.unk and 615/615187.unk start with 0x0900040002001000 instead of
> 0x0900040000001000
> The magic for application/vnd.ms-excel.sheet.2 should be adapted:
> 0x02001000
> and
> 0x07001000
> must be added.
> Furthermore, we have to check whether the parser can be adapted to process all
> of the mentioned files.
> (LibreOffice can open all of these files.)
>
> 4. Special header/wrapper in front of application/vnd.ms-excel.sheet.3:
> In file 611/611703.unk I found a 128-byte header in front of the Excel
> file, so Tika could not recognize the file correctly.
> After I cut off the header, Tika could recognize and convert the file.
>
> 5. SAS data file
> Example file:
> 020/020505.unk
>
> 6. AirSar data (Airborne Synthetic Aperture Radar)
> Example file:
> 348/349489.unk (several more...)
>
> 7. Advanced Data Format (ADF)
> Used in CGNS (CFD General Notation System, .cgns)
> Example file:
> 363/363966.unk
>
> 8. Unknown Microsoft Word document
> Example file:
> 202/202718.unk
> (Recognized as a Microsoft Word document by Linux magic)
>
> 9. Unknown PowerPoint 3.0 file?
> Example file:
> 388/388212.unk
>
> 10. Microsoft Compound File Binary File Format?
> Example file:
> 857/857353.unk
>
> Let me know if I should open separate tickets for cases 1 and 3.
> If there is a better place (other than the mailing lists) to publish the
> analysis results, let me know.
>
> Regards
>
> Andreas

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
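
The magic values listed in the quoted ticket can be checked against local files with a short script. The following is only a minimal sketch in Python, not Tika's detection code: the function names `sniff` and `sniff_file`, the table `MAGIC_PREFIXES`, and the descriptive labels are hypothetical and chosen for illustration. The byte sequences themselves are copied from the ticket. The `offset` parameter is an assumption meant to cover case 4, where a 128-byte header precedes the actual file; the ticket does not give the Excel 3 magic, so it is not included here.

```python
# Magic prefixes copied from the ticket text. The labels are descriptive only
# and are NOT registered MIME types.
MAGIC_PREFIXES = [
    (bytes.fromhex("eddead0b"), "old PowerPoint (2.0 or 3.0)"),
    (bytes.fromhex("0baddeed"), "old PowerPoint (2.0 or 3.0)"),
    (bytes.fromhex("8888000c4a5846"), "SigmaPlot Exchange File (.jxf)"),
    (bytes.fromhex("0900040000001000"), "Excel 2 sheet (expected magic per ticket)"),
    (bytes.fromhex("0900040002001000"), "Excel 2 sheet (proposed variant)"),
    (bytes.fromhex("0900040007001000"), "Excel 2 sheet (proposed variant)"),
]


def sniff(header: bytes, offset: int = 0):
    """Return a label if the bytes at `offset` start with a known magic, else None."""
    window = header[offset:]
    for magic, label in MAGIC_PREFIXES:
        if window.startswith(magic):
            return label
    return None


def sniff_file(path: str, offset: int = 0):
    """Read just enough of the file to compare against the longest magic (8 bytes)."""
    with open(path, "rb") as f:
        return sniff(f.read(offset + 16), offset)
```

For example, `sniff_file("144/144504.unk")` would inspect the first bytes of one of the PowerPoint candidates named in the ticket, and passing `offset=128` skips a wrapper header of the size reported for case 4 before comparing.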