[ https://issues.apache.org/jira/browse/TIKA-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Meier updated TIKA-2632:
--------------------------------
    Description:
I recently started analyzing randomly selected govdocs1 files that TIKA could not recognize properly.

This ticket should be used to identify problems with old or proprietary files and to extend TIKA step by step if needed.

I stumbled across the following file types/files:

1. Old PowerPoint files (I expect version 2.0 or 3.0) are not recognized properly:
   I found some mysterious files starting with 0xeddead0b and 0x0baddeed.
   It turned out that someone else had already investigated this case a month ago:
   http://anjackson.net/2018/03/15/story-of-a-bad-deed/
   The files are old PowerPoint (PowerPoint 3.0 or 2.0).
   I think these magic strings should be added to tika-mimetypes.xml, together with another PowerPoint MIME type (maybe application/vnd.ms-powerpoint.2 or application/vnd.ms-powerpoint.3?).
   Example files in govdocs1:
   144/144504.unk
   272/272490.unk
   430/430427.unk
   (several more...)

2. Proprietary file format: SigmaPlot Exchange File (.jxf):
   Magic: 0x8888000c4a5846
   Example files in govdocs1:
   975/975382.unk
   975/975383.unk
   (several more...)

3. There are two old Excel file types which are not recognized at the moment (application/vnd.ms-excel.sheet.2):
   376/376222.unk and 622/62252.unk start with 0x0900040007001000 instead of 0x0900040000001000
   224/224485.unk and 615/615187.unk start with 0x0900040002001000 instead of 0x0900040000001000
   The magic for application/vnd.ms-excel.sheet.2 should be adapted: 0x02001000 and 0x07001000 must be added.
   Furthermore, we have to check whether the parser can be adapted to process all the mentioned files.
   (LibreOffice can open all of these files.)

4. 128-byte header in front of files
   Several files in the corpus start with a 128-byte header in front of the actual file.
   The header contains the filename and a specific file type (TEXTXCEL for 4.1 and SLD3PPT3 for 4.2).
   4.1 In file 611/611703.unk I found a 128-byte header in front of the Excel file (application/vnd.ms-excel.sheet.3).
       Because of the header, TIKA could not recognize the file correctly.
       After I cut off the header, TIKA could recognize and convert the file.
   4.2 The following files are old PowerPoint files with a leading 128-byte header:
       388/388212.unk
       775/775724.unk
       790/790351.unk

5. SAS data file
   Example file:
   020/020505.unk

6. AirSar data (airborne synthetic aperture radar)
   Example file:
   348/349489.unk (several more...)

7. Advanced Data Format (ADF)
   Used in CGNS (CFD General Notation System, .cgns)
   Example file:
   363/363966.unk

8. Unknown (old?) Microsoft Word document
   Example file:
   202/202718.unk
   (Recognized as Microsoft Word Document by Linux Magic)

9. Raw weather data from NWS/NOAA
   SXXX.. KWAL ...
   Example files:
   136/136247.unk
   400/400289.unk

10. Microsoft Compound File Binary File Format?
    Files of this type have already been handled by [~talli...@mitre.org] in TIKA-1813
    Example file:
    857/857353.unk

11. Old OCLC Bibliotheca files
    Bibliography files containing books, prints, songs, ...
    Example files:
    114/114440.unk
    030/030871.unk

12. Self describing data sets file
    Magic: SDDS
    Contains data in ASCII or binary format; it can be extracted via the SDDS Toolbox (there is even a Java SDDS library, proprietary license).
    https://ops.aps.anl.gov/SDDSIntroTalk/slides.html
    https://www.aps.anl.gov/Accelerator-Operations-Physics/Software
    Example file:
    599/599463.unk

Let me know if I should open separate tickets for cases 1 and 3!
If there is a better place (other than the mailing lists) to publish the analysis results, let me know.

Regards

Andreas

> Analyze unknown govdocs files
> -----------------------------
>
>                 Key: TIKA-2632
>                 URL: https://issues.apache.org/jira/browse/TIKA-2632
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Andreas Meier
>            Priority: Minor

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
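The magic values from cases 1, 2, 3 and 12 and the 128-byte header from case 4 can be tried out with a small standalone script before anything is changed in tika-mimetypes.xml. The following is only a minimal sketch in Python, not Tika code: the function names `sniff`, `strip_wrapper` and `identify` are hypothetical, the text labels are illustrative, and it assumes that the TEXTXCEL / SLD3PPT3 marker may appear anywhere inside the first 128 bytes, because the ticket does not state its exact offset.

```python
# Magic prefixes quoted in the ticket. The labels are illustrative only;
# the PowerPoint MIME type has not been decided yet.
MAGICS = [
    (bytes.fromhex("eddead0b"), "old PowerPoint (2.0/3.0)"),
    (bytes.fromhex("0baddeed"), "old PowerPoint (2.0/3.0)"),
    (bytes.fromhex("8888000c4a5846"), "SigmaPlot Exchange File (.jxf)"),
    (bytes.fromhex("0900040000001000"), "application/vnd.ms-excel.sheet.2"),
    (bytes.fromhex("0900040002001000"), "application/vnd.ms-excel.sheet.2"),
    (bytes.fromhex("0900040007001000"), "application/vnd.ms-excel.sheet.2"),
    (b"SDDS", "Self Describing Data Sets"),
]

HEADER_LEN = 128                              # header length from case 4
HEADER_MARKERS = (b"TEXTXCEL", b"SLD3PPT3")   # type strings from case 4


def sniff(data: bytes):
    """Return a label if the data starts with one of the known magics."""
    for magic, label in MAGICS:
        if data.startswith(magic):
            return label
    return None


def strip_wrapper(data: bytes) -> bytes:
    """Drop the leading 128-byte header if it carries a known marker.

    Assumption: the marker can sit anywhere in the first 128 bytes.
    """
    head = data[:HEADER_LEN]
    if len(data) > HEADER_LEN and any(m in head for m in HEADER_MARKERS):
        return data[HEADER_LEN:]
    return data


def identify(data: bytes):
    """Try the magics directly, then once more behind the 128-byte header."""
    return sniff(data) or sniff(strip_wrapper(data))
```

Calling `identify` on the first few hundred bytes of a govdocs1 file (for example `open(path, "rb").read(256)`) is enough for these checks. The magic for application/vnd.ms-excel.sheet.3 is not quoted in the ticket, so a file like 611/611703.unk is only unwrapped by this sketch, not identified.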