[jira] [Updated] (TIKA-2632) Analyze unknown govdocs files

2018-04-19 Thread Andreas Meier (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Meier updated TIKA-2632:

Description: 
I recently started analyzing random govdocs1 files that TIKA could not 
recognize properly.

 

This ticket should be used to identify problems with old or proprietary files 
and to extend TIKA step-by-step if needed.

 

Stumbled across the following filetypes/files:

 
 1. Old PowerPoint files (I expect Version 2.0 or 3.0) are not recognized 
properly:

Found some mysterious files starting with 0xeddead0b and 0x0baddeed

Turned out that someone else already investigated this case a month ago:
 [link 
http://anjackson.net/2018/03/15/story-of-a-bad-deed/|http://anjackson.net/2018/03/15/story-of-a-bad-deed/]

The files are old PowerPoint files (PowerPoint 2.0 or 3.0).
 I think these magic strings should be added to tika-mimetypes.xml, along with 
another PowerPoint mime-type (maybe application/vnd.ms-powerpoint.2 or 
application/vnd.ms-powerpoint.3?).

Example files in govdocs1: 
 144/144504.unk
 272/272490.unk
 430/430427.unk
 (several more...)
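
As a rough illustration of the prefix check described in item 1, here is a minimal Python sketch. The helper name `looks_like_old_powerpoint` is hypothetical and this is not how TIKA performs detection (TIKA reads its magic from tika-mimetypes.xml); it only compares the first four bytes of a file against the two values reported above.

```python
# Hypothetical helper, not part of TIKA: checks the two 4-byte prefixes
# reported for the old PowerPoint files in this ticket.
OLD_PPT_MAGICS = (bytes.fromhex("eddead0b"), bytes.fromhex("0baddeed"))


def looks_like_old_powerpoint(data: bytes) -> bool:
    """Return True if data starts with one of the two reported prefixes."""
    return data[:4] in OLD_PPT_MAGICS
```

Reading only the first four bytes of each .unk file and passing them to this function should be enough to list further candidates in the corpus.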

2. Proprietary File Format: SigmaPlot Exchange File .jxf:
 Magic: 0x000c4a5846
 Example file in govdocs1:
 975/975382.unk
 975/975383.unk
  (several more...)

3. There are two old Excel file types which are not recognized at the moment 
(application/vnd.ms-excel.sheet.2):

376/376222.unk and 622/62252.unk start with 0x0900040007001000 instead of 
0x090004001000

224/224485.unk and 615/615187.unk start with 0x0900040002001000 instead of 
0x090004001000

The magic for application/vnd.ms-excel.sheet.2 should be adapted:
 0x02001000
 and
 0x07001000
 must be added.

Furthermore we have to check whether the parser can be adapted to process all 
the mentioned files.

(LibreOffice can open all of these files)
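
To make the two variants in item 3 concrete, here is a small Python sketch that tests a file prefix against the two 8-byte values quoted above. The helper name is hypothetical and this is not TIKA code; the ticket does not say what the varying byte means, so the sketch simply treats both prefixes as opaque constants.

```python
# Hypothetical helper, not TIKA code. The byte values are the ones quoted
# in this ticket; the meaning of the varying byte is not stated there.
EXCEL2_VARIANT_PREFIXES = (
    bytes.fromhex("0900040002001000"),
    bytes.fromhex("0900040007001000"),
)


def has_excel2_variant_prefix(data: bytes) -> bool:
    """Return True if data starts with one of the two unrecognized prefixes."""
    return data.startswith(EXCEL2_VARIANT_PREFIXES)
```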

4. 128-byte header in front of files 
 There are several files in the corpus that start with a 128-byte long header 
in front of the actual file.
 The header contains the filename and a specific filetype (TEXTXCEL for 4.1 and 
SLD3PPT3 for 4.2)

4.1 In file 611/611703.unk I found a 128-byte long header in front of the Excel 
file (application/vnd.ms-excel.sheet.3).
 Therefore the file could not be recognized correctly by TIKA.
 After I cut the header, the file could be recognized and converted by TIKA.

4.2 The following files are old PowerPoint files with a leading 128-byte header
 388/388212.unk
 775/775724.unk
 790/790351.unk
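
The manual step from item 4 (cut the header, then retry detection) could be sketched in Python as follows. This is a hypothetical helper, not TIKA code. The ticket does not give the position of the type code inside the header, so the sketch searches the whole 128-byte block and removes it only when one of the two reported codes is present.

```python
HEADER_LENGTH = 128  # header size reported in this ticket
KNOWN_TYPE_CODES = (b"TEXTXCEL", b"SLD3PPT3")  # codes reported in this ticket


def strip_leading_header(data: bytes) -> bytes:
    """Drop the first 128 bytes if they contain one of the reported codes.

    Hypothetical helper, not TIKA code. The offset of the code inside the
    header is not given in the ticket, so the whole header is searched.
    """
    header = data[:HEADER_LENGTH]
    has_code = any(code in header for code in KNOWN_TYPE_CODES)
    if len(header) == HEADER_LENGTH and has_code:
        return data[HEADER_LENGTH:]
    # Leave files without a recognizable header untouched.
    return data
```

The returned bytes can then be handed to the normal detection step, mirroring what was done by hand for 611/611703.unk.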

5. SAS Data file
 Example file:
 020/020505.unk

6. AirSar Data (Airborne Synthetic Aperture Radar)
 Example file:
 348/349489.unk (several more...)

7. Advanced Data Format (ADF)
 Used in CGNS (CFD General Notation System .cgns)
 Example file:
 363/363966.unk

8. Unknown (old?) Microsoft Word Document
 Example file:
 202/202718.unk
 (Recognized as Microsoft Word Document by Linux Magic)

9. Raw weather data by nws noaa
 SXXX.. KWAL ...

Example files:
 136/136247.unk
 400/400289.unk

10. Microsoft Compound File Binary File Format?
 Files of this type have already been handled by [~talli...@mitre.org] in 
TIKA-1813
 Example file:
 857/857353.unk


11. Old OCLC Bibliotheca files
Bibliography files containing books, prints, songs, ...
Example files:
114/114440.unk
030/030871.unk
 

12. Self Describing Data Sets (SDDS) file
Magic: SDDS
Contains data in ASCII or binary format, can be extracted via SDDS Toolbox 
(there is even a Java SDDS library, proprietary license)
[link 
https://ops.aps.anl.gov/SDDSIntroTalk/slides.html|https://ops.aps.anl.gov/SDDSIntroTalk/slides.html]
[link 
https://www.aps.anl.gov/Accelerator-Operations-Physics/Software|https://www.aps.anl.gov/Accelerator-Operations-Physics/Software]
Example file:
599/599463.unk


Let me know if I should open separate tickets for cases 1 and 3.

If there is a better place (other than the mailing lists) to publish the 
analysis results, let me know.

 

Regards

 

Andreas


[jira] [Commented] (TIKA-2635) Require imageMagick path be specified on Windows OS

2018-04-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444353#comment-16444353
 ] 

Hudson commented on TIKA-2635:
--

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #238 (See 
[https://builds.apache.org/job/tika-2.x-windows/238/])
TIKA-2635 -- require that user specify path for imagemagick on windows 
(tallison: rev dae7c0100df748c181729ac921e43e0917709a66)
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java


> Require imageMagick path be specified on Windows OS
> ---
>
> Key: TIKA-2635
> URL: https://issues.apache.org/jira/browse/TIKA-2635
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 1.18, 2.0.0
>
>
> Our optional image preprocessing with imagemagick can run into problems on 
> Windows machines where the executable `convert` is a system command, not the 
> imagemagick executable.
> I propose that on Windows, we require users to specify a path for imagemagick.
> If there are other system 'convert' commands on other operating systems, 
> should we require that imagemagick be in the path?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2634) Upgrade Jackson to 2.9.5

2018-04-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444303#comment-16444303
 ] 

Hudson commented on TIKA-2634:
--

SUCCESS: Integrated in Jenkins build tika-branch-1x #23 (See 
[https://builds.apache.org/job/tika-branch-1x/23/])
TIKA-2634 upgrade Jackson to 2.9.5 (tallison: 
[https://github.com/apache/tika/commit/bb7adacb2b19e7d5f645ed3d6dbcde4d461f1f44])
* (edit) tika-translate/pom.xml
* (edit) tika-parsers/pom.xml
* (edit) tika-parent/pom.xml
* (edit) tika-nlp/pom.xml
TIKA-2634 upgrade Jackson to 2.9.5 (tallison: 
[https://github.com/apache/tika/commit/a8b41d30e2d2065f53f6c0707980586d64f246f2])
* (edit) CHANGES.txt


> Upgrade Jackson to 2.9.5
> 
>
> Key: TIKA-2634
> URL: https://issues.apache.org/jira/browse/TIKA-2634
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Fix For: 1.18, 2.0.0
>
>






[jira] [Commented] (TIKA-2635) Require imageMagick path be specified on Windows OS

2018-04-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444304#comment-16444304
 ] 

Hudson commented on TIKA-2635:
--

SUCCESS: Integrated in Jenkins build tika-branch-1x #23 (See 
[https://builds.apache.org/job/tika-branch-1x/23/])
TIKA-2635 -- require that user specify path for imagemagick on windows 
(tallison: 
[https://github.com/apache/tika/commit/e84d0d56ada5f156ae308347b0c77c0ff281a9b7])
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java


> Require imageMagick path be specified on Windows OS
> ---
>
> Key: TIKA-2635
> URL: https://issues.apache.org/jira/browse/TIKA-2635
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 1.18, 2.0.0
>
>
> Our optional image preprocessing with imagemagick can run into problems on 
> Windows machines where the executable `convert` is a system command, not the 
> imagemagick executable.
> I propose that on Windows, we require users to specify a path for imagemagick.
> If there are other system 'convert' commands on other operating systems, 
> should we require that imagemagick be in the path?





[jira] [Commented] (TIKA-2635) Require imageMagick path be specified on Windows OS

2018-04-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444284#comment-16444284
 ] 

Hudson commented on TIKA-2635:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1473 (See 
[https://builds.apache.org/job/Tika-trunk/1473/])
TIKA-2635 -- require that user specify path for imagemagick on windows 
(tallison: 
[https://github.com/apache/tika/commit/dae7c0100df748c181729ac921e43e0917709a66])
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java


> Require imageMagick path be specified on Windows OS
> ---
>
> Key: TIKA-2635
> URL: https://issues.apache.org/jira/browse/TIKA-2635
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 1.18, 2.0.0
>
>
> Our optional image preprocessing with imagemagick can run into problems on 
> Windows machines where the executable `convert` is a system command, not the 
> imagemagick executable.
> I propose that on Windows, we require users to specify a path for imagemagick.
> If there are other system 'convert' commands on other operating systems, 
> should we require that imagemagick be in the path?





RE: [apache/tika] Fix for TIKA-2570 contributed by ewanmellor. (#219)

2018-04-19 Thread Allison, Timothy B.
I think I've answered my own question.  Unless there are objections, I'll 
cancel the RC2 vote and roll RC3 tomorrow.  

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Thursday, April 19, 2018 7:03 AM
To: dev@tika.apache.org
Subject: FW: [apache/tika] Fix for TIKA-2570 contributed by ewanmellor. (#219)

Cancel RC2 and respin RC3?

Deserialization vulnerability is a doozy.


From: Julian Reschke [mailto:notificati...@github.com]
Sent: Thursday, April 19, 2018 6:35 AM
To: apache/tika
Cc: Tim Allison; State change
Subject: Re: [apache/tika] Fix for TIKA-2570 contributed by ewanmellor. (#219)


@cygri - you probably should open a separate ticket on Jira.



[jira] [Resolved] (TIKA-2635) Require imageMagick path be specified on Windows OS

2018-04-19 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2635.
---
   Resolution: Fixed
Fix Version/s: 2.0.0
   1.18

> Require imageMagick path be specified on Windows OS
> ---
>
> Key: TIKA-2635
> URL: https://issues.apache.org/jira/browse/TIKA-2635
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
> Fix For: 1.18, 2.0.0
>
>
> Our optional image preprocessing with imagemagick can run into problems on 
> Windows machines where the executable `convert` is a system command, not the 
> imagemagick executable.
> I propose that on Windows, we require users to specify a path for imagemagick.
> If there are other system 'convert' commands on other operating systems, 
> should we require that imagemagick be in the path?





[jira] [Resolved] (TIKA-2634) Upgrade Jackson to 2.9.5

2018-04-19 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-2634.
---
   Resolution: Fixed
Fix Version/s: 2.0.0
   1.18

> Upgrade Jackson to 2.9.5
> 
>
> Key: TIKA-2634
> URL: https://issues.apache.org/jira/browse/TIKA-2634
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
> Fix For: 1.18, 2.0.0
>
>






[jira] [Updated] (TIKA-2635) Require imageMagick path be specified on Windows OS

2018-04-19 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-2635:
--
Priority: Minor  (was: Major)

> Require imageMagick path be specified on Windows OS
> ---
>
> Key: TIKA-2635
> URL: https://issues.apache.org/jira/browse/TIKA-2635
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Minor
>
> Our optional image preprocessing with imagemagick can run into problems on 
> Windows machines where the executable `convert` is a system command, not the 
> imagemagick executable.
> I propose that on Windows, we require users to specify a path for imagemagick.
> If there are other system 'convert' commands on other operating systems, 
> should we require that imagemagick be in the path?





[jira] [Commented] (TIKA-2634) Upgrade Jackson to 2.9.5

2018-04-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444134#comment-16444134
 ] 

Hudson commented on TIKA-2634:
--

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #236 (See 
[https://builds.apache.org/job/tika-2.x-windows/236/])
TIKA-2634 upgrade jackson to 2.9.5 (tallison: rev 
d6503f54d19526c7d5b807ce01ed92ed69f60cf8)
* (edit) tika-nlp/pom.xml
* (edit) tika-parent/pom.xml
* (edit) tika-translate/pom.xml
* (edit) CHANGES.txt
* (edit) tika-parsers/pom.xml


> Upgrade Jackson to 2.9.5
> 
>
> Key: TIKA-2634
> URL: https://issues.apache.org/jira/browse/TIKA-2634
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>






[jira] [Created] (TIKA-2635) Require imageMagick path be specified on Windows OS

2018-04-19 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2635:
-

 Summary: Require imageMagick path be specified on Windows OS
 Key: TIKA-2635
 URL: https://issues.apache.org/jira/browse/TIKA-2635
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison


Our optional image preprocessing with imagemagick can run into problems on 
Windows machines where the executable `convert` is a system command, not the 
imagemagick executable.

I propose that on Windows, we require users to specify a path for imagemagick.

If there are other system 'convert' commands on other operating systems, should 
we require that imagemagick be in the path?
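
The proposal could be sketched roughly as follows in Python. This is a hypothetical illustration, not the change made in TesseractOCRParser.java; the function name, the parameters, and the use of a Java-style "os.name" string are all assumptions made for the example.

```python
from typing import Optional


def resolve_convert_command(os_name: str, imagemagick_path: Optional[str]) -> str:
    """Pick the command used to invoke ImageMagick's convert.

    Hypothetical sketch of the proposal, not Tika's implementation.
    os_name is expected to look like Java's "os.name", e.g. "Windows 10".
    """
    if imagemagick_path:
        # An explicit path always wins. "/" is used as the separator here
        # purely for simplicity.
        return imagemagick_path.rstrip("/\\") + "/convert"
    if os_name.lower().startswith("windows"):
        # On Windows a bare "convert" may resolve to the system command.
        raise ValueError("ImageMagick path must be specified on Windows")
    return "convert"
```

Failing early on Windows avoids silently running the system `convert` command instead of the imagemagick executable.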





[jira] [Commented] (TIKA-2634) Upgrade Jackson to 2.9.5

2018-04-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444074#comment-16444074
 ] 

Hudson commented on TIKA-2634:
--

UNSTABLE: Integrated in Jenkins build Tika-trunk #1472 (See 
[https://builds.apache.org/job/Tika-trunk/1472/])
TIKA-2634 upgrade jackson to 2.9.5 (tallison: 
[https://github.com/apache/tika/commit/d6503f54d19526c7d5b807ce01ed92ed69f60cf8])
* (edit) tika-parent/pom.xml
* (edit) tika-parsers/pom.xml
* (edit) tika-translate/pom.xml
* (edit) tika-nlp/pom.xml
* (edit) CHANGES.txt


> Upgrade Jackson to 2.9.5
> 
>
> Key: TIKA-2634
> URL: https://issues.apache.org/jira/browse/TIKA-2634
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>






[jira] [Commented] (TIKA-2570) Tika 1.17 uses vulnerable Jackson version 2.9.2

2018-04-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443992#comment-16443992
 ] 

ASF GitHub Bot commented on TIKA-2570:
--

tballison commented on issue #219: Fix for TIKA-2570 contributed by ewanmellor.
URL: https://github.com/apache/tika/pull/219#issuecomment-382718438
 
 
   @reschke +1 Just did this 
[TIKA-2634](https://issues.apache.org/jira/browse/TIKA-2634)


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Tika 1.17 uses vulnerable Jackson version 2.9.2
> ---
>
> Key: TIKA-2570
> URL: https://issues.apache.org/jira/browse/TIKA-2570
> Project: Tika
>  Issue Type: Task
>Reporter: Julian Reschke
>Priority: Minor
> Fix For: 1.18, 2.0.0
>
>
> See https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2017-17485





[jira] [Commented] (TIKA-2570) Tika 1.17 uses vulnerable Jackson version 2.9.2

2018-04-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443993#comment-16443993
 ] 

ASF GitHub Bot commented on TIKA-2570:
--

tballison commented on issue #219: Fix for TIKA-2570 contributed by ewanmellor.
URL: https://github.com/apache/tika/pull/219#issuecomment-382718438
 
 
   @reschke +1 Just did this 
[TIKA-2634](https://issues.apache.org/jira/browse/TIKA-2634)  Thank you @cygri 
for alerting us!




> Tika 1.17 uses vulnerable Jackson version 2.9.2
> ---
>
> Key: TIKA-2570
> URL: https://issues.apache.org/jira/browse/TIKA-2570
> Project: Tika
>  Issue Type: Task
>Reporter: Julian Reschke
>Priority: Minor
> Fix For: 1.18, 2.0.0
>
>
> See https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2017-17485





[jira] [Commented] (TIKA-2634) Upgrade Jackson to 2.9.5

2018-04-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443994#comment-16443994
 ] 

Tim Allison commented on TIKA-2634:
---

[~cygri] made this recommendation on TIKA-2570's github PR

> Upgrade Jackson to 2.9.5
> 
>
> Key: TIKA-2634
> URL: https://issues.apache.org/jira/browse/TIKA-2634
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>






[jira] [Created] (TIKA-2634) Upgrade Jackson to 2.9.5

2018-04-19 Thread Tim Allison (JIRA)
Tim Allison created TIKA-2634:
-

 Summary: Upgrade Jackson to 2.9.5
 Key: TIKA-2634
 URL: https://issues.apache.org/jira/browse/TIKA-2634
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison








FW: [apache/tika] Fix for TIKA-2570 contributed by ewanmellor. (#219)

2018-04-19 Thread Allison, Timothy B.
Cancel RC2 and respin RC3?

Deserialization vulnerability is a doozy.


From: Julian Reschke [mailto:notificati...@github.com]
Sent: Thursday, April 19, 2018 6:35 AM
To: apache/tika
Cc: Tim Allison; State change
Subject: Re: [apache/tika] Fix for TIKA-2570 contributed by ewanmellor. (#219)


@cygri - you probably should open a separate ticket on Jira.



[jira] [Commented] (TIKA-2570) Tika 1.17 uses vulnerable Jackson version 2.9.2

2018-04-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443858#comment-16443858
 ] 

ASF GitHub Bot commented on TIKA-2570:
--

reschke commented on issue #219: Fix for TIKA-2570 contributed by ewanmellor.
URL: https://github.com/apache/tika/pull/219#issuecomment-382688818
 
 
   @cygri - you probably should open a separate ticket on Jira.




> Tika 1.17 uses vulnerable Jackson version 2.9.2
> ---
>
> Key: TIKA-2570
> URL: https://issues.apache.org/jira/browse/TIKA-2570
> Project: Tika
>  Issue Type: Task
>Reporter: Julian Reschke
>Priority: Minor
> Fix For: 1.18, 2.0.0
>
>
> See https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2017-17485





[jira] [Commented] (TIKA-2570) Tika 1.17 uses vulnerable Jackson version 2.9.2

2018-04-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443774#comment-16443774
 ] 

ASF GitHub Bot commented on TIKA-2570:
--

cygri commented on issue #219: Fix for TIKA-2570 contributed by ewanmellor.
URL: https://github.com/apache/tika/pull/219#issuecomment-382667089
 
 
   Looks like 2.9.4 might have problems too, fixed in 2.9.5: 
https://nvd.nist.gov/vuln/detail/CVE-2018-7489




> Tika 1.17 uses vulnerable Jackson version 2.9.2
> ---
>
> Key: TIKA-2570
> URL: https://issues.apache.org/jira/browse/TIKA-2570
> Project: Tika
>  Issue Type: Task
>Reporter: Julian Reschke
>Priority: Minor
> Fix For: 1.18, 2.0.0
>
>
> See https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2017-17485





[jira] [Updated] (TIKA-2632) Analyze unknown govdocs files

2018-04-19 Thread Andreas Meier (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Meier updated TIKA-2632:

Description: 
I recently started analyzing random govdocs1 files that TIKA could not 
recognize properly.

 

This ticket should be used to identify problems with old or proprietary files 
and to extend TIKA step-by-step if needed.

 

Stumbled across the following filetypes/files:

 
 1. Old PowerPoint files (I expect Version 2.0 or 3.0) are not recognized 
properly:

Found some mysterious files starting with 0xeddead0b and 0x0baddeed

Turned out that someone else already investigated this case a month ago:
 [link 
http://anjackson.net/2018/03/15/story-of-a-bad-deed/|http://anjackson.net/2018/03/15/story-of-a-bad-deed/]

The files are old PowerPoint files (PowerPoint 2.0 or 3.0).
 I think these magic strings should be added to tika-mimetypes.xml, along with 
another PowerPoint mime-type (maybe application/vnd.ms-powerpoint.2 or 
application/vnd.ms-powerpoint.3?).

Example files in govdocs1: 
 144/144504.unk
 272/272490.unk
 430/430427.unk
 (several more...)

2. Proprietary File Format: SigmaPlot Exchange File .jxf:
 Magic: 0x000c4a5846
 Example file in govdocs1:
 975/975382.unk
 975/975383.unk
  (several more...)

3. There are two old Excel file types which are not recognized at the moment 
(application/vnd.ms-excel.sheet.2):

376/376222.unk and 622/62252.unk start with 0x0900040007001000 instead of 
0x090004001000

224/224485.unk and 615/615187.unk start with 0x0900040002001000 instead of 
0x090004001000

The magic for application/vnd.ms-excel.sheet.2 should be adapted:
 0x02001000
 and
 0x07001000
 must be added.

Furthermore we have to check whether the parser can be adapted to process all 
the mentioned files.

(LibreOffice can open all of these files)

4. 128-byte header in front of files 
 There are several files in the corpus that start with a 128-byte long header 
in front of the actual file.
 The header contains the filename and a specific filetype (TEXTXCEL for 4.1 and 
SLD3PPT3 for 4.2)

4.1 In file 611/611703.unk I found a 128-byte long header in front of the Excel 
file (application/vnd.ms-excel.sheet.3).
 Therefore the file could not be recognized correctly by TIKA.
 After I cut the header, the file could be recognized and converted by TIKA.

4.2 The following files are old PowerPoint files with a leading 128-byte header
 388/388212.unk
 775/775724.unk
 790/790351.unk

5. SAS Data file
 Example file:
 020/020505.unk

6. AirSar Data (Airborne Synthetic Aperture Radar)
 Example file:
 348/349489.unk (several more...)

7. Advanced Data Format (ADF)
 Used in CGNS (CFD General Notation System .cgns)
 Example file:
 363/363966.unk

8. Unknown (old?) Microsoft Word Document
 Example file:
 202/202718.unk
 (Recognized as Microsoft Word Document by Linux Magic)

9. Raw weather data by nws noaa
 SXXX.. KWAL ...

Example files:
 136/136247.unk
 400/400289.unk

10. Microsoft Compound File Binary File Format?
 Files of this type have already been handled by [~talli...@mitre.org] in 
TIKA-1813
 Example file:
 857/857353.unk


11. Old OCLC Bibliotheca files
Bibliography files containing books, prints, songs, ...
Example files:
114/114440.unk
030/030871.unk
 

Let me know if I should open separate tickets for cases 1 and 3.

If there is a better place (other than the mailing lists) to publish the 
analysis results, let me know.

 

Regards

 

Andreas


[jira] [Updated] (TIKA-2632) Analyze unknown govdocs files

2018-04-19 Thread Andreas Meier (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Meier updated TIKA-2632:

Description: 
I recently started analyzing random govdocs1 files that TIKA could not 
recognize properly.

 

This ticket should be used to identify problems with old or proprietary files 
and to extend TIKA step-by-step if needed.

 

Stumbled across the following filetypes/files:

 
 1. Old PowerPoint files (I expect Version 2.0 or 3.0) are not recognized 
properly:

Found some mysterious files starting with 0xeddead0b and 0x0baddeed

Turned out that someone else already investigated this case a month ago:
 [link 
http://anjackson.net/2018/03/15/story-of-a-bad-deed/|http://anjackson.net/2018/03/15/story-of-a-bad-deed/]

The files are old PowerPoint files (PowerPoint 2.0 or 3.0).
 I think these magic strings should be added to tika-mimetypes.xml, along with 
another PowerPoint mime-type (maybe application/vnd.ms-powerpoint.2 or 
application/vnd.ms-powerpoint.3?).

Example files in govdocs1: 
 144/144504.unk
 272/272490.unk
 430/430427.unk
 (several more...)

2. Proprietary File Format: SigmaPlot Exchange File .jxf:
 Magic: 0x000c4a5846
 Example file in govdocs1:
 975/975382.unk
 975/975383.unk
  (several more...)

3. There are two old Excel file types which are not recognized at the moment 
(application/vnd.ms-excel.sheet.2):

376/376222.unk and 622/62252.unk start with 0x0900040007001000 instead of 
0x090004001000

224/224485.unk and 615/615187.unk start with 0x0900040002001000 instead of 
0x090004001000

The magic for application/vnd.ms-excel.sheet.2 should be adapted:
 0x02001000
 and
 0x07001000
 must be added.

Furthermore we have to check whether the parser can be adapted to process all 
the mentioned files.

(LibreOffice can open all of these files)

4. 128-byte header in front of files 
 There are several files in the corpus that start with a 128-byte long header 
in front of the actual file.
 The header contains the filename and a specific filetype (TEXTXCEL for 4.1 and 
SLD3PPT3 for 4.2)

4.1 In file 611/611703.unk I found a 128-byte long header in front of the Excel 
file (application/vnd.ms-excel.sheet.3).
 Therefore the file could not be recognized correctly by TIKA.
 After I cut the header, the file could be recognized and converted by TIKA.

4.2 The following files are old PowerPoint files with a leading 128-byte header
 388/388212.unk
 775/775724.unk
 790/790351.unk

5. SAS Data file
 Example file:
 020/020505.unk

6. AirSar Data (Airborne Synthetic Aperture Radar)
 Example file:
 348/349489.unk (several more...)

7. Advanced Data Format (ADF)
 Used in CGNS (CFD General Notation System .cgns)
 Example file:
 363/363966.unk

8. Unknown (old?) Microsoft Word Document
 Example file:
 202/202718.unk
 (Recognized as Microsoft Word Document by Linux Magic)

9. Raw weather data by nws noaa
 SXXX.. KWAL ...

Example files:
 136/136247.unk
 400/400289.unk

10. Microsoft Compound File Binary File Format?
 Files of this type have already been handled by [~talli...@mitre.org] in 
TIKA-1813
 Example file:
 857/857353.unk

 

11. Old OCLC Bibliotheca files

Example files:

 

Let me know if I should open separate tickets for cases 1 and 3.

If there is a better place (other than the mailing lists) to publish the 
analysis results, let me know.

 

Regards

 

Andreas
