Re: renaming master?
Hi all,

Apologies for not being able to be very involved over the past few years, but still trying to follow along and hoping to get time to contribute in the future.

Another option might be ‘stable’?

- Ray

> On Jun 16, 2020, at 1:31 PM, Tim Allison wrote:
>
> All,
>
> As you may have seen, there's a movement to rename the "master" branch to
> "main" or "trunk" (at least in the U.S.)[1][2]. Github is doing this, and
> I personally think this makes sense.
>
> Are there any objections if we change "master"? If we do change it, is
> there a preference for "main", "trunk" or something else?
>
> My personal preference would be for trunk, but I'm open.
>
> Best,
>
> Tim
>
> [1] https://www.zdnet.com/article/github-to-replace-master-with-alternative-term-to-avoid-slavery-references/
> [2] https://www.bbc.com/news/technology-53050955
[jira] [Commented] (TIKA-2056) Installing exiftool causes ForkParserIntegration test errors
[ https://issues.apache.org/jira/browse/TIKA-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436705#comment-15436705 ] Ray Gauss II commented on TIKA-2056: My guess is that when Exiftool is available on the command line the existing [external parser is enabled|https://github.com/apache/tika/blob/master/tika-core/src/main/resources/org/apache/tika/parser/external/tika-external-parsers.xml] as part of the {{CompositeExternalParser}} which would get included in the {{AutoDetectParser}} and something in that chain is failing serialization. Perhaps because [ExternalParser.LineConsumer|https://github.com/apache/tika/blob/master/tika-core/src/main/java/org/apache/tika/parser/external/ExternalParser.java#L59] is not Serializable? > Installing exiftool causes ForkParserIntegration test errors > > > Key: TIKA-2056 > URL: https://issues.apache.org/jira/browse/TIKA-2056 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.14 >Reporter: Chris A. Mattmann > > [~rgauss] maybe you can help me with this. For some reason when I was trying > your PR, I got all sorts of weird errors that I thought had to do with your > PR, but in fact, had to do with Fork Parser Integration test. [~kkrugler] > I've seen you've contributed to the Fork parser tests so tagging you on this > too. Any reason you guys can think of that exiftool causes the Fork parser > integration tests to fail? > Here's the log msg (that I thought was due to the Sentiment parser, but is in > fact not!): > {noformat} > [INFO] Changes detected - recompiling the module! > [INFO] Compiling 124 source files to > /Users/mattmann/tmp/tika1.14/tika-parsers/target/test-classes > [INFO] > /Users/mattmann/tmp/tika1.14/tika-parsers/src/test/java/org/apache/tika/parser/odf/ODFParserTest.java: > Some input files use or override a deprecated API. 
> [INFO] > /Users/mattmann/tmp/tika1.14/tika-parsers/src/test/java/org/apache/tika/parser/odf/ODFParserTest.java: > Recompile with -Xlint:deprecation for details. > [INFO] > [INFO] --- maven-surefire-plugin:2.18.1:test (default-test) @ tika-parsers --- > [INFO] Surefire report directory: > /Users/mattmann/tmp/tika1.14/tika-parsers/target/surefire-reports > --- > T E S T S > --- > Running org.apache.tika.parser.fork.ForkParserIntegrationTest > Tests run: 5, Failures: 1, Errors: 3, Skipped: 0, Time elapsed: 2.46 sec <<< > FAILURE! - in org.apache.tika.parser.fork.ForkParserIntegrationTest > testForkedTextParsing(org.apache.tika.parser.fork.ForkParserIntegrationTest) > Time elapsed: 0.185 sec <<< ERROR! > org.apache.tika.exception.TikaException: Unable to serialize AutoDetectParser > to pass to the Forked Parser > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) > at java.util.ArrayList.writeObject(ArrayList.java:762) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) > at java.util.ArrayList.writeObject(ArrayList.java:762) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method
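[Editor's note] The non-Serializable-field hypothesis above is easy to demonstrate outside Tika. Below is a minimal stand-alone sketch (the class, interface, and method names here are hypothetical, not Tika's actual code): a Serializable object holding a functional-interface field fails in ObjectOutputStream the same way the ForkParser integration test does.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationCheck {
    // Hypothetical stand-in for ExternalParser's LineConsumer interface.
    interface LineConsumer {
        void consume(String line);
    }

    // A Serializable "parser" holding a non-Serializable field,
    // mirroring the hypothesis in the comment above.
    static class ParserLike implements Serializable {
        private final LineConsumer consumer = line -> { };
    }

    // Returns true if the object survives Java serialization.
    public static boolean canSerialize(Object o) {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (IOException e) {
            // NotSerializableException is a subclass of IOException;
            // this is the branch the lambda field lands in.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(canSerialize(new ParserLike()));
    }
}
```

Marking such fields `transient` (or making the interface extend Serializable) is the usual way out of this failure mode.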
[jira] [Commented] (TIKA-774) ExifTool Parser
[ https://issues.apache.org/jira/browse/TIKA-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209162#comment-15209162 ]

Ray Gauss II commented on TIKA-774:
-----------------------------------

bq. we should add a static check for whether exiftool is available and adjust "handled" mimes at that point.

I think we'll find other areas to improve on as well; I just wanted to get the ball rolling again on the contribution and review, as we had to close the source on the stand-alone project mentioned above.

bq. I should have a chance to look more closely early next week, but I doubt there's reason to wait for my feedback.

We'd value your feedback, and it's been over 4 years, we can wait a few more weeks. :)

bq. Is this a replacement for the one I hacked together?

There's the possibility for the two to coexist, perhaps requiring this parser to be explicitly called programmatically. At a high level the biggest differences are:
# As mentioned in TIKA-1639, there's an extensive mapping from ExifTool's namespace to proper Tika properties (currently done programmatically)
# It includes the ability to embed, i.e. writing metadata back into binary files. (TIKA-776)

> ExifTool Parser
> ---------------
>
> Key: TIKA-774
> URL: https://issues.apache.org/jira/browse/TIKA-774
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Affects Versions: 1.0
> Environment: Requires ExifTool to be installed (http://www.sno.phy.queensu.ca/~phil/exiftool/)
> Reporter: Ray Gauss II
> Labels: features, new-parser, newbie, patch
> Fix For: 1.13
>
> Attachments: testJPEG_IPTC_EXT.jpg, tika-core-exiftool-parser-patch.txt, tika-parsers-exiftool-parser-patch.txt
>
> Adds an external parser that calls ExifTool to extract extended metadata fields from images and other content types.
> In the core project:
> An ExifTool interface is added which contains Property objects that define the metadata fields available.
> An additional Property constructor for internalTextBag type.
> In the parsers project: > An ExiftoolMetadataExtractor is added which does the work of calling ExifTool > on the command line and mapping the response to tika metadata fields. This > extractor could be called instead of or in addition to the existing > ImageMetadataExtractor and JempboxExtractor under TiffParser and/or > JpegParser but those have not been changed at this time. > An ExiftoolParser is added which calls only the ExiftoolMetadataExtractor. > An ExiftoolTikaMapper is added which is responsible for mapping the ExifTool > metadata fields to existing tika and Drew Noakes metadata fields if enabled. > An ElementRdfBagMetadataHandler is added for extracting multi-valued RDF Bag > implementations in XML files. > An ExifToolParserTest is added which tests several expected XMP and IPTC > metadata values in testJPEG_IPTC_EXT.jpg. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
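[Editor's note] The core of such an external parser is a command-line invocation plus output mapping. Below is a minimal sketch, assuming exiftool is installed on the PATH and using its `-S` (short tag name) output format; the class and method names are illustrative, not those of the attached patches.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ExiftoolSketch {
    // Turn exiftool -S output ("TagName: value" per line) into a map.
    public static Map<String, String> parseOutput(List<String> lines) {
        Map<String, String> fields = new HashMap<>();
        for (String line : lines) {
            int i = line.indexOf(": ");
            if (i > 0) {
                fields.put(line.substring(0, i), line.substring(i + 2));
            }
        }
        return fields;
    }

    // Invoke exiftool on a file; assumes the binary is on the PATH.
    public static Map<String, String> extract(String path) throws Exception {
        Process p = new ProcessBuilder("exiftool", "-S", path)
                .redirectErrorStream(true)
                .start();
        List<String> lines = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                lines.add(line);
            }
        }
        p.waitFor();
        return parseOutput(lines);
    }
}
```

The bulk of the contribution described above lies in the step this sketch omits: mapping raw tag names like `ImageWidth` onto typed Tika Property keys.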
[jira] [Updated] (TIKA-1906) ExternalParser No Longer Supports Commands in Array Format
[ https://issues.apache.org/jira/browse/TIKA-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Gauss II updated TIKA-1906: --- Fix Version/s: 1.13 2.0 > ExternalParser No Longer Supports Commands in Array Format > -- > > Key: TIKA-1906 > URL: https://issues.apache.org/jira/browse/TIKA-1906 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.9 > Reporter: Ray Gauss II > Assignee: Ray Gauss II > Fix For: 2.0, 1.13 > > > After the changes in TIKA-1638 the ExternalParser now ignores commands > specified as a string array and assumes commands will be in a single string > with a space delimiter. > Both formats should be supported. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
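[Editor's note] The fix the issue asks for amounts to normalizing the two command representations before execution. A sketch of the idea (the class and method names are hypothetical, not Tika's actual ExternalParser API):

```java
public class CommandFormats {
    // Accept a command either as a pre-tokenized array or as a single
    // space-delimited string, and always hand back the array form.
    public static String[] toCommandArray(Object command) {
        if (command instanceof String[]) {
            return (String[]) command;            // already tokenized
        }
        return ((String) command).split(" ");     // single-string form
    }
}
```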
[jira] [Resolved] (TIKA-1906) ExternalParser No Longer Supports Commands in Array Format
[ https://issues.apache.org/jira/browse/TIKA-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Gauss II resolved TIKA-1906. Resolution: Fixed > ExternalParser No Longer Supports Commands in Array Format > -- > > Key: TIKA-1906 > URL: https://issues.apache.org/jira/browse/TIKA-1906 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.9 > Reporter: Ray Gauss II > Assignee: Ray Gauss II > Fix For: 2.0, 1.13 > > > After the changes in TIKA-1638 the ExternalParser now ignores commands > specified as a string array and assumes commands will be in a single string > with a space delimiter. > Both formats should be supported. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1906) ExternalParser No Longer Supports Commands in Array Format
[ https://issues.apache.org/jira/browse/TIKA-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15206138#comment-15206138 ] Ray Gauss II edited comment on TIKA-1906 at 3/22/16 2:37 PM: - bq. agreed, sorry must have missed that as I thought I fixed it for both per TIKA-1638. No worries. I guess I'll leave this open until the tika-2.x build is happy again. was (Author: rgauss): bq. agreed, sorry must have missed that as I thought I fixed it for both per TIKA-1638. No worries. I guess I'll leave this open until the tika-2.x is happy again. > ExternalParser No Longer Supports Commands in Array Format > -- > > Key: TIKA-1906 > URL: https://issues.apache.org/jira/browse/TIKA-1906 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.9 > Reporter: Ray Gauss II > Assignee: Ray Gauss II > > After the changes in TIKA-1638 the ExternalParser now ignores commands > specified as a string array and assumes commands will be in a single string > with a space delimiter. > Both formats should be supported. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1906) ExternalParser No Longer Supports Commands in Array Format
[ https://issues.apache.org/jira/browse/TIKA-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15206138#comment-15206138 ] Ray Gauss II commented on TIKA-1906: bq. agreed, sorry must have missed that as I thought I fixed it for both per TIKA-1638. No worries. I guess I'll leave this open until the tika-2.x is happy again. > ExternalParser No Longer Supports Commands in Array Format > -- > > Key: TIKA-1906 > URL: https://issues.apache.org/jira/browse/TIKA-1906 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.9 > Reporter: Ray Gauss II > Assignee: Ray Gauss II > > After the changes in TIKA-1638 the ExternalParser now ignores commands > specified as a string array and assumes commands will be in a single string > with a space delimiter. > Both formats should be supported. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15196030#comment-15196030 ] Ray Gauss II commented on TIKA-1607: bq. It might be more easily configurable to use the ParsingEmbeddedDocExtractor as is and let users write their own XMP parsers, no? Yes, and we could do that in addition to the above, but if I'm understanding correctly that alone would still force users to write 'Tika-based' XMP parsers rather than allowing them access to the RAW XMP encoded bytes you're referring to in the last sentence, which I do agree might be helpful in some cases. So the idea for the second part would be to get the user those bytes in a way that hopefully doesn't require sweeping changes to the parsers (I'm thinking of this with an eye towards all types of embedded resources, not just XMP). The {{EmbeddedDocumentExtractor}} interface's {{parseEmbedded}} method currently takes a {{Metadata}} object which is only associated with the embedded resource (not the same metadata object associated with the 'container' file) and is populated with the embedded resource's filename, type, size, etc. Option 1. We might be able to do something like: {code} /** * Extension of {@link EmbeddedDocumentExtractor} which stores the embedded * resources during parsing for retrieval. */ public interface StoringEmbeddedDocumentExtractor extends EmbeddedDocumentExtractor { /** * Gets the map of known embedded resources or null if no resources * were stored during parsing * * @return the embedded resources */ Map<Metadata, byte[]> getEmbeddedResources(); } {code} then modify ParsingEmbeddedDocumentExtractor to implement it with an option which 'turns it on'? Option 2. Provide a separate implementation of StoringEmbeddedDocumentExtractor that users could set in the context? Option 3. Just pull {{FileEmbeddedDocumentExtractor}} out of {{TikaCLI}} and make them use temp files? Option 4. 
Maybe the effort is better spent on said sweeping parser changes to include some {{EmbeddedResources}} object to be optionally populated along with the {{Metadata}} in the {{Parser.parse}} method? Other options? Maybe they don't need the RAW XMP? I'm also aware that we've strayed a bit from the original issue here of structured metadata. Should we create a separate issue? > Introduce new arbitrary object key/values data structure for persistence of > Tika Metadata > - > > Key: TIKA-1607 > URL: https://issues.apache.org/jira/browse/TIKA-1607 > Project: Tika > Issue Type: Improvement > Components: core, metadata >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Critical > Fix For: 1.13 > > Attachments: TIKA-1607_bytes_dom_values.patch, > TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, > TIKA-1607v3.patch > > > I am currently working implementing more comprehensive extraction and > enhancement of the Tika support for Phone number extraction and metadata > modeling. > Right now we utilize the String[] multivalued support available within Tika > to persist phone numbers as > {code} > Metadata: String: String[] > Metadata: phonenumbers: number1, number2, number3, ... > {code} > I would like to propose we extend multi-valued support outside of the > String[] paradigm by implementing a more abstract Collection of Objects such > that we could consider and implement the phone number use case as follows > {code} > Metadata: String: Object > {code} > Where Object could be a Collection<HashMap HashMap> e.g. > {code} > Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), > (LibPN-NumberType: International), (etc: etc)...), (+1292611054: > LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) > (etc)] > {code} > There are obvious backwards compatibility issues with this approach... > additionally it is a fundamental change to the code Metadata API. 
I hope that > the <String, Object> Mapping however is flexible enough to allow me to model > Tika Metadata the way I want. > Any comments folks? Thanks > Lewis -- This message was sent by Atlassian JIRA (v6.3.4#6332)
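[Editor's note] Option 1/2 in the comment above can be sketched with plain collections. Here Tika's Metadata type is reduced to a Map<String, String> stand-in so the sketch is self-contained; the real parseEmbedded signature also takes a ContentHandler, a ParseContext, and an outputHtml flag.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class StoringExtractorSketch {
    // Embedded resources buffered in memory, keyed by their metadata.
    private final Map<Map<String, String>, byte[]> resources = new LinkedHashMap<>();

    // Buffer one embedded resource; an advanced user retrieves the
    // raw bytes afterwards instead of having them parsed or written
    // out to temp files.
    public void parseEmbedded(InputStream stream, Map<String, String> metadata) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = stream.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            resources.put(metadata, buf.toByteArray());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Mirrors getEmbeddedResources() from the proposed interface.
    public Map<Map<String, String>, byte[]> getEmbeddedResources() {
        return resources;
    }
}
```

This also makes the memory risk noted in the thread concrete: every embedded resource stays on the heap until the caller drops the extractor.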
[jira] [Comment Edited] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15193845#comment-15193845 ] Ray Gauss II edited comment on TIKA-1607 at 3/15/16 1:57 PM: - Have we already considered treating the XMP packets more like embedded resources and making it easier for the advanced users described above to get at those resources, perhaps providing an {{EmbeddedDocumentExtractor}} implementation they could use without resorting to extracting them to files? was (Author: rgauss): Have we already considered treating the XMP packets more like embedded resources and making it easier for the advanced users described above to get at those resources, perhaps providing an {{EmbeddedResourceHandler}} implementation they could use without resorting to extracting them to files? > Introduce new arbitrary object key/values data structure for persistence of > Tika Metadata > - > > Key: TIKA-1607 > URL: https://issues.apache.org/jira/browse/TIKA-1607 > Project: Tika > Issue Type: Improvement > Components: core, metadata >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Critical > Fix For: 1.13 > > Attachments: TIKA-1607_bytes_dom_values.patch, > TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, > TIKA-1607v3.patch > > > I am currently working implementing more comprehensive extraction and > enhancement of the Tika support for Phone number extraction and metadata > modeling. > Right now we utilize the String[] multivalued support available within Tika > to persist phone numbers as > {code} > Metadata: String: String[] > Metadata: phonenumbers: number1, number2, number3, ... 
> {code} > I would like to propose we extend multi-valued support outside of the > String[] paradigm by implementing a more abstract Collection of Objects such > that we could consider and implement the phone number use case as follows > {code} > Metadata: String: Object > {code} > Where Object could be a Collection<HashMap HashMap> e.g. > {code} > Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), > (LibPN-NumberType: International), (etc: etc)...), (+1292611054: > LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) > (etc)] > {code} > There are obvious backwards compatibility issues with this approach... > additionally it is a fundamental change to the code Metadata API. I hope that > the <String, Object> Mapping however is flexible enough to allow me to model > Tika Metadata the way I want. > Any comments folks? Thanks > Lewis -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15195326#comment-15195326 ] Ray Gauss II commented on TIKA-1607: Sorry, I meant {{EmbeddedDocumentExtractor}} (edited comment). We can currently dump stuff to files in some parsers with the {{--extract}} CLI option which sticks a {{FileEmbeddedDocumentExtractor}} in the context. The current default for PDF is the {{ParsingEmbeddedDocumentExtractor}}. Perhaps we could add an option to ParsingEmbeddedDocumentExtractor which, when enabled, would also save the embedded resources in memory for an advanced user to do whatever they need, knowing the risk and resources required for that option? Or provide some other in-memory implementation that advanced users could explicitly set in the context? > Introduce new arbitrary object key/values data structure for persistence of > Tika Metadata > - > > Key: TIKA-1607 > URL: https://issues.apache.org/jira/browse/TIKA-1607 > Project: Tika > Issue Type: Improvement > Components: core, metadata >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Critical > Fix For: 1.13 > > Attachments: TIKA-1607_bytes_dom_values.patch, > TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, > TIKA-1607v3.patch > > > I am currently working implementing more comprehensive extraction and > enhancement of the Tika support for Phone number extraction and metadata > modeling. > Right now we utilize the String[] multivalued support available within Tika > to persist phone numbers as > {code} > Metadata: String: String[] > Metadata: phonenumbers: number1, number2, number3, ... 
> {code} > I would like to propose we extend multi-valued support outside of the > String[] paradigm by implementing a more abstract Collection of Objects such > that we could consider and implement the phone number use case as follows > {code} > Metadata: String: Object > {code} > Where Object could be a Collection<HashMap HashMap> e.g. > {code} > Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), > (LibPN-NumberType: International), (etc: etc)...), (+1292611054: > LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) > (etc)] > {code} > There are obvious backwards compatibility issues with this approach... > additionally it is a fundamental change to the code Metadata API. I hope that > the <String, Object> Mapping however is flexible enough to allow me to model > Tika Metadata the way I want. > Any comments folks? Thanks > Lewis -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15193845#comment-15193845 ] Ray Gauss II commented on TIKA-1607: Have we already considered treating the XMP packets more like embedded resources and making it easier for the advanced users described above to get at those resources, perhaps providing an {{EmbeddedResourceHandler}} implementation they could use without resorting to extracting them to files? > Introduce new arbitrary object key/values data structure for persistence of > Tika Metadata > - > > Key: TIKA-1607 > URL: https://issues.apache.org/jira/browse/TIKA-1607 > Project: Tika > Issue Type: Improvement > Components: core, metadata >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Critical > Fix For: 1.13 > > Attachments: TIKA-1607_bytes_dom_values.patch, > TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, > TIKA-1607v3.patch > > > I am currently working implementing more comprehensive extraction and > enhancement of the Tika support for Phone number extraction and metadata > modeling. > Right now we utilize the String[] multivalued support available within Tika > to persist phone numbers as > {code} > Metadata: String: String[] > Metadata: phonenumbers: number1, number2, number3, ... > {code} > I would like to propose we extend multi-valued support outside of the > String[] paradigm by implementing a more abstract Collection of Objects such > that we could consider and implement the phone number use case as follows > {code} > Metadata: String: Object > {code} > Where Object could be a Collection<HashMap HashMap> e.g. > {code} > Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), > (LibPN-NumberType: International), (etc: etc)...), (+1292611054: > LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) > (etc)] > {code} > There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the code Metadata API. I hope that > the <String, Object> Mapping however is flexible enough to allow me to model > Tika Metadata the way I want. > Any comments folks? Thanks > Lewis -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15167135#comment-15167135 ] Ray Gauss II commented on TIKA-1607: I know there can be multiple XMP packets in a single file, but do we have many other examples where we'd need multiple DOMs associated with a single file? I'm trying to understand if the metadata is really the right place for this. > Introduce new arbitrary object key/values data structure for persistence of > Tika Metadata > - > > Key: TIKA-1607 > URL: https://issues.apache.org/jira/browse/TIKA-1607 > Project: Tika > Issue Type: Improvement > Components: core, metadata >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Critical > Fix For: 1.13 > > Attachments: TIKA-1607_bytes_dom_values.patch, > TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, > TIKA-1607v3.patch > > > I am currently working implementing more comprehensive extraction and > enhancement of the Tika support for Phone number extraction and metadata > modeling. > Right now we utilize the String[] multivalued support available within Tika > to persist phone numbers as > {code} > Metadata: String: String[] > Metadata: phonenumbers: number1, number2, number3, ... > {code} > I would like to propose we extend multi-valued support outside of the > String[] paradigm by implementing a more abstract Collection of Objects such > that we could consider and implement the phone number use case as follows > {code} > Metadata: String: Object > {code} > Where Object could be a Collection<HashMap HashMap> e.g. > {code} > Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), > (LibPN-NumberType: International), (etc: etc)...), (+1292611054: > LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) > (etc)] > {code} > There are obvious backwards compatibility issues with this approach... > additionally it is a fundamental change to the code Metadata API. 
I hope that > the <String, Object> Mapping however is flexible enough to allow me to model > Tika Metadata the way I want. > Any comments folks? Thanks > Lewis -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154205#comment-15154205 ] Ray Gauss II commented on TIKA-1607: In my experience people gravitate towards 'other' buckets, i.e.: "I didn't know (bother to read) what the designated ones were so I just used 'other'". {{getBytes}} feels like 'other'. While people could still do really stupid things with {{getDOM}} if they wanted to, {{getBytes}} seems to encourage a developer to go ahead and try to use each frame of a 120fps 8K video as a 'metadata' value. An extreme and unlikely example of course, but you get the gist. > Introduce new arbitrary object key/values data structure for persistence of > Tika Metadata > - > > Key: TIKA-1607 > URL: https://issues.apache.org/jira/browse/TIKA-1607 > Project: Tika > Issue Type: Improvement > Components: core, metadata >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Critical > Fix For: 1.13 > > Attachments: TIKA-1607_bytes_dom_values.patch, > TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, > TIKA-1607v3.patch > > > I am currently working implementing more comprehensive extraction and > enhancement of the Tika support for Phone number extraction and metadata > modeling. > Right now we utilize the String[] multivalued support available within Tika > to persist phone numbers as > {code} > Metadata: String: String[] > Metadata: phonenumbers: number1, number2, number3, ... > {code} > I would like to propose we extend multi-valued support outside of the > String[] paradigm by implementing a more abstract Collection of Objects such > that we could consider and implement the phone number use case as follows > {code} > Metadata: String: Object > {code} > Where Object could be a Collection<HashMap HashMap> e.g. 
> {code} > Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), > (LibPN-NumberType: International), (etc: etc)...), (+1292611054: > LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) > (etc)] > {code} > There are obvious backwards compatibility issues with this approach... > additionally it is a fundamental change to the code Metadata API. I hope that > the <String, Object> Mapping however is flexible enough to allow me to model > Tika Metadata the way I want. > Any comments folks? Thanks > Lewis -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149231#comment-15149231 ] Ray Gauss II commented on TIKA-1607: Are we opening a can of worms by encouraging the use of a byte array directly with no restrictions on length, etc.? > Introduce new arbitrary object key/values data structure for persistence of > Tika Metadata > - > > Key: TIKA-1607 > URL: https://issues.apache.org/jira/browse/TIKA-1607 > Project: Tika > Issue Type: Improvement > Components: core, metadata >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Critical > Fix For: 1.13 > > Attachments: TIKA-1607_bytes_dom_values.patch, > TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, > TIKA-1607v3.patch > > > I am currently working implementing more comprehensive extraction and > enhancement of the Tika support for Phone number extraction and metadata > modeling. > Right now we utilize the String[] multivalued support available within Tika > to persist phone numbers as > {code} > Metadata: String: String[] > Metadata: phonenumbers: number1, number2, number3, ... > {code} > I would like to propose we extend multi-valued support outside of the > String[] paradigm by implementing a more abstract Collection of Objects such > that we could consider and implement the phone number use case as follows > {code} > Metadata: String: Object > {code} > Where Object could be a Collection<HashMap HashMap> e.g. > {code} > Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), > (LibPN-NumberType: International), (etc: etc)...), (+1292611054: > LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) > (etc)] > {code} > There are obvious backwards compatibility issues with this approach... > additionally it is a fundamental change to the code Metadata API. I hope that > the <String, Object> Mapping however is flexible enough to allow me to model > Tika Metadata the way I want. 
> Any comments folks? Thanks > Lewis -- This message was sent by Atlassian JIRA (v6.3.4#6332)
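[Editor's note] The String -> Object shape proposed in the quoted description maps naturally onto plain Java collections. A sketch of the phone-number example (the LibPN-* key names are taken from the example above; this is an illustration of the proposed shape, not the patch's API):

```java
import java.util.List;
import java.util.Map;

public class StructuredMetadataSketch {
    // Each metadata key maps to an arbitrary Object; here the value is
    // a list of per-number property bags rather than a flat String[].
    public static Map<String, Object> example() {
        return Map.of(
            "phonenumbers", List.of(
                Map.of("number", "+162648743476",
                       "LibPN-CountryCode", "US",
                       "LibPN-NumberType", "International"),
                Map.of("number", "+1292611054",
                       "LibPN-CountryCode", "UK",
                       "LibPN-NumberType", "International")));
    }
}
```

The backwards-compatibility concern in the description is visible even here: callers expecting `String[]` values would need a cast or a new accessor once values become arbitrary Objects.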
[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules
[ https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130386#comment-15130386 ]

Ray Gauss II commented on TIKA-1824:
------------------------------------

bq. Thank you, Bob Paulin! Again, this is fantastic.

Indeed, thanks!

bq. Perhaps add "parser(s?)" to the artifactId, e.g. tika-parser-cad-module

Now that the change is in there it seems a bit redundant to have both "parser" and "module" in every artifact ID. {{tika-parser-*}} follows a least-to-most-specific precedence, so perhaps we could just remove "module"?

I had some concerns over the apparent duplication of dependencies / versions, but it looks like that will be addressed in TIKA-1847.

> Tika 2.0 - Create Initial Parser Modules
> ----------------------------------------
>
> Key: TIKA-1824
> URL: https://issues.apache.org/jira/browse/TIKA-1824
> Project: Tika
> Issue Type: Improvement
> Affects Versions: 2.0
> Reporter: Bob Paulin
> Assignee: Bob Paulin
>
> Create initial break down of parser modules.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14746719#comment-14746719 ] Ray Gauss II commented on TIKA-1607: Hi [~talli...@mitre.org], apologies for the delay in responding here.
1. POJOs bq. We might have better documentation of POJOs and compile-time guarantees about methods and typed values. Agreed, but the DOM persistence doesn't preclude us from also using Java 'helper' classes that know how to more easily get and set values for particular schemas that we'd like to focus on. bq. Schemas/xsds can enforce plenty, I know, but would we want to build an xsd and maintain it? I'd vote for sticking as true to a specification's original schema as possible when there is one, but whether we'd want to build and maintain one for those that don't is a good question.
2. Passthrough bq. why couldn't we literally pass that through via the String version of the xml? I think we could, but we'd first have to 'merge' with the metadata being modeled by the parsers and could then allow access to the full DOM {{Document}} object, which clients could easily serialize to a string if need be.
3. Serialization to JSON There seem to be several libraries available that can help with XML to JSON, though I don't think this would belong in core.
4. Multilingual fields Great question. XMP uses RDF and xml:lang:
{noformat}
<rdf:Alt>
  <rdf:li xml:lang="x-default">quick brown fox</rdf:li>
  <rdf:li xml:lang="it">rapido fox marrone</rdf:li>
</rdf:Alt>
{noformat}
that's one possibility. bq. I'm wondering if we want to add structure only where structured data doesn't exist within the document and let the client parse what they'd like out of structured metadata that is in the document? This also relates to passthrough above, but one thing to keep in mind is that the metadata we're parsing could be coming from several different parts of the binary. For example, EXIF doesn't necessarily also live in XMP (though most apps also write it there these days) and there can be more than one XMP packet present in a file. 
It would be nice to bring these different sources into a unified persistence structure, even if for simpler metadata everything lives at the top level. bq. how do we transfer as much normalized/structured metadata as possible in as simple a way to the end user. This also gets back to passthrough and the possibility of access to the full DOM {{Document}} object. Thanks for keeping the discussion going. We obviously need to take great care in changing such a fundamental area of the code. > Introduce new arbitrary object key/values data structure for persistence of > Tika Metadata > - > > Key: TIKA-1607 > URL: https://issues.apache.org/jira/browse/TIKA-1607 > Project: Tika > Issue Type: Improvement > Components: core, metadata >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Critical > Fix For: 1.11 > > Attachments: TIKA-1607v1_rough_rough.patch, > TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
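On the passthrough point, handing clients a DOM {{Document}} that they could serialize to a string needs nothing beyond the JDK; a minimal sketch (the class and helper names here are invented for illustration):

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class DomToString {
    // Serialize a DOM Document to an XML string with the JDK's Transformer.
    public static String serialize(Document doc) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Convenience: an empty Document to build on.
    public static Document newDoc() {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = newDoc();
        doc.appendChild(doc.createElementNS("http://tika.apache.org/", "tika:metadata"));
        System.out.println(serialize(doc));
    }
}
```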
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706706#comment-14706706 ] Ray Gauss II commented on TIKA-1607: Yes, by shoehorn I meant that the index is embedded in the key (in this case the sub-group name) and that all parsers and consuming client apps must know to utilize that syntax, rather than either a separate, explicit index field or a well-defined structure like that of the DOM approach. Perhaps we should flesh out a solid requirements list (possibly using the [comment above|https://issues.apache.org/jira/browse/TIKA-1607?focusedCommentId=14660441&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14660441] as a starting point). Introduce new arbitrary object key/values data structure for persistence of Tika Metadata - Key: TIKA-1607 URL: https://issues.apache.org/jira/browse/TIKA-1607 Project: Tika Issue Type: Improvement Components: core, metadata Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Critical Fix For: 1.11 Attachments: TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704880#comment-14704880 ] Ray Gauss II commented on TIKA-1607: I did see that, but I was after full URI namespaces, i.e. {{http://purl.org/dc/elements/1.1/}}, not just prefixes. The OODT approach looks like you'd have to shoehorn the index into the group name, much like the tika-ffmpeg workaround, rather than a more strictly defined structure. OODT might support deeper structures in the inner {{Group}} class, but the public methods appear to only support a single level? For example, how could one get to something like the value of the city of the 3rd contact's 2nd address, i.e. p1:contact[2]/p1:address[1]/p1:city? We could mimic XPath syntax, but the DOM approach allows us to use {{javax.xml.xpath.XPath}} processing. From the [test mentioned above|https://github.com/rgauss/tika/blob/trunk/tika-core/src/test/java/org/apache/tika/metadata/TestMetadata.java#L394]:
{code:java}
String expression = "/tika:metadata/vcard:tel[1]/vcard:uri";
assertEquals(telUri, metadata.getValueByXPath(expression));
{code}
The DOM approach would also allow us to leverage things like attributes to further describe a particular metadata value in the future if need be. We might also be able to pass through entire metadata structures that Tika hasn't explicitly modeled. It's certainly a larger change, but I think it gives us a lot more options. 
Introduce new arbitrary object key/values data structure for persistence of Tika Metadata - Key: TIKA-1607 URL: https://issues.apache.org/jira/browse/TIKA-1607 Project: Tika Issue Type: Improvement Components: core, metadata Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Critical Fix For: 1.11 Attachments: TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
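A self-contained sketch of the kind of {{javax.xml.xpath.XPath}} processing mentioned above, with prefixes resolved to full URI namespaces via a {{NamespaceContext}} (the sample document and class name here are invented; the fork's {{getValueByXPath}} presumably wraps similar machinery):

```java
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathLookup {
    // Sample document standing in for a DOM-backed metadata store.
    public static final String SAMPLE =
            "<tika:metadata xmlns:tika=\"http://tika.apache.org/\">"
            + "<vcard:tel xmlns:vcard=\"urn:ietf:params:xml:ns:vcard-4.0\">"
            + "<vcard:uri>tel:+1-800-555-1234</vcard:uri></vcard:tel>"
            + "</tika:metadata>";

    // Evaluate an XPath expression with tika/vcard prefixes bound to full URIs.
    public static String lookup(String xml, String expression) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // required for prefix-aware XPath
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    if ("tika".equals(prefix)) return "http://tika.apache.org/";
                    if ("vcard".equals(prefix)) return "urn:ietf:params:xml:ns:vcard-4.0";
                    return XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String namespaceURI) { return null; }
                public Iterator<String> getPrefixes(String namespaceURI) { return null; }
            });
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(lookup(SAMPLE, "/tika:metadata/vcard:tel[1]/vcard:uri"));
    }
}
```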
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703924#comment-14703924 ] Ray Gauss II commented on TIKA-1607: I've put together the start of the DOM metadata store option on [GitHub as well|https://github.com/apache/tika/compare/trunk...rgauss:trunk]. The crux of the change is using a {{org.w3c.dom.Document}} object instead of a {{Map<String, String[]>}} as the metadata store, and Property objects based on {{QName}}s instead of Strings. A few things to note:
* This does bring in commons-lang for XML escaping; we could change that if need be
* It seems mostly backwards compatible. tika-xmp is failing at the moment, but I think it's just a matter of applying the same techniques there
* String-based accessors weren't deprecated, but could be if targeting Tika 2.0
* There are several TODOs that would still need to be addressed
The [test added|https://github.com/rgauss/tika/blob/trunk/tika-core/src/test/java/org/apache/tika/metadata/TestMetadata.java#L394] demonstrates creating a DOM structure, adding it to the metadata, then pulling it out both programmatically and via XPath expression (sticking to the telephone number example). That programmatic creation of the DOM structure is a bit cumbersome, and we could certainly employ Java classes specific to each standard as a convenience (somewhat similar to [~talli...@mitre.org]'s proposal), but I do like the generic nature of the DOM store. The {{toString}} method of the metadata object after building that example is properly structured and namespaced XML:
{code:xml}
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<tika:metadata xmlns:tika="http://tika.apache.org/">
  <vcard:tel xmlns:vcard="urn:ietf:params:xml:ns:vcard-4.0">
    <vcard:parameters>
      <vcard:type>
        <vcard:text>work</vcard:text>
      </vcard:type>
    </vcard:parameters>
    <vcard:uri>tel:+1-800-555-1234</vcard:uri>
  </vcard:tel>
</tika:metadata>
{code}
There's obviously lots of room for improvement and discussion, but I wanted to put it out there before the momentum on this slows. Introduce new arbitrary object key/values data structure for persistence of Tika Metadata - Key: TIKA-1607 URL: https://issues.apache.org/jira/browse/TIKA-1607 Project: Tika Issue Type: Improvement Components: core, metadata Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Critical Fix For: 1.11 Attachments: TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
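The programmatic DOM creation described above can be illustrated with plain {{org.w3c.dom}} calls building the same vcard telephone structure (JDK only; none of the fork's wrapper methods are reproduced here):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// JDK-only sketch of building the vcard telephone example as a DOM tree.
public class BuildVcardTel {
    static final String TIKA_NS = "http://tika.apache.org/";
    static final String VCARD_NS = "urn:ietf:params:xml:ns:vcard-4.0";

    public static Document build() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElementNS(TIKA_NS, "tika:metadata");
            doc.appendChild(root);

            Element tel = doc.createElementNS(VCARD_NS, "vcard:tel");
            root.appendChild(tel);

            // parameters/type/text carries the "work" classification
            Element parameters = doc.createElementNS(VCARD_NS, "vcard:parameters");
            tel.appendChild(parameters);
            Element type = doc.createElementNS(VCARD_NS, "vcard:type");
            parameters.appendChild(type);
            Element text = doc.createElementNS(VCARD_NS, "vcard:text");
            text.setTextContent("work");
            type.appendChild(text);

            Element uri = doc.createElementNS(VCARD_NS, "vcard:uri");
            uri.setTextContent("tel:+1-800-555-1234");
            tel.appendChild(uri);
            return doc;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = build();
        System.out.println(doc.getElementsByTagNameNS(VCARD_NS, "uri")
                .item(0).getTextContent());
    }
}
```

Every element needs its namespace repeated, which is exactly the verbosity a per-standard helper class would hide.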
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704108#comment-14704108 ] Ray Gauss II commented on TIKA-1607: [~chrismattmann], I did. It seemed more similar to the XPath-like workaround I described, with the notion of groups in the store, rather than the full-fledged DOM store proposed in the GitHub fork, i.e. I didn't see where anything was namespaced. Introduce new arbitrary object key/values data structure for persistence of Tika Metadata - Key: TIKA-1607 URL: https://issues.apache.org/jira/browse/TIKA-1607 Project: Tika Issue Type: Improvement Components: core, metadata Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Critical Fix For: 1.11 Attachments: TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660441#comment-14660441 ] Ray Gauss II commented on TIKA-1607: To clarify, the work mentioned above that uses an XPath-like syntax is only a workaround for mapping structured metadata into the current 'flat' metadata model in Tika. I fully support moving towards a structured metadata store in a 2.0 timeframe (maybe that's now?). This is simply restating some of what's already been said, but there are many aspects to consider during that refactoring:
* Moving towards properly namespacing metadata (even if, for now, our serialization of it only contains a prefix)
* Backwards compatibility for simple string key/values
* Enabling easy serialization to XML and JSON
* Enabling easy discovery of at least top level elements
* Lightweight dependencies in tika-core
* Possible representation of binary data
* Not re-inventing the wheel
Given the above, perhaps we'd want to consider using Java DOM ({{org.w3c.dom.*}}) classes programmatically as a metadata store, appending and getting child nodes, etc., rather than hard coding POJOs for each metadata standard we want to support. I'll try to find some time to put together an example patch for that approach in the next few days. Introduce new arbitrary object key/values data structure for persistence of Tika Metadata - Key: TIKA-1607 URL: https://issues.apache.org/jira/browse/TIKA-1607 Project: Tika Issue Type: Improvement Components: core, metadata Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Critical Fix For: 1.10 Attachments: TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
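On the easy discovery of top-level elements, a DOM store makes that a plain child-node walk; a minimal JDK-only sketch (class and method names here are invented for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class TopLevelNames {
    // Collect the qualified names of the root element's immediate children.
    public static List<String> topLevelNames(Document doc) {
        List<String> names = new ArrayList<>();
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                names.add(child.getNodeName());
            }
        }
        return names;
    }

    // Small sample store: two top-level, namespaced entries.
    public static Document sampleDoc() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElementNS("http://tika.apache.org/", "tika:metadata");
            doc.appendChild(root);
            root.appendChild(doc.createElementNS(
                    "http://purl.org/dc/elements/1.1/", "dc:title"));
            root.appendChild(doc.createElementNS(
                    "urn:ietf:params:xml:ns:vcard-4.0", "vcard:tel"));
            return doc;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(topLevelNames(sampleDoc()));  // [dc:title, vcard:tel]
    }
}
```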
[jira] [Commented] (TIKA-1607) Introduce new HashMap<String, Object> data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505054#comment-14505054 ] Ray Gauss II commented on TIKA-1607: We've had a few discussions on structured metadata over the years, some of which was captured in the [MetadataRoadmap Wiki page|http://wiki.apache.org/tika/MetadataRoadmap]. I'd agree that we should strive to maintain backwards compatibility for simple values. I think we should also consider serialization of the metadata store, not just in the {{Serializable}} interface sense, but perhaps being able to easily marshal the entire metadata store into JSON and XML. As [~gagravarr] points out, work has been done to express structured metadata via the existing metadata store. In that email thread you'll find reference to the external [tika-ffmpeg project|https://github.com/AlfrescoLabs/tika-ffmpeg]. Introduce new HashMap<String, Object> data structure for persistence of Tika Metadata - Key: TIKA-1607 URL: https://issues.apache.org/jira/browse/TIKA-1607 Project: Tika Issue Type: Improvement Components: core, metadata Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Critical Fix For: 1.9 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1594) Webp parsing support
[ https://issues.apache.org/jira/browse/TIKA-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484463#comment-14484463 ] Ray Gauss II commented on TIKA-1594: I'd recommend that for now we trim, since {{Metadata.IMAGE_*}} properties are defined as {{Property.internalInteger}}. In the future I think we should consider changing to (or perhaps adding) more generally useful dimension properties, like {{Dimensions}} from the [additional properties of XMP|http://www.adobe.com/content/dam/Adobe/en/devnet/xmp/pdfs/XMPSpecificationPart2.pdf] (section 1.2.2.2), which includes a {{unit}} field. Webp parsing support Key: TIKA-1594 URL: https://issues.apache.org/jira/browse/TIKA-1594 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.7 Reporter: Jan Kronquist webp content type is correctly detected, but parsing is not supported. I noticed that metadata-extractor 2.8.0 supports webp: https://github.com/drewnoakes/metadata-extractor/issues/85 However, Tika currently does not work with this version (I tried manually overriding the dependency). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
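The trim-for-now suggestion amounts to stripping a unit suffix before storing into an integer-typed property; a hedged sketch with an invented helper (not actual parser code):

```java
public class DimensionValues {
    // Hypothetical helper: reduce a dimension string such as "512 pixels"
    // to the leading integer so it fits an integer-typed metadata property.
    public static int toPixels(String raw) {
        String leading = raw.trim().split("\\s+")[0];
        return Integer.parseInt(leading);
    }

    public static void main(String[] args) {
        System.out.println(toPixels("512 pixels"));  // 512
    }
}
```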
[jira] [Commented] (TIKA-634) Command Line Parser for Metadata Extraction
[ https://issues.apache.org/jira/browse/TIKA-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342547#comment-14342547 ] Ray Gauss II commented on TIKA-634: --- Also see the [tika-ffmpeg project|https://github.com/AlfrescoLabs/tika-ffmpeg]. There we recently had to patch {{ExternalParser}} for some stream parsing concurrency problems, which should be raised in a separate issue here shortly. Command Line Parser for Metadata Extraction --- Key: TIKA-634 URL: https://issues.apache.org/jira/browse/TIKA-634 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 0.9 Reporter: Nick Burch Assignee: Nick Burch Priority: Minor As discussed on the mailing list: http://mail-archives.apache.org/mod_mbox/tika-dev/201104.mbox/%3calpine.deb.2.00.1104052028380.29...@urchin.earth.li%3E This issue is to track improvements in the ExternalParser support to handle metadata extraction, and probably easier configuration of an external parser too. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1510) FFMpeg installed but not parsing video files
[ https://issues.apache.org/jira/browse/TIKA-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273520#comment-14273520 ] Ray Gauss II commented on TIKA-1510: Yes. The only reason I haven't myself is that I've been trying to find some time to refactor the vorbis stuff per the previous [conversation|http://mail-archives.apache.org/mod_mbox/tika-dev/201408.mbox/%3calpine.deb.2.02.1408221155450.8...@urchin.earth.li%3E] with [~gagravarr]. FFMpeg installed but not parsing video files Key: TIKA-1510 URL: https://issues.apache.org/jira/browse/TIKA-1510 Project: Tika Issue Type: Bug Components: parser Environment: FFMPEG, Mac OS X 10.9 with HomeBrew Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.7 I have FFMPEG installed with homebrew:
{noformat}
# brew install ffmpeg
{noformat}
I've got some AVI files and have tried to parse them with Tika:
{noformat}
[chipotle:~/Desktop/drone-vids] mattmann% tika -m SPOT11_01\ 17.AVI
Content-Length: 334917340
Content-Type: video/x-msvideo
X-Parsed-By: org.apache.tika.parser.EmptyParser
resourceName: SPOT11_01 17.AVI
{noformat}
I took a look at the ExternalParser, which is configured for using ffmpeg if it's installed. It seems it only works on:
{code:xml}
<mime-types>
  <mime-type>video/avi</mime-type>
  <mime-type>video/mpeg</mime-type>
</mime-types>
{code}
I'll add video/x-msvideo and see if that fixes it. I also stumbled upon the work by [~rgauss] at Github - Ray I noticed there is no parser in that work: https://github.com/AlfrescoLabs/tika-ffmpeg But there seems to be metadata extraction code, etc. Ray should I do something with this? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1510) FFMpeg installed but not parsing video files
[ https://issues.apache.org/jira/browse/TIKA-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273049#comment-14273049 ] Ray Gauss II commented on TIKA-1510: In that project there is a [{{TikaIntrinsicAVFfmpegParserFactory}}|https://github.com/AlfrescoLabs/tika-ffmpeg/blob/master/src/main/java/org/apache/tika/parser/ffmpeg/TikaIntrinsicAVFfmpegParserFactory.java] which is used to set up an {{ExternalParser}}. See the [{{TikaIntrinsicAVFfmpegParserTest}}|https://github.com/AlfrescoLabs/tika-ffmpeg/blob/master/src/test/java/org/apache/tika/parser/ffmpeg/TikaIntrinsicAVFfmpegParserTest.java] for an example of its use. FFMpeg installed but not parsing video files Key: TIKA-1510 URL: https://issues.apache.org/jira/browse/TIKA-1510 Project: Tika Issue Type: Bug Components: parser Environment: FFMPEG, Mac OS X 10.9 with HomeBrew Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.7 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-93) OCR support
[ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134822#comment-14134822 ] Ray Gauss II commented on TIKA-93: -- You could use [{{org.junit.Assume}}|http://stackoverflow.com/questions/1689242/conditionally-ignoring-tests-in-junit-4] so the tests will be skipped rather than reported as passing. Perhaps we should consider the Maven Failsafe Plugin as well? OCR support --- Key: TIKA-93 URL: https://issues.apache.org/jira/browse/TIKA-93 Project: Tika Issue Type: New Feature Components: parser Reporter: Jukka Zitting Assignee: Chris A. Mattmann Priority: Minor Fix For: 1.7 Attachments: Petr_tika-config.xml, TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch, TesseractOCR_Tyler.patch, TesseractOCR_Tyler_v2.patch, TesseractOCR_Tyler_v3.patch, testOCR.docx, testOCR.pdf, testOCR.pptx I don't know of any decent open source pure Java OCR libraries, but there are command line OCR tools like Tesseract (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to extract text content (where available) from image files. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
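The availability check such an {{Assume}} guard would key off can be done with plain {{ProcessBuilder}}; a sketch with an invented helper name (a JUnit test would then call {{Assume.assumeTrue(isAvailable("tesseract"))}} so the test is skipped, not passed, when the tool is missing):

```java
import java.io.IOException;

public class CommandCheck {
    // Hypothetical helper: returns true if the command can be started.
    // Intended for use behind a JUnit Assume.assumeTrue(...) guard.
    public static boolean isAvailable(String command) {
        try {
            Process p = new ProcessBuilder(command).start();
            p.destroy();
            return true;
        } catch (IOException e) {
            return false;  // command not found (or not executable)
        }
    }

    public static void main(String[] args) {
        System.out.println(isAvailable("definitely-not-a-real-command-xyz"));  // false
    }
}
```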
[jira] [Commented] (TIKA-93) OCR support
[ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102175#comment-14102175 ] Ray Gauss II commented on TIKA-93: -- Can you create a config object and pass that in the {{ParseContext}}, similar to what [{{PDFParser}}|https://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java] does with a [{{PDFParserConfig}}|https://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java] entry?
{code}
// config from context, or default if not set via context
PDFParserConfig localConfig = context.get(PDFParserConfig.class, defaultConfig);
{code}
OCR support --- Key: TIKA-93 URL: https://issues.apache.org/jira/browse/TIKA-93 Project: Tika Issue Type: New Feature Components: parser Reporter: Jukka Zitting Assignee: Chris A. Mattmann Priority: Minor Fix For: 1.7 -- This message was sent by Atlassian JIRA (v6.2#6252)
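The pattern being suggested, a class-keyed context consulted with a default fallback, can be sketched without any Tika dependency; {{SimpleContext}} is an invented stand-in mirroring how {{ParseContext.get(Class, T)}} is used above:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of a type-safe, class-keyed context in the style of ParseContext.
public class SimpleContext {
    private final Map<String, Object> entries = new HashMap<>();

    public <T> void set(Class<T> key, T value) {
        entries.put(key.getName(), value);
    }

    // Return the stored instance, or the given default when unset.
    public <T> T get(Class<T> key, T defaultValue) {
        Object value = entries.get(key.getName());
        return value != null ? key.cast(value) : defaultValue;
    }

    public static void main(String[] args) {
        SimpleContext context = new SimpleContext();
        Integer fromDefault = context.get(Integer.class, 42);
        context.set(Integer.class, 7);
        Integer fromContext = context.get(Integer.class, 42);
        System.out.println(fromDefault + " " + fromContext);  // 42 7
    }
}
```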
[jira] [Commented] (TIKA-93) OCR support
[ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102193#comment-14102193 ] Ray Gauss II commented on TIKA-93: -- Apologies, I jumped in late and only glanced at the comment thread. OCR support --- Key: TIKA-93 URL: https://issues.apache.org/jira/browse/TIKA-93 Project: Tika Issue Type: New Feature Components: parser Reporter: Jukka Zitting Assignee: Chris A. Mattmann Priority: Minor Fix For: 1.7 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1328) Translate Metadata and Content
[ https://issues.apache.org/jira/browse/TIKA-1328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14026783#comment-14026783 ] Ray Gauss II commented on TIKA-1328: Leaning towards the whitelist approach, perhaps we could add an {{isTranslatable}} field / method and corresponding constructor to the {{Property}} class (with a default of false) and update the properties we want to support translation on? Translate Metadata and Content -- Key: TIKA-1328 URL: https://issues.apache.org/jira/browse/TIKA-1328 Project: Tika Issue Type: New Feature Reporter: Tyler Palsulich Fix For: 1.7 Right now, Translation is only done on Strings. Ideally, users would be able to turn on translation while parsing. I can think of a couple options: - Make a TranslateAutoDetectParser. Automatically detect the file type, parse it, then translate the content. - Make a Context switch. When true, translate the content regardless of the parser used. I'm not sure the best way to go about this method, but I prefer it over another Parser. Regardless, we need a black or white list for translation. I think black list would be the way to go -- which fields should not be translated (dates, versions, ...) Any ideas? Also, somewhat unrelated, does anyone know of any other open source translation libraries? If we were really lucky, it wouldn't depend on an online service. -- This message was sent by Atlassian JIRA (v6.2#6252)
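The whitelist idea in the comment above can be sketched with a cut-down mock of the {{Property}} class: a translatable flag defaults to false, and only whitelisted properties opt in via an extra factory method. This is illustrative only, not Tika's real {{Property}} API.

```java
// Cut-down mock of the proposed isTranslatable flag on Property:
// translation is off by default; whitelisted properties opt in via a
// second factory. Illustrative only, not Tika's real Property class.
public class TranslatableProperty {
    private final String name;
    private final boolean translatable;

    private TranslatableProperty(String name, boolean translatable) {
        this.name = name;
        this.translatable = translatable;
    }

    // Existing-style factory: translation stays off by default.
    public static TranslatableProperty internalText(String name) {
        return new TranslatableProperty(name, false);
    }

    // New factory for properties whitelisted for translation.
    public static TranslatableProperty internalText(String name, boolean translatable) {
        return new TranslatableProperty(name, translatable);
    }

    public String getName() { return name; }
    public boolean isTranslatable() { return translatable; }
}
```

A translating decorator could then check {{isTranslatable()}} per property instead of consulting a separate blacklist of field names.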
[jira] [Commented] (TIKA-1320) extract text from jpeg in solr tika
[ https://issues.apache.org/jira/browse/TIKA-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14017613#comment-14017613 ] Ray Gauss II commented on TIKA-1320: I'm not sure we have enough context in the description of this issue to help much here. As [~thaichat04] points out, OCR is one way of obtaining text from an image, but there are also several forms of embedded metadata that can be extracted. Is there specific text you're looking to extract? extract text from jpeg in solr tika --- Key: TIKA-1320 URL: https://issues.apache.org/jira/browse/TIKA-1320 Project: Tika Issue Type: New Feature Reporter: muruganv Labels: features Original Estimate: 24h Remaining Estimate: 24h How to extract text from jpeg or image format or tiff in solr tika -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1294) Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs
[ https://issues.apache.org/jira/browse/TIKA-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012393#comment-14012393 ] Ray Gauss II commented on TIKA-1294: Hi [~talli...@apache.org], The changes look good, thanks! One minor point on conventions: I think enums are typically uppercase? Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs --- Key: TIKA-1294 URL: https://issues.apache.org/jira/browse/TIKA-1294 Project: Tika Issue Type: Improvement Reporter: Tim Allison Assignee: Tim Allison Priority: Trivial Fix For: 1.6 Attachments: TIKA-1294.patch, TIKA-1294v1.patch TIKA-1268 added the capability to extract embedded images as regular embedded resources...a great feature! However, for some use cases, it might not be desirable to extract those types of embedded resources. I see two ways of allowing the client to choose whether or not to extract those images: 1) set a value in the metadata for the extracted images that identifies them as embedded PDXObjectImages vs regular image attachments. The client can then choose not to process embedded resources with a given metadata value. 2) allow the client to set a parameter in the PDFConfig object. My initial proposal is to go with option 2, and I'll attach a patch shortly. -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: [DISCUSS] Centralizing JSON handling of Metadata
Hi Tim, 1) Sounds good to me. 2) I do think we want core as lean as possible, so my vote would be for a separate project/module, similar to what was done with tika-xmp. Perhaps something like tika-serialization-json to indicate other formats may follow the same precedent? 3) Similar to above, perhaps org.apache.tika.metadata.serialization.json? Just curious, any particular reason for GSON over Jackson? Regards, Ray On May 28, 2014 at 1:32:41 PM, Allison, Timothy B. (talli...@mitre.org) wrote: All, Nick recommended I put the question to the dev list for discussion. It might be useful to centralize our json handling of Metadata. We are currently using different libraries and doing different things in CLI and in tika-server. 1) Do we want to centralize json handling of Metadata? 2) If so, where? Core? I share Nick's hesitance to add a dependency to core. OTOH, GSON is only 186k, but this would add potential for jar conflicts with folks integrating Tika, and it doesn't feel like a core function to me...it is a handy decorator for applications. 3) Wherever it goes, what package do we want to put it in? I like Nick's recommendations, with a slight preference for the second (oat.utils.json). Thank you! 

Best, Tim -Original Message- From: Nick Burch (JIRA) [mailto:j...@apache.org] Sent: Wednesday, May 28, 2014 12:41 PM To: dev@tika.apache.org Subject: [jira] [Commented] (TIKA-1311) Centralize JSON handling of Metadata [ https://issues.apache.org/jira/browse/TIKA-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011287#comment-14011287 ] Nick Burch commented on TIKA-1311: -- If we put it into core, we'd need to add another dependency (to GSON) which isn't ideal, so we might want to run the plan past the dev list first to see what people think (core tends to try to have a very minimal set of deps, unlike the other modules) Package wise, org.apache.tika.metadata.json is what I'd lean towards, otherwise utils.json Centralize JSON handling of Metadata Key: TIKA-1311 URL: https://issues.apache.org/jira/browse/TIKA-1311 Project: Tika Issue Type: Task Reporter: Tim Allison Priority: Minor When json was initially added to TIKA CLI (TIKA-213), there was a recommendation to centralize JSON handling of Metadata, potentially putting it in core. On a recent bug fix (TIKA-1291), the same recommendation was repeated especially noting that we now handle JSON/Metadata differently in CLI and server. Let's centralize JSON handling in core and use GSON. We should add a serializer and a deserializer so that users don't have to reinvent that wheel. -- This message was sent by Atlassian JIRA (v6.2#6252)
RE: [DISCUSS] Centralizing JSON handling of Metadata
I’ve used Jackson a bit but I don’t have a strong preference either. I’m generally a fan of splitting things up into very small projects to keep the dependency hierarchy as clean as possible. In this example, if we decided to do a direct serialization to, say, a Mongo DBObject in the future the json project wouldn’t need to bring in Mongo dependencies. Apache Camel does a good job of segmenting things [1]. However, that sort of modularization is probably a broader discussion than what we need for this particular issue, so between those two I’d vote for tika-serialization. Regards, Ray [1] https://git-wip-us.apache.org/repos/asf?p=camel.git;a=tree;f=components;h=1132bd1bb98a446aec97d5c7bc4d032276a65d83;hb=HEAD On May 28, 2014 at 8:42:03 PM, Allison, Timothy B. (talli...@mitre.org) wrote: Thank you, Ray! In almost reverse order, I've been using Jackson for this already, but I used GSON in TIKA-1291 because that's what CLI was already using. In GSON's favor, the jar is a bit smaller, but I have no real preference or reason to pick one over the other. I'm not a json-blackbelt (or, I guess that would be blckbelt), so I'm happy to go with either. A new compilation unit makes sense. I'm wondering if we want to be that specific? tika-serialization? Or, maybe just tika-utils? Package name looks good to me. Thanks, again! Best, Tim -Original Message- From: Ray Gauss II [mailto:ray.ga...@alfresco.com] Sent: Wednesday, May 28, 2014 3:07 PM To: dev@tika.apache.org; Allison, Timothy B. Subject: Re: [DISCUSS] Centralizing JSON handling of Metadata Hi Tim, 1) Sounds good to me. 2) I do think we want core as lean as possible, so my vote would be for a separate project/module, similar to what was done with tika-xmp. Perhaps something like tika-serialization-json to indicate other formats may follow in the same precedence? 3) Similar to above, perhaps org.apache.tika.metadata.serialization.json? Just curious, any particular reason for GSON over Jackson? 
Regards, Ray On May 28, 2014 at 1:32:41 PM, Allison, Timothy B. (talli...@mitre.org) wrote: All, Nick recommended I put the question to the dev list for discussion. It might be useful to centralize our json handling of Metadata. We are now currently using different libraries and doing different things in CLI and in tika-server. 1) Do we want to centralize json handling of Metadata? 2) If so, where? Core? I share Nick's hesitance to add a dependency to core. OTOH, GSON is only 186k, but this would add potential for jar conflicts with folks integrating Tika, and it doesn't feel like a core function to me...it is a handy decorator for applications. 3) Wherever it goes, what package do we want to put it in? I like Nick's recommendations, with a slight preference for the second (oat.utils.json). Thank you! Best, Tim -Original Message- From: Nick Burch (JIRA) [mailto:j...@apache.org] Sent: Wednesday, May 28, 2014 12:41 PM To: dev@tika.apache.org Subject: [jira] [Commented] (TIKA-1311) Centralize JSON handling of Metadata [ https://issues.apache.org/jira/browse/TIKA-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011287#comment-14011287 ] Nick Burch commented on TIKA-1311: -- If we put it into core, we'd need to add another dependency (to GSON) which isn't ideal, so we might want to run the plan past the dev list first to see what people think (core tends to try to have a very minimal set of deps, unlike the other modules) Package wise, org.apache.tika.metadata.json is what I'd lean towards, otherwise utils.json Centralize JSON handling of Metadata Key: TIKA-1311 URL: https://issues.apache.org/jira/browse/TIKA-1311 Project: Tika Issue Type: Task Reporter: Tim Allison Priority: Minor When json was initially added to TIKA CLI (TIKA-213), there was a recommendation to centralize JSON handling of Metadata, potentially putting it in core. 
On a recent bug fix (TIKA-1291), the same recommendation was repeated especially noting that we now handle JSON/Metadata differently in CLI and server. Let's centralize JSON handling in core and use GSON. We should add a serializer and a deserializer so that users don't have to reinvent that wheel. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1278) Expose PDF Avg Char and Spacing Tolerance Config Params
[ https://issues.apache.org/jira/browse/TIKA-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995298#comment-13995298 ] Ray Gauss II commented on TIKA-1278: Hi [~tallison], I thought about adding to {{PDFParser.properties}} but decided against it since PDFBox could change the default values or change the properties' scale or use, and if we weren't aware of that change we'd be inadvertently overriding those defaults. Similarly with {{PDFParserConfig.configure}}, PDFBox's defaults seem to work well for most people. We can certainly reconsider setting those defaults and/or adding other config if there are particular parameters people would find useful. Expose PDF Avg Char and Spacing Tolerance Config Params --- Key: TIKA-1278 URL: https://issues.apache.org/jira/browse/TIKA-1278 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.5 Reporter: Ray Gauss II Assignee: Ray Gauss II Fix For: 1.6 {{PDFParserConfig}} should allow for override of PDFBox's {{averageCharTolerance}} and {{spacingTolerance}} settings as noted by a TODO comment in {{PDF2XHTML}}. Additionally, {{PDF2XHTML}}'s use of {{PDFParserConfig}} should be changed slightly to allow for extension of that config class and its configuration behavior. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1295) Make some Dublin Core items multi-valued
[ https://issues.apache.org/jira/browse/TIKA-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995945#comment-13995945 ] Ray Gauss II commented on TIKA-1295: +1 for the data model more accurately reflecting the standard and for multilingual fields, but with a simple text bag how would you know which value corresponds to which language? I think this is another example that highlights the need for a more structured underlying metadata store as mentioned in section IV of the [metadata roadmap|http://wiki.apache.org/tika/MetadataRoadmap]. Make some Dublin Core items multi-valued Key: TIKA-1295 URL: https://issues.apache.org/jira/browse/TIKA-1295 Project: Tika Issue Type: Bug Reporter: Tim Allison Assignee: Tim Allison Priority: Minor Fix For: 1.6 According to: http://www.pdfa.org/2011/08/pdfa-metadata-xmp-rdf-dublin-core, dc:title, dc:description and dc:rights should allow multiple values because of language alternatives. Unless anyone objects in the next few days, I'll switch those to Property.toInternalTextBag() from Property.toInternalText(). I'll also modify PDFParser to extract dc:rights. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1294) Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs
[ https://issues.apache.org/jira/browse/TIKA-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995474#comment-13995474 ] Ray Gauss II commented on TIKA-1294: We ran into this exact issue recently and there is another method to achieve the same result without changing Tika code. In {{ParsingEmbeddedDocumentExtractor.shouldParseEmbedded}} the {{ParseContext}} is checked for a {{DocumentSelector}}. Since that extractor seems to be the only place that type is checked for (perhaps {{EmbeddedDocumentSelector}} would be a more appropriate name?) you can create one that suits your needs and set it as the document selector value in the {{ParseContext}}. In our case we created a simple {{MediaTypeDisablingDocumentSelector}} that holds a list of {{disabledMediaTypes}}. See [{{TikaGUI}}|http://svn.apache.org/repos/asf/tika/trunk/tika-app/src/main/java/org/apache/tika/gui/TikaGUI.java] and its {{ImageDocumentSelector}} as a general example of document selector use. Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs --- Key: TIKA-1294 URL: https://issues.apache.org/jira/browse/TIKA-1294 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Trivial Attachments: TIKA-1294.patch TIKA-1268 added the capability to extract embedded images as regular embedded resources...a great feature! However, for some use cases, it might not be desirable to extract those types of embedded resources. I see two ways of allowing the client to choose whether or not to extract those images: 1) set a value in the metadata for the extracted images that identifies them as embedded PDXObjectImages vs regular image attachments. The client can then choose not to process embedded resources with a given metadata value. 2) allow the client to set a parameter in the PDFConfig object. My initial proposal is to go with option 2, and I'll attach a patch shortly. -- This message was sent by Atlassian JIRA (v6.2#6252)
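A {{MediaTypeDisablingDocumentSelector}} along the lines described above might look like the sketch below. Tika's real {{DocumentSelector}} receives a {{Metadata}} object; here a plain {{Map}} stands in for it, and the {{Content-Type}} key name is an assumption for illustration.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of a MediaTypeDisablingDocumentSelector as described above.
// Tika's DocumentSelector takes a Metadata object; a Map stands in here,
// and "Content-Type" is the assumed key holding the media type.
public class MediaTypeDisablingSelector {
    private final Set<String> disabledMediaTypes = new HashSet<String>();

    public void disableMediaType(String mediaType) {
        disabledMediaTypes.add(mediaType);
    }

    // Mirrors DocumentSelector.select: false means "do not parse this one".
    public boolean select(Map<String, String> metadata) {
        String mediaType = metadata.get("Content-Type");
        return mediaType == null || !disabledMediaTypes.contains(mediaType);
    }
}
```

Setting such a selector in the {{ParseContext}} then disables the unwanted embedded types without any change to Tika itself, which is the point of the comment.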
[jira] [Commented] (TIKA-1294) Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs
[ https://issues.apache.org/jira/browse/TIKA-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13997500#comment-13997500 ] Ray Gauss II commented on TIKA-1294: I saw similar problematic resource consumption as well, which was the reason for figuring out how to disable this stuff :) Perhaps a generic indication of why this embedded object is being parsed would be useful to have in the metadata object passed to the {{EmbeddedDocumentExtractor}}, something like an {{EmbeddedObjectContext}} enum with {{INLINE}} and {{ATTACHMENT}} options, which the {{EmbeddedDocumentExtractor}} (and in most cases that means the {{DocumentSelector}}) could use to determine whether to parse on a per-object basis? Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs --- Key: TIKA-1294 URL: https://issues.apache.org/jira/browse/TIKA-1294 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Trivial Attachments: TIKA-1294.patch TIKA-1268 added the capability to extract embedded images as regular embedded resources...a great feature! However, for some use cases, it might not be desirable to extract those types of embedded resources. I see two ways of allowing the client to choose whether or not to extract those images: 1) set a value in the metadata for the extracted images that identifies them as embedded PDXObjectImages vs regular image attachments. The client can then choose not to process embedded resources with a given metadata value. 2) allow the client to set a parameter in the PDFConfig object. My initial proposal is to go with option 2, and I'll attach a patch shortly. -- This message was sent by Atlassian JIRA (v6.2#6252)
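The proposed enum and its use in a per-object decision could be sketched as below. The enum name and values follow the comment's suggestion; everything else, including the {{shouldParse}} helper, is hypothetical.

```java
// Sketch of the proposed EmbeddedObjectContext enum: the extractor would
// record why an embedded object is being parsed, and a selector could then
// skip inline images while keeping true attachments. The shouldParse helper
// is a hypothetical stand-in for a DocumentSelector decision.
public class EmbeddedContextDemo {
    public enum EmbeddedObjectContext { INLINE, ATTACHMENT }

    public static boolean shouldParse(EmbeddedObjectContext context,
                                      boolean skipInlineImages) {
        if (context == EmbeddedObjectContext.INLINE && skipInlineImages) {
            return false;
        }
        return true;
    }
}
```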
[jira] [Commented] (TIKA-1295) Make some Dublin Core items multi-valued
[ https://issues.apache.org/jira/browse/TIKA-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13997478#comment-13997478 ] Ray Gauss II commented on TIKA-1295: bq. I see that there is an ALT PropertyType. Are there any plans to implement that (or did I miss the implementation somewhere) Not sure. On first glance I don't see it anywhere, nor any use of {{ValueType.LOCALE}}. I think we'd need a design discussion on how best to implement multilingual properties, likely through some suffixing of property keys if we don't change the underlying metadata structure, or perhaps that discussion has already taken place? Make some Dublin Core items multi-valued Key: TIKA-1295 URL: https://issues.apache.org/jira/browse/TIKA-1295 Project: Tika Issue Type: Bug Reporter: Tim Allison Assignee: Tim Allison Priority: Minor Fix For: 1.6 According to: http://www.pdfa.org/2011/08/pdfa-metadata-xmp-rdf-dublin-core, dc:title, dc:description and dc:rights should allow multiple values because of language alternatives. Unless anyone objects in the next few days, I'll switch those to Property.toInternalTextBag() from Property.toInternalText(). I'll also modify PDFParser to extract dc:rights. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1294) Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs
[ https://issues.apache.org/jira/browse/TIKA-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995960#comment-13995960 ] Ray Gauss II commented on TIKA-1294: bq. Can your MediaTypeDisablingDocumentSelector tell the difference between a jpeg that was attached to a PDF (basic attachment) and one that was derived from a PDXObjectImage? If by basic attachment you mean those defined in {{PDEmbeddedFilesNameTreeNode}}, then not exactly. Both {{PDF2XHTML.extractImages}} and {{PDF2XHTML.extractEmbeddedDocuments}} end up using the same {{getEmbeddedDocumentExtractor}} (a {{ParsingEmbeddedDocumentExtractor}} by default) and use the same {{DocumentSelector}} in the calls to {{extractor.shouldParseEmbedded(metadata)}}, but neither sets any special metadata keys indicating 'attached' vs 'embedded' so document selectors aren't able to explicitly distinguish. However, the {{PDXObjectImage}} resources *only* get the media type set in the metadata object while the {{PDEmbeddedFilesNameTreeNode}} resources get media type, name, and length set, so you could potentially check for their presence to distinguish. Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs --- Key: TIKA-1294 URL: https://issues.apache.org/jira/browse/TIKA-1294 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Trivial Attachments: TIKA-1294.patch TIKA-1268 added the capability to extract embedded images as regular embedded resources...a great feature! However, for some use cases, it might not be desirable to extract those types of embedded resources. I see two ways of allowing the client to choose whether or not to extract those images: 1) set a value in the metadata for the extracted images that identifies them as embedded PDXObjectImages vs regular image attachments. The client can then choose not to process embedded resources with a given metadata value. 2) allow the client to set a parameter in the PDFConfig object. 
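The key-presence heuristic from the comment above could be expressed roughly as below. {{Metadata}} is mocked as a {{Map}}, and the key names are illustrative, not necessarily Tika's exact metadata keys.

```java
import java.util.Map;

// Sketch of the heuristic described above: PDXObjectImage resources only
// get a media type set, while PDEmbeddedFilesNameTreeNode attachments also
// get a name and length, so the presence of those extra keys hints at a
// true attachment. Metadata is mocked as a Map; key names are illustrative.
public class EmbeddedKindGuesser {
    public static boolean looksLikeInlineImage(Map<String, String> metadata) {
        return metadata.containsKey("Content-Type")
                && !metadata.containsKey("resourceName")
                && !metadata.containsKey("Content-Length");
    }
}
```

As the comment notes, this is a workaround by inference; an explicit metadata marker would be more robust.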
My initial proposal is to go with option 2, and I'll attach a patch shortly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (TIKA-1278) Expose PDF Avg Char and Spacing Tolerance Config Params
[ https://issues.apache.org/jira/browse/TIKA-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995298#comment-13995298 ] Ray Gauss II edited comment on TIKA-1278 at 5/12/14 5:39 PM: - Hi [~talli...@apache.org], I thought about adding to {{PDFParser.properties}} but decided against it since PDFBox could change the default values or change the properties' scale or use, and if we weren't aware of that change we'd be inadvertently overriding those defaults. Similarly with {{PDFParserConfig.configure}}, PDFBox's defaults seem to work well for most people. We can certainly reconsider setting those defaults and/or adding other config if there are particular parameters people would find useful. was (Author: rgauss): Hi [~tallison], I thought about adding to {{PDFParser.properties}} but decided against it since PDFBox could change the default values or change the properties' scale or use, and if we weren't aware of that change we'd be inadvertently overriding those defaults. Similarly with {{PDFParserConfig.configure}}, PDFBox's defaults seem to work well for most people. We can certainly reconsider setting those defaults and/or adding other config if there are particular parameters people would find useful. Expose PDF Avg Char and Spacing Tolerance Config Params --- Key: TIKA-1278 URL: https://issues.apache.org/jira/browse/TIKA-1278 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.5 Reporter: Ray Gauss II Assignee: Ray Gauss II Fix For: 1.6 {{PDFParserConfig}} should allow for override of PDFBox's {{averageCharTolerance}} and {{spacingTolerance}} settings as noted by a TODO comment in {{PDF2XHTML}}. Additionally, {{PDF2XHTML}}'s use of {{PDFParserConfig}} should be changed slightly to allow for extension of that config class and its configuration behavior. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (TIKA-1278) Expose PDF Avg Char and Spacing Tolerance Config Params
Ray Gauss II created TIKA-1278: -- Summary: Expose PDF Avg Char and Spacing Tolerance Config Params Key: TIKA-1278 URL: https://issues.apache.org/jira/browse/TIKA-1278 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.5 Reporter: Ray Gauss II Assignee: Ray Gauss II Fix For: 1.6 {{PDFParserConfig}} should allow for override of PDFBox's {{averageCharTolerance}} and {{spacingTolerance}} settings as noted by a TODO comment in {{PDF2XHTML}}. Additionally, {{PDF2XHTML}}'s use of {{PDFParserConfig}} should be changed slightly to allow for extension of that config class and its configuration behavior. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (TIKA-1278) Expose PDF Avg Char and Spacing Tolerance Config Params
[ https://issues.apache.org/jira/browse/TIKA-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Gauss II updated TIKA-1278: --- Description: {{PDFParserConfig}} should allow for override of PDFBox's {{averageCharTolerance}} and {{spacingTolerance}} settings as noted by a TODO comment in {{PDF2XHTML}}. Additionally, {{PDF2XHTML}}'s use of {{PDFParserConfig}} should be changed slightly to allow for extension of that config class and its configuration behavior. was: {{PDFParserConfig}} should allow for override of PDFBox's {{averageCharTolerance}} and {{spacingTolerance}} settings as noted by a TODO comment in {{PDF2XHTML}}. Additionally, {{PDF2XHTML}}'s use of {{PDFParserConfig}} should be changed slightly to allow for extension of that config class and it's configuration behavior. Expose PDF Avg Char and Spacing Tolerance Config Params --- Key: TIKA-1278 URL: https://issues.apache.org/jira/browse/TIKA-1278 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.5 Reporter: Ray Gauss II Assignee: Ray Gauss II Fix For: 1.6 {{PDFParserConfig}} should allow for override of PDFBox's {{averageCharTolerance}} and {{spacingTolerance}} settings as noted by a TODO comment in {{PDF2XHTML}}. Additionally, {{PDF2XHTML}}'s use of {{PDFParserConfig}} should be changed slightly to allow for extension of that config class and its configuration behavior. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (TIKA-1278) Expose PDF Avg Char and Spacing Tolerance Config Params
[ https://issues.apache.org/jira/browse/TIKA-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Gauss II resolved TIKA-1278. Resolution: Fixed Resolved in r1589722. Expose PDF Avg Char and Spacing Tolerance Config Params --- Key: TIKA-1278 URL: https://issues.apache.org/jira/browse/TIKA-1278 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.5 Reporter: Ray Gauss II Assignee: Ray Gauss II Fix For: 1.6 {{PDFParserConfig}} should allow for override of PDFBox's {{averageCharTolerance}} and {{spacingTolerance}} settings as noted by a TODO comment in {{PDF2XHTML}}. Additionally, {{PDF2XHTML}}'s use of {{PDFParserConfig}} should be changed slightly to allow for extension of that config class and its configuration behavior. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (TIKA-1278) Expose PDF Avg Char and Spacing Tolerance Config Params
[ https://issues.apache.org/jira/browse/TIKA-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13979700#comment-13979700 ] Ray Gauss II edited comment on TIKA-1278 at 4/24/14 1:31 PM: - Resolved in r1589722. The setting of {{PDF2XHTML}} params was also moved from {{PDF2XHTML.process}} to a new {{PDFParserConfig.configure}} method which should allow developers to extend {{PDFParserConfig}} for custom behavior. was (Author: rgauss): Resolved in r1589722. Expose PDF Avg Char and Spacing Tolerance Config Params --- Key: TIKA-1278 URL: https://issues.apache.org/jira/browse/TIKA-1278 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.5 Reporter: Ray Gauss II Assignee: Ray Gauss II Fix For: 1.6 {{PDFParserConfig}} should allow for override of PDFBox's {{averageCharTolerance}} and {{spacingTolerance}} settings as noted by a TODO comment in {{PDF2XHTML}}. Additionally, {{PDF2XHTML}}'s use of {{PDFParserConfig}} should be changed slightly to allow for extension of that config class and its configuration behavior. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Reopened] (TIKA-1279) Missing return lines at output of SourceCodeParser
[ https://issues.apache.org/jira/browse/TIKA-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Gauss II reopened TIKA-1279: Assignee: Hong-Thai Nguyen [~thaichat04], I believe we still have to support Java 6 and {{System.lineSeparator()}} appears to have been added in Java 7. I think {{System.getProperty("line.separator")}} would be equivalent. Missing return lines at output of SourceCodeParser -- Key: TIKA-1279 URL: https://issues.apache.org/jira/browse/TIKA-1279 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.5 Reporter: Hong-Thai Nguyen Assignee: Hong-Thai Nguyen Priority: Trivial Fix For: 1.6 xhtml output is on a single line. -- This message was sent by Atlassian JIRA (v6.2#6252)
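For reference, a minimal sketch of the Java 6-compatible replacement suggested above (the wrapper class is illustrative; the property call itself is standard Java):

```java
// Java 6-compatible equivalent of Java 7's System.lineSeparator():
// read the "line.separator" system property directly.
public class LineSep {
    public static String lineSeparator() {
        return System.getProperty("line.separator");
    }
}
```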
[jira] [Resolved] (TIKA-1151) Maven Build Should Automatically Produce test-jar Artifacts
[ https://issues.apache.org/jira/browse/TIKA-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Gauss II resolved TIKA-1151. Resolution: Fixed Resolved in r1580887. Maven Build Should Automatically Produce test-jar Artifacts --- Key: TIKA-1151 URL: https://issues.apache.org/jira/browse/TIKA-1151 Project: Tika Issue Type: Improvement Components: packaging Reporter: Ray Gauss II Assignee: Ray Gauss II The Maven build should be updated to produce test jar artifacts for appropriate sub-projects (see below) such that developers can extend test classes by adding the {{test-jar}} artifact as a dependency, i.e.:
{code}
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.6-SNAPSHOT</version>
  <type>test-jar</type>
  <scope>test</scope>
</dependency>
{code}
The following sub-projects contain tests that developers might want to extend and their corresponding {{pom.xml}} should have the [attached tests|http://maven.apache.org/guides/mini/guide-attached-tests.html] added: - tika-app - tika-core - tika-parsers - tika-server - tika-xmp -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (TIKA-1151) Maven Build Should Automatically Produce test-jar Artifacts
[ https://issues.apache.org/jira/browse/TIKA-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Gauss II updated TIKA-1151: --- Fix Version/s: 1.6 Maven Build Should Automatically Produce test-jar Artifacts --- Key: TIKA-1151 URL: https://issues.apache.org/jira/browse/TIKA-1151 Project: Tika Issue Type: Improvement Components: packaging Reporter: Ray Gauss II Assignee: Ray Gauss II Fix For: 1.6 The Maven build should be updated to produce test jar artifacts for appropriate sub-projects (see below) such that developers can extend test classes by adding the {{test-jar}} artifact as a dependency, i.e.:
{code}
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.6-SNAPSHOT</version>
  <type>test-jar</type>
  <scope>test</scope>
</dependency>
{code}
The following sub-projects contain tests that developers might want to extend and their corresponding {{pom.xml}} should have the [attached tests|http://maven.apache.org/guides/mini/guide-attached-tests.html] added: - tika-app - tika-core - tika-parsers - tika-server - tika-xmp -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (TIKA-1151) Maven Build Should Automatically Produce test-jar Artifacts
[ https://issues.apache.org/jira/browse/TIKA-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II updated TIKA-1151:
-------------------------------

Description:
The Maven build should be updated to produce test jar artifacts for the appropriate sub-projects (see below) so that developers can extend test classes by adding the {{test-jar}} artifact as a dependency, i.e.:
{code}
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.5-SNAPSHOT</version>
  <type>test-jar</type>
  <scope>test</scope>
</dependency>
{code}
The following sub-projects contain tests that developers might want to extend, and their corresponding {{pom.xml}} should have the [attached tests|http://maven.apache.org/guides/mini/guide-attached-tests.html] configuration added:
- tika-app
- tika-core
- tika-parsers
- tika-server
- tika-xmp

was: the same description, except that tika-bundle was also listed among the sub-projects.

Maven Build Should Automatically Produce test-jar Artifacts
-----------------------------------------------------------
            Key: TIKA-1151
            URL: https://issues.apache.org/jira/browse/TIKA-1151
        Project: Tika
     Issue Type: Improvement
     Components: packaging
       Reporter: Ray Gauss II
       Assignee: Ray Gauss II

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
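For reference, the producer side of this change is the configuration described in the linked attached-tests guide: each sub-project binds the {{test-jar}} goal of the maven-jar-plugin. A minimal sketch (plugin version omitted):

```xml
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-jar-plugin</artifactId>
      <executions>
        <execution>
          <goals>
            <!-- packages src/test classes into an additional *-tests.jar artifact -->
            <goal>test-jar</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```

With that in place, downstream projects can depend on the tests jar using the {{&lt;type&gt;test-jar&lt;/type&gt;}} dependency shown above.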
[jira] [Updated] (TIKA-1151) Maven Build Should Automatically Produce test-jar Artifacts
[ https://issues.apache.org/jira/browse/TIKA-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II updated TIKA-1151:
-------------------------------

Description:
The Maven build should be updated to produce test jar artifacts for the appropriate sub-projects (see below) so that developers can extend test classes by adding the {{test-jar}} artifact as a dependency, i.e.:
{code}
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.6-SNAPSHOT</version>
  <type>test-jar</type>
  <scope>test</scope>
</dependency>
{code}
The following sub-projects contain tests that developers might want to extend, and their corresponding {{pom.xml}} should have the [attached tests|http://maven.apache.org/guides/mini/guide-attached-tests.html] configuration added:
- tika-app
- tika-core
- tika-parsers
- tika-server
- tika-xmp

was: the same description, but with version 1.5-SNAPSHOT in the dependency snippet.

Maven Build Should Automatically Produce test-jar Artifacts
-----------------------------------------------------------
            Key: TIKA-1151
            URL: https://issues.apache.org/jira/browse/TIKA-1151
        Project: Tika
     Issue Type: Improvement
     Components: packaging
       Reporter: Ray Gauss II
       Assignee: Ray Gauss II
[jira] [Commented] (TIKA-1151) Maven Build Should Automatically Produce test-jar Artifacts
[ https://issues.apache.org/jira/browse/TIKA-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907100#comment-13907100 ]

Ray Gauss II commented on TIKA-1151:
------------------------------------

This will create a few artifacts on the larger side, notably:

||Artifact||Size||
|tika-parsers-1.6-SNAPSHOT-tests.jar|33MB|
|tika-server-1.6-SNAPSHOT-tests.jar|6.8MB|

Not huge, but I thought I'd double-check that no one has any issues with that before committing.

Maven Build Should Automatically Produce test-jar Artifacts
-----------------------------------------------------------
Key: TIKA-1151
URL: https://issues.apache.org/jira/browse/TIKA-1151
Re: Extract thumbnail from openxml office files
Hi Hong-Thai,

It’s certainly worth investigating. Several other formats can have embedded thumbnails as well, so we could implement a generic thumbnail property. We could probably store it as something like a Base64-encoded string, but we’d likely want to place limits on the size, and we may need a thumbnail internet media type field as well to assist in decoding.

Unless others feel differently, I would say open a JIRA where we can start discussing the design of such a feature.

Thanks!
Ray

On January 8, 2014 at 5:36:32 AM, Hong-Thai Nguyen (hong-thai.ngu...@polyspot.com) wrote:

Hi all,

I want to extract the thumbnail image included in Open XML office files. Apparently, we can do it with openxml4j: http://openxmldeveloper.org/blog/b/openxmldeveloper/archive/2006/11/21/openxmlandjava.aspx

The question is: should we integrate the thumbnail into the default metadata list of the OOXML parsing result?

Thanks
Hong-Thai
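As a rough sketch of the storage idea discussed above — a Base64 string with a size cap. The class name, the helper method, and the 64 KB limit are all illustrative assumptions for this thread, not existing Tika properties or API:

```java
import java.util.Base64;

// Hypothetical sketch: store an extracted thumbnail as a Base64 string,
// enforcing a size limit so large thumbnails don't bloat the metadata.
// A companion field would record the thumbnail's media type for decoding.
public class ThumbnailMetadataSketch {
    static final int MAX_THUMBNAIL_BYTES = 64 * 1024; // illustrative cap

    /** Returns the Base64 encoding of the thumbnail, or null if absent/too large. */
    public static String encodeThumbnail(byte[] thumbnailBytes) {
        if (thumbnailBytes == null || thumbnailBytes.length > MAX_THUMBNAIL_BYTES) {
            return null; // reject rather than store an oversized value
        }
        return Base64.getEncoder().encodeToString(thumbnailBytes);
    }
}
```

The consumer would Base64-decode the string and interpret the bytes according to the stored media type.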
[jira] [Assigned] (TIKA-1177) Add Matroska (mkv, mka) format detection
[ https://issues.apache.org/jira/browse/TIKA-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II reassigned TIKA-1177:
----------------------------------

Assignee: Ray Gauss II

Add Matroska (mkv, mka) format detection
----------------------------------------
             Key: TIKA-1177
             URL: https://issues.apache.org/jira/browse/TIKA-1177
         Project: Tika
      Issue Type: Improvement
      Components: mime
Affects Versions: 1.4
        Reporter: Boris Naguet
        Assignee: Ray Gauss II
        Priority: Minor

There's no mimetype detection for the Matroska format, although it's a popular video format. Here is some code I added to my custom mimetypes to detect them:
{code}
<mime-type type="video/x-matroska">
  <glob pattern="*.mkv"/>
  <magic priority="40">
    <match value="0x1A45DFA3934282886d6174726f736b61" type="string" offset="0"/>
  </magic>
</mime-type>
<mime-type type="audio/x-matroska">
  <glob pattern="*.mka"/>
</mime-type>
{code}
I found the signature for mkv on http://www.garykessler.net/library/file_sigs.html. I was not able to find it clearly for mka, but detection by filename is still useful. The full spec is available here: http://matroska.org/technical/specs/index.html. Maybe it's a bit more complex than this constant magic, but it works on my test files.

--
This message was sent by Atlassian JIRA
(v6.1#6144)
[jira] [Resolved] (TIKA-1177) Add Matroska (mkv, mka) format detection
[ https://issues.apache.org/jira/browse/TIKA-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II resolved TIKA-1177.
--------------------------------

   Resolution: Fixed
Fix Version/s: 1.5

Unfortunately that magic doesn't seem to be required in all MKV files. I tried several utilities to convert various sources to MKV, and none of the results contained that magic. A magic value of {{0x1A45DFA3}} is present, but that's also present in WebM, which is extended from Matroska.

I've added the Matroska mime-types based on just the extension for now and also added the WebM mime-type. We can open other issues, linked to this one, for data detection of MKV and WebM files if need be.

Resolved in r1529260.

Add Matroska (mkv, mka) format detection
----------------------------------------
Key: TIKA-1177
URL: https://issues.apache.org/jira/browse/TIKA-1177
Fix For: 1.5
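As a sketch of the constraint described above: only the four-byte EBML header {{0x1A45DFA3}} is reliably present at the start of Matroska files, and WebM shares it, so a byte-prefix check alone cannot distinguish the two (the DocType string inside the EBML header would be needed for that). Class and method names below are illustrative, not Tika API:

```java
import java.util.Arrays;

// Illustrative check for the 4-byte EBML header shared by MKV and WebM.
// Matching this prefix tells you "some EBML container", not which one —
// which is why extension-based detection was used for MKV vs WebM here.
public class EbmlMagicSketch {
    private static final byte[] EBML_MAGIC =
            {(byte) 0x1A, (byte) 0x45, (byte) 0xDF, (byte) 0xA3};

    /** True if the buffer starts with the EBML header (MKV, MKA, WebM, ...). */
    public static boolean hasEbmlHeader(byte[] prefix) {
        return prefix != null
                && prefix.length >= 4
                && Arrays.equals(Arrays.copyOf(prefix, 4), EBML_MAGIC);
    }
}
```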
[jira] [Resolved] (TIKA-1179) A corrupt mp3 file can cause an infinite loop in Mp3Parser
[ https://issues.apache.org/jira/browse/TIKA-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II resolved TIKA-1179.
--------------------------------

Resolution: Cannot Reproduce
  Assignee: Ray Gauss II

I've just confirmed the described behavior in Tika 1.4; however, it appears the file is parsed just fine in 1.5! You can verify by downloading a 1.5 snapshot of {{tika-app}} ([current link|https://repository.apache.org/content/groups/snapshots/org/apache/tika/tika-app/1.5-SNAPSHOT/tika-app-1.5-20130927.201341-30.jar]), running the app, i.e.:
{code}
java -jar tika-app-1.5-20130927.201341-30.jar
{code}
and dropping {{corrupt.mp3}} onto the app window.

A corrupt mp3 file can cause an infinite loop in Mp3Parser
----------------------------------------------------------
             Key: TIKA-1179
             URL: https://issues.apache.org/jira/browse/TIKA-1179
         Project: Tika
      Issue Type: Bug
      Components: parser
Affects Versions: 1.4
        Reporter: Marius Dumitru Florea
        Assignee: Ray Gauss II
         Fix For: 1.5
     Attachments: corrupt.mp3

I have a thread that indexes (among other things) files using Apache Solr. This thread hangs (still running but making no progress) when trying to extract metadata from the mp3 file attached to this issue.

Here are a couple of thread dumps taken at various moments:
{noformat}
XWiki Solr index thread daemon prio=10 tid=0x03b72800 nid=0x64b5 runnable [0x7f46f4617000]
   java.lang.Thread.State: RUNNABLE
    at org.apache.commons.io.input.AutoCloseInputStream.close(AutoCloseInputStream.java:63)
    at org.apache.commons.io.input.AutoCloseInputStream.afterRead(AutoCloseInputStream.java:77)
    at org.apache.commons.io.input.ProxyInputStream.read(ProxyInputStream.java:99)
    at java.io.BufferedInputStream.fill(Unknown Source)
    at java.io.BufferedInputStream.read1(Unknown Source)
    at java.io.BufferedInputStream.read(Unknown Source)
    - locked <0xcb7094e8> (a java.io.BufferedInputStream)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
    at java.io.FilterInputStream.read(Unknown Source)
    at org.apache.tika.io.TailStream.read(TailStream.java:117)
    at org.apache.tika.io.TailStream.skip(TailStream.java:140)
    at org.apache.tika.parser.mp3.MpegStream.skipStream(MpegStream.java:283)
    at org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:160)
    at org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:193)
    at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.Tika.parseToString(Tika.java:380)
    ...
{noformat}
{noformat}
XWiki Solr index thread daemon prio=10 tid=0x03b72800 nid=0x64b5 runnable [0x7f46f4618000]
   java.lang.Thread.State: RUNNABLE
    at org.apache.tika.io.TailStream.skip(TailStream.java:133)
    at org.apache.tika.parser.mp3.MpegStream.skipStream(MpegStream.java:283)
    at org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:160)
    at org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:193)
    at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.Tika.parseToString(Tika.java:380)
    ...
{noformat}
{noformat}
XWiki Solr index thread daemon prio=10 tid=0x03b72800 nid=0x64b5 runnable [0x7f46f4617000]
   java.lang.Thread.State: RUNNABLE
    at java.io.BufferedInputStream.read1(Unknown Source)
    at java.io.BufferedInputStream.read(Unknown Source)
    - locked <0xcb1be170> (a java.io.BufferedInputStream)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
    at java.io.FilterInputStream.read(Unknown Source)
    at org.apache.tika.io.TailStream.read(TailStream.java:117)
    at org.apache.tika.io.TailStream.skip(TailStream.java:140)
    at org.apache.tika.parser.mp3.MpegStream.skipStream(MpegStream.java:283)
    at org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:160)
    at org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:193)
    at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242
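The dumps above show {{TailStream.skip}} being called over and over. As a general illustration of this failure mode (not the actual Tika code or the 1.5 fix): {{InputStream.skip()}} is allowed to return 0 without having reached end of stream, so a naive "loop until n bytes skipped" can spin forever on a truncated file unless it probes for EOF:

```java
import java.io.IOException;
import java.io.InputStream;

// Defensive skip pattern: fall back to read() when skip() makes no
// progress, so end-of-stream produces a definitive -1 and breaks the loop.
public class SafeSkip {
    /** Skips up to n bytes; returns the count actually skipped (stops at EOF). */
    public static long skipFully(InputStream in, long n) throws IOException {
        long remaining = n;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped > 0) {
                remaining -= skipped;
            } else {
                // skip() made no progress: probe with read() to detect EOF
                if (in.read() == -1) {
                    break; // end of stream reached; avoid an infinite loop
                }
                remaining--; // the probe consumed one byte
            }
        }
        return n - remaining;
    }
}
```

A loop that instead retried {{skip()}} unconditionally would never terminate once the stream is exhausted, which matches the hang described in this issue.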
[jira] [Assigned] (TIKA-1170) Insufficiently specific magic for binary image/cgm files
[ https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II reassigned TIKA-1170:
----------------------------------

Assignee: Ray Gauss II

Insufficiently specific magic for binary image/cgm files
--------------------------------------------------------
             Key: TIKA-1170
             URL: https://issues.apache.org/jira/browse/TIKA-1170
         Project: Tika
      Issue Type: Bug
      Components: mime
Affects Versions: 1.4
        Reporter: Andrew Jackson
        Assignee: Ray Gauss II
        Priority: Minor
     Attachments: 0001-Added-CGM-test-file-test-and-improved-magic.patch, plotutils-example.cgm

I've been running Tika against a large corpus of web archive files, and I'm seeing a number of false positives for image/cgm. The Tika magic is
{code}
<match value="BEGMF" type="string" offset="0"/>
<match value="0x0020" mask="0xffe0" type="string" offset="0"/>
{code}
The issue seems to be that the second magic matcher is not very specific, e.g. matching files that start with 0x002a. To be fair, this is only c. 700 false matches out of 300 million resources, but it would be nice if this could be tightened up.

Looking at the PRONOM signatures
* http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1048&strPageToDisplay=signatures
* http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1049&strPageToDisplay=signatures
* http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1050&strPageToDisplay=signatures
* http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1051&strPageToDisplay=signatures

it seems we have a variable-position marker that changes slightly for each version. Therefore, a more robust signature would be:
{code}
<match value="BEGMF" type="string" offset="0"/>
<match value="0x0020" mask="0xffe0" type="string" offset="0">
  <match value="0x10220001" type="string" offset="2:64"/>
  <match value="0x10220002" type="string" offset="2:64"/>
  <match value="0x10220003" type="string" offset="2:64"/>
  <match value="0x10220004" type="string" offset="2:64"/>
</match>
{code}
Here I have assumed the filename part of the CGM file will be less than 64 characters long. Could this magic be considered for inclusion?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
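To illustrate why the second matcher is so loose: under mask 0xffe0, any leading big-endian 16-bit word from 0x0020 through 0x003f matches. That range corresponds to a binary CGM BEGIN METAFILE element with any short parameter length, but it also matches unrelated content such as a file starting with 0x002a. A minimal sketch of the masked comparison (hypothetical helper, not Tika code):

```java
// Sketch of the two-byte masked magic check: read the first two bytes as
// a big-endian 16-bit word, keep the top 11 bits (mask 0xFFE0), and
// compare against 0x0020. 32 distinct leading words satisfy this test,
// hence the false positives reported in this issue.
public class CgmMagicSketch {
    public static boolean matchesBeginMetafile(byte b0, byte b1) {
        int word = ((b0 & 0xFF) << 8) | (b1 & 0xFF); // big-endian 16-bit word
        return (word & 0xFFE0) == 0x0020;
    }
}
```

The proposed signature narrows this by additionally requiring one of the version-specific markers (0x1022000n) within the first 64 bytes.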
[jira] [Resolved] (TIKA-1170) Insufficiently specific magic for binary image/cgm files
[ https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II resolved TIKA-1170.
--------------------------------

   Resolution: Fixed
Fix Version/s: 1.5

Added in r1519664. Thanks!

Insufficiently specific magic for binary image/cgm files
--------------------------------------------------------
Key: TIKA-1170
URL: https://issues.apache.org/jira/browse/TIKA-1170
[jira] [Commented] (TIKA-1170) Insufficiently specific magic for binary image/cgm files
[ https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1375#comment-1375 ]

Ray Gauss II commented on TIKA-1170:
------------------------------------

My mistake, that's an artifact of me manually applying the git patch. It does, however, seem to indicate that we should have a unit test for the false positives. Do you have a file which demonstrates that problem?

Insufficiently specific magic for binary image/cgm files
--------------------------------------------------------
Key: TIKA-1170
URL: https://issues.apache.org/jira/browse/TIKA-1170
[jira] [Reopened] (TIKA-1170) Insufficiently specific magic for binary image/cgm files
[ https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II reopened TIKA-1170:
--------------------------------

Insufficiently specific magic for binary image/cgm files
--------------------------------------------------------
Key: TIKA-1170
URL: https://issues.apache.org/jira/browse/TIKA-1170
[jira] [Resolved] (TIKA-1170) Insufficiently specific magic for binary image/cgm files
[ https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II resolved TIKA-1170.
--------------------------------

Resolution: Fixed

Resolved in r1519792. SVN did not like the html extension on the problem file. Thanks again.

Insufficiently specific magic for binary image/cgm files
--------------------------------------------------------
        Key: TIKA-1170
        URL: https://issues.apache.org/jira/browse/TIKA-1170
Attachments: 0001-Added-CGM-test-file-test-and-improved-magic.patch, 0002-Added-example-malformed-HTML-file-that-was-being-mis.patch, plotutils-example.cgm
[jira] [Commented] (TIKA-1170) Insufficiently specific magic for binary image/cgm files
[ https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13757000#comment-13757000 ]

Ray Gauss II commented on TIKA-1170:
------------------------------------

Yes, but in this particular case I thought it might be better to explicitly change the file name so other developers don't fix the media type for that file in the future.

Insufficiently specific magic for binary image/cgm files
--------------------------------------------------------
Key: TIKA-1170
URL: https://issues.apache.org/jira/browse/TIKA-1170
[jira] [Assigned] (TIKA-1166) FLVParser NullPointerException
[ https://issues.apache.org/jira/browse/TIKA-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II reassigned TIKA-1166:
----------------------------------

Assignee: Ray Gauss II

FLVParser NullPointerException
------------------------------
              Key: TIKA-1166
              URL: https://issues.apache.org/jira/browse/TIKA-1166
          Project: Tika
       Issue Type: Bug
       Components: parser
 Affects Versions: 1.1, 1.2, 1.3, 1.4
      Environment: All
         Reporter: david rapin
         Assignee: Ray Gauss II
           Labels: easyfix
      Attachments: data.mp4
Original Estimate: 10m
Remaining Estimate: 10m

On certain video files, the FLV parser throws an NPE on line 242. The piece of code causing this is the following: https://github.com/apache/tika/blob/1.4/tika-parsers/src/main/java/org/apache/tika/parser/video/FLVParser.java#L242
{noformat}
241: for (Entry<String, Object> entry : extractedMetadata.entrySet()) {
242:     metadata.set(entry.getKey(), entry.getValue().toString());
243: }
{noformat}
which should probably be replaced by something like this:
{noformat}
241: for (Entry<String, Object> entry : extractedMetadata.entrySet()) {
242:     if (entry.getValue() == null) continue;
243:     metadata.set(entry.getKey(), entry.getValue().toString());
244: }
{noformat}
Exception trace:
{noformat}
[root@hermes backend]# java -jar bin/tika-app-1.1.jar -j ./data.mp4
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.video.FLVParser@58d9660d
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:130)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:397)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:101)
Caused by: java.lang.NullPointerException
    at org.apache.tika.parser.video.FLVParser.parse(FLVParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 5 more
{noformat}
[jira] [Resolved] (TIKA-1166) FLVParser NullPointerException
[ https://issues.apache.org/jira/browse/TIKA-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Gauss II resolved TIKA-1166. Resolution: Fixed Fix Version/s: 1.5 I briefly tried a few methods of trimming the problem file's size but none reproduced the issue in the resulting file. Committed a check for null in r1518318. FLVParser NullPointerException -- Key: TIKA-1166 URL: https://issues.apache.org/jira/browse/TIKA-1166 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.1, 1.2, 1.3, 1.4 Environment: All Reporter: david rapin Assignee: Ray Gauss II Labels: easyfix Fix For: 1.5 Attachments: data.mp4 Original Estimate: 10m Remaining Estimate: 10m On certain video files, the FLV parser throws an NPE on line 242. The piece of code causing this is the following: https://github.com/apache/tika/blob/1.4/tika-parsers/src/main/java/org/apache/tika/parser/video/FLVParser.java#L242 {noformat}241: for (EntryString, Object entry : extractedMetadata.entrySet()) { 242: metadata.set(entry.getKey(), entry.getValue().toString()); 243: } {noformat} Which should probably be replaced by something like this: {noformat}241: for (EntryString, Object entry : extractedMetadata.entrySet()) { 242: if (entry.getValue() == null) continue; 243: metadata.set(entry.getKey(), entry.getValue().toString()); 244: } {noformat} Exception trace : {noformat}[root@hermes backend]# java -jar bin/tika-app-1.1.jar -j ./data.mp4 Exception in thread main org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.video.FLVParser@58d9660d at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:130) at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:397) at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:101) 
Caused by: java.lang.NullPointerException
    at org.apache.tika.parser.video.FLVParser.parse(FLVParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 5 more
{noformat}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
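The null guard committed for this issue can be illustrated outside of Tika with plain collections. This is only a sketch of the pattern, not FLVParser's actual code; the class and metadata keys here are made up for illustration:

```java
import java.util.HashMap;
import java.util.Map;

public class NullSafeCopy {
    // Copy only non-null values into the target map, mirroring the guard
    // proposed for FLVParser line 242: calling toString() on a null value
    // extracted from the container would throw an NPE.
    static Map<String, String> copyNonNull(Map<String, Object> extracted) {
        Map<String, String> metadata = new HashMap<>();
        for (Map.Entry<String, Object> entry : extracted.entrySet()) {
            if (entry.getValue() == null) continue; // skip entries that would NPE
            metadata.put(entry.getKey(), entry.getValue().toString());
        }
        return metadata;
    }

    public static void main(String[] args) {
        Map<String, Object> extracted = new HashMap<>();
        extracted.put("duration", 12.5);
        extracted.put("audiocodecid", null); // some files carry null metadata values
        System.out.println(copyNonNull(extracted));
    }
}
```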
[jira] [Commented] (TIKA-1166) FLVParser NullPointerException
[ https://issues.apache.org/jira/browse/TIKA-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747529#comment-13747529 ] Ray Gauss II commented on TIKA-1166: Thanks. Is there any chance you could get that down to under, say, 50k, while still demonstrating the failure so that we can include it in the dist and create a unit test against it? FLVParser NullPointerException -- Key: TIKA-1166 URL: https://issues.apache.org/jira/browse/TIKA-1166 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.1, 1.2, 1.3, 1.4 Environment: All Reporter: david rapin Labels: easyfix Attachments: data.mp4 Original Estimate: 10m Remaining Estimate: 10m On certain video files, the FLV parser throws an NPE on line 242. The piece of code causing this is the following: https://github.com/apache/tika/blob/1.4/tika-parsers/src/main/java/org/apache/tika/parser/video/FLVParser.java#L242
{noformat}
241: for (Entry<String, Object> entry : extractedMetadata.entrySet()) {
242:     metadata.set(entry.getKey(), entry.getValue().toString());
243: }
{noformat}
Which should probably be replaced by something like this:
{noformat}
241: for (Entry<String, Object> entry : extractedMetadata.entrySet()) {
242:     if (entry.getValue() == null) continue;
243:     metadata.set(entry.getKey(), entry.getValue().toString());
244: }
{noformat}
Exception trace:
{noformat}
[root@hermes backend]# java -jar bin/tika-app-1.1.jar -j ./data.mp4
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.video.FLVParser@58d9660d
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:130)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:397)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:101)
Caused by: java.lang.NullPointerException
    at org.apache.tika.parser.video.FLVParser.parse(FLVParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 5 more
{noformat}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (TIKA-1154) Tika hangs on format detection of malformed HTML file.
[ https://issues.apache.org/jira/browse/TIKA-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13720694#comment-13720694 ] Ray Gauss II commented on TIKA-1154: I've been pushing the metadata-extractor Maven release through Sonatype thus far, but Mr. Noakes has been granted access there [1]. If there's no response to your Google code issue I can push a 2.6.2.1 release that upgrades xercesImpl to 2.11.0 which, on first look, compiles and has no test failures. [1] https://issues.sonatype.org/browse/OSSRH-3948 Tika hangs on format detection of malformed HTML file. -- Key: TIKA-1154 URL: https://issues.apache.org/jira/browse/TIKA-1154 Project: Tika Issue Type: Bug Components: mime Affects Versions: 1.4 Reporter: Andrew Jackson Priority: Minor Attachments: tika-breaker.html We are using Tika on large web archives, which also happen to contain some malformed files. In particular, we found a HTML file with binary characters in the DOCTYPE declaration. This hangs Tika, either embedded or from the command line, during format detection. An example file is attached. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: Tika Core and Parsers Test Artifacts
Hi Ken, Yes, by other tika projects I meant tika-app, tika-bundle, tika-xmp, etc., and yes each sub-project would end up with its own test-jar. It probably makes more sense to just add the plugin to each project individually. Since there's been no opposition to the concept in general I'll create a JIRA issue where we can discuss the details. Regards, Ray On Jul 21, 2013, at 3:25 PM, Ken Krugler kkrugler_li...@transpac.com wrote: Hi Ray, On Jul 18, 2013, at 6:37am, Ray Gauss II wrote: Hi Ken, They recommend <type>test-jar</type> instead of classifier now [1], but yes. Thanks for the reference. Perhaps the other tika projects could benefit from this as well and it could just go into tika-parent's build plugins. By other tika projects do you mean things like tika-app? And if it's in the tika-parent's build plugins, does that mean each sub-project would wind up with its own corresponding test-jar? Thanks, -- Ken [1] http://maven.apache.org/guides/mini/guide-attached-tests.html On Jul 18, 2013, at 9:19 AM, Ken Krugler kkrugler_li...@transpac.com wrote: Hi Ray, On Jul 18, 2013, at 5:14am, Ray Gauss II wrote: I don't recall if we've discussed this already (I did do a brief search and didn't see anything). Is there any opposition to adding test-jar Maven artifacts for tika-core and tika-parsers? Seems like it would be good to allow others to extend from tests there if need be. +1 I assume you're talking about adding a tika-(core|parsers)-<version>-tests.jar, so that we'd pull it in via:
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.4</version>
    <classifier>tests</classifier>
    <scope>test</scope>
</dependency>
-- Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions training Hadoop, Cascading, Cassandra Solr -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions training Hadoop, Cascading, Cassandra Solr
[jira] [Created] (TIKA-1151) Maven Build Should Automatically Produce test-jar Artifacts
Ray Gauss II created TIKA-1151: -- Summary: Maven Build Should Automatically Produce test-jar Artifacts Key: TIKA-1151 URL: https://issues.apache.org/jira/browse/TIKA-1151 Project: Tika Issue Type: Improvement Components: packaging Reporter: Ray Gauss II Assignee: Ray Gauss II The Maven build should be updated to produce test jar artifacts for appropriate sub-projects (see below) such that developers can extend test classes by adding the {{test-jar}} artifact as a dependency, i.e.:
{code}
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.5-SNAPSHOT</version>
    <type>test-jar</type>
    <scope>test</scope>
</dependency>
{code}
The following sub-projects contain tests that developers might want to extend and their corresponding {{pom.xml}} should have the [attached tests|http://maven.apache.org/guides/mini/guide-attached-tests.html] added: - tika-app - tika-bundle - tika-core - tika-parsers - tika-server - tika-xmp -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
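For reference, the plugin configuration the linked "attached tests" guide describes is roughly the following sketch; the exact placement in tika-parent versus each sub-project's {{pom.xml}} is what the issue leaves open:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-jar-plugin</artifactId>
  <executions>
    <execution>
      <goals>
        <!-- packages src/test classes into an additional *-tests.jar artifact -->
        <goal>test-jar</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```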
Tika Core and Parsers Test Artifacts
I don't recall if we've discussed this already (I did do a brief search and didn't see anything). Is there any opposition to adding test-jar Maven artifacts for tika-core and tika-parsers? Seems like it would be good to allow others to extend from tests there if need be.
Re: Tika Core and Parsers Test Artifacts
Hi Ken, They recommend <type>test-jar</type> instead of classifier now [1], but yes. Perhaps the other tika projects could benefit from this as well and it could just go into tika-parent's build plugins. Regards, Ray [1] http://maven.apache.org/guides/mini/guide-attached-tests.html On Jul 18, 2013, at 9:19 AM, Ken Krugler kkrugler_li...@transpac.com wrote: Hi Ray, On Jul 18, 2013, at 5:14am, Ray Gauss II wrote: I don't recall if we've discussed this already (I did do a brief search and didn't see anything). Is there any opposition to adding test-jar Maven artifacts for tika-core and tika-parsers? Seems like it would be good to allow others to extend from tests there if need be. +1 I assume you're talking about adding a tika-(core|parsers)-<version>-tests.jar, so that we'd pull it in via:
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.4</version>
    <classifier>tests</classifier>
    <scope>test</scope>
</dependency>
-- Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions training Hadoop, Cascading, Cassandra Solr
[jira] [Created] (TIKA-1147) Passing a File-Based TikaInputStream to ExternalEmbedder Delete
Ray Gauss II created TIKA-1147: -- Summary: Passing a File-Based TikaInputStream to ExternalEmbedder Delete Key: TIKA-1147 URL: https://issues.apache.org/jira/browse/TIKA-1147 Project: Tika Issue Type: Bug Reporter: Ray Gauss II -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (TIKA-1147) File-Based TikaInputStreams are Deleted by ExternalEmbedder.embed
[ https://issues.apache.org/jira/browse/TIKA-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Gauss II updated TIKA-1147: --- Component/s: metadata Description: When an application using Tika passes {{InputStream}} objects to {{ExternalEmbedder.embed}} the stream is usually read into a temporary file which is then deleted after embedding takes place. However, if the application passes in a file-based {{TikaInputStream}} the embedder ends up dealing directly with the original source file, which is then deleted after embedding takes place. Priority: Critical (was: Major) Affects Version/s: 1.4 Assignee: Ray Gauss II Summary: File-Based TikaInputStreams are Deleted by ExternalEmbedder.embed (was: Passing a File-Based TikaInputStream to ExternalEmbedder Delete) File-Based TikaInputStreams are Deleted by ExternalEmbedder.embed - Key: TIKA-1147 URL: https://issues.apache.org/jira/browse/TIKA-1147 Project: Tika Issue Type: Bug Components: metadata Affects Versions: 1.4 Reporter: Ray Gauss II Assignee: Ray Gauss II Priority: Critical When an application using Tika passes {{InputStream}} objects to {{ExternalEmbedder.embed}} the stream is usually read into a temporary file which is then deleted after embedding takes place. However, if the application passes in a file-based {{TikaInputStream}} the embedder ends up dealing directly with the original source file, which is then deleted after embedding takes place. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (TIKA-1147) File-Based TikaInputStreams are Deleted by ExternalEmbedder.embed
[ https://issues.apache.org/jira/browse/TIKA-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Gauss II resolved TIKA-1147. Resolution: Fixed Fix Version/s: 1.5 Resolved in r1504302. File-Based TikaInputStreams are Deleted by ExternalEmbedder.embed - Key: TIKA-1147 URL: https://issues.apache.org/jira/browse/TIKA-1147 Project: Tika Issue Type: Bug Components: metadata Affects Versions: 1.4 Reporter: Ray Gauss II Assignee: Ray Gauss II Priority: Critical Fix For: 1.5 When an application using Tika passes {{InputStream}} objects to {{ExternalEmbedder.embed}} the stream is usually read into a temporary file which is then deleted after embedding takes place. However, if the application passes in a file-based {{TikaInputStream}} the embedder ends up dealing directly with the original source file, which is then deleted after embedding takes place. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
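The bug above boils down to deleting a file the embedder did not create. A minimal sketch of the ownership check such a fix implies, using hypothetical names (SpoolGuard, fileFor, ownsFile) rather than Tika's actual ExternalEmbedder internals:

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

public class SpoolGuard {
    // Spool the stream to a temp file only when no backing file is known,
    // and remember whether we own (and therefore may delete) the file.
    static File sourceFile;   // set when the caller already has a file-backed stream
    static boolean ownsFile;  // true only for temp files we created ourselves

    static File fileFor(InputStream stream) throws IOException {
        if (sourceFile != null) {
            ownsFile = false;          // caller's original file: never delete it
            return sourceFile;
        }
        File tmp = File.createTempFile("embed", ".tmp");
        Files.copy(stream, tmp.toPath(), StandardCopyOption.REPLACE_EXISTING);
        ownsFile = true;               // our temp copy: safe to delete afterwards
        return tmp;
    }

    static void cleanup(File f) {
        if (ownsFile) {
            f.delete();                // only remove what we spooled ourselves
        }
    }

    public static void main(String[] args) throws IOException {
        File f = fileFor(new ByteArrayInputStream("hello".getBytes()));
        System.out.println(ownsFile); // we spooled this one, so cleanup removes it
        cleanup(f);
    }
}
```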
Re: RFC822Parser build error on gump
I know very little about gump, but looking at the log the build seems to have skipped the mime4j artifacts altogether. On Jun 25, 2013, at 6:25 PM, Nick Burch apa...@gagravarr.org wrote: Hi All Anyone have any idea about this compiler error on the tika parsers project as hit by gump? http://vmgump.apache.org/gump/public/tika/tika-parsers/gump_work/build_tika_tika-parsers.html Gump notifications will hopefully start again soon, which'd let us find out about breaking changes from upstream Apache projects in advance, so it'd be good to get the build working ready! Nick
[jira] [Commented] (TIKA-1130) .docx text extract leaves out some portions of text
[ https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13682644#comment-13682644 ] Ray Gauss II commented on TIKA-1130: I've created a unit test that reproduces the issue with a stripped down version of the original file. Shall I comment out the actual test and commit? .docx text extract leaves out some portions of text --- Key: TIKA-1130 URL: https://issues.apache.org/jira/browse/TIKA-1130 Project: Tika Issue Type: Bug Affects Versions: 1.2, 1.3 Environment: OpenJDK x86_64 Reporter: Daniel Gibby Priority: Critical Attachments: Resume 6.4.13.docx When parsing a Microsoft Word .docx (application/vnd.openxmlformats-officedocument.wordprocessingml.document), certain portions of text remain unextracted. I have attached a .docx file that can be tested against. The 'gray' portions of text are what are not extracted, while the darker colored text extracts fine. Looking at the document.xml portion of the .docx zip file shows the text is all there. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (TIKA-1130) .docx text extract leaves out some portions of text
[ https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13682924#comment-13682924 ] Ray Gauss II commented on TIKA-1130: Test file and method committed in r1492909. This was just added onto {{OOXMLParserTest}} and named with a {{disabled}} prefix rather than using {{@Ignore}}. I think we should start moving towards that for new test classes though. .docx text extract leaves out some portions of text --- Key: TIKA-1130 URL: https://issues.apache.org/jira/browse/TIKA-1130 Project: Tika Issue Type: Bug Affects Versions: 1.2, 1.3 Environment: OpenJDK x86_64 Reporter: Daniel Gibby Priority: Critical Attachments: Resume 6.4.13.docx When parsing a Microsoft Word .docx (application/vnd.openxmlformats-officedocument.wordprocessingml.document), certain portions of text remain unextracted. I have attached a .docx file that can be tested against. The 'gray' portions of text are what are not extracted, while the darker colored text extracts fine. Looking at the document.xml portion of the .docx zip file shows the text is all there. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (TIKA-1135) Incorrect Cardinality and Case in IPTC Metadata Definition
[ https://issues.apache.org/jira/browse/TIKA-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Gauss II resolved TIKA-1135. Resolution: Fixed Resolved in r1491935. Incorrect Cardinality and Case in IPTC Metadata Definition -- Key: TIKA-1135 URL: https://issues.apache.org/jira/browse/TIKA-1135 Project: Tika Issue Type: Bug Components: metadata Affects Versions: 1.3 Reporter: Ray Gauss II Assignee: Ray Gauss II Priority: Minor Fix For: 1.4 Some of the fields defined in the {{IPTC}} interface have incorrect cardinality and metadata key names with incorrect case. The change of key names should be done through composite properties which include deprecated versions of the incorrect names as secondary properties for backwards compatibility. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (TIKA-1135) Incorrect Cardinality and Case in IPTC Metadata Definition
Ray Gauss II created TIKA-1135: -- Summary: Incorrect Cardinality and Case in IPTC Metadata Definition Key: TIKA-1135 URL: https://issues.apache.org/jira/browse/TIKA-1135 Project: Tika Issue Type: Bug Components: metadata Affects Versions: 1.3 Reporter: Ray Gauss II Assignee: Ray Gauss II Priority: Minor Fix For: 1.4 Some of the fields defined in the {{IPTC}} interface have incorrect cardinality and metadata key names with incorrect case. The change of key names should be done through composite properties which include deprecated versions of the incorrect names as secondary properties for backwards compatibility. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (TIKA-1133) Ability to Allow Empty and Duplicate Tika Values for XML Elements
Ray Gauss II created TIKA-1133: -- Summary: Ability to Allow Empty and Duplicate Tika Values for XML Elements Key: TIKA-1133 URL: https://issues.apache.org/jira/browse/TIKA-1133 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.3 Reporter: Ray Gauss II Assignee: Ray Gauss II In some cases it is beneficial to allow empty and duplicate Tika metadata values for multi-valued XML elements like RDF bags. Consider an example where the original source metadata is structured something like:
{code}
<Person>
    <FirstName>John</FirstName>
    <LastName>Smith</LastName>
</Person>
<Person>
    <FirstName>Jane</FirstName>
    <LastName>Doe</LastName>
</Person>
<Person>
    <FirstName>Bob</FirstName>
</Person>
<Person>
    <FirstName>Kate</FirstName>
    <LastName>Smith</LastName>
</Person>
{code}
and since Tika stores only flat metadata we transform that before invoking a parser to something like:
{code}
<custom:FirstName>
    <rdf:Bag>
        <rdf:li>John</rdf:li>
        <rdf:li>Jane</rdf:li>
        <rdf:li>Bob</rdf:li>
        <rdf:li>Kate</rdf:li>
    </rdf:Bag>
</custom:FirstName>
<custom:LastName>
    <rdf:Bag>
        <rdf:li>Smith</rdf:li>
        <rdf:li>Doe</rdf:li>
        <rdf:li></rdf:li>
        <rdf:li>Smith</rdf:li>
    </rdf:Bag>
</custom:LastName>
{code}
The current behavior ignores empties and duplicates and we don't know if Bob or Kate ever had last names. Empties or duplicates in other positions result in an incorrect mapping of data. We should allow the option to create an {{ElementMetadataHandler}} which allows empty and/or duplicate values. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (TIKA-1133) Ability to Allow Empty and Duplicate Tika Values for XML Elements
[ https://issues.apache.org/jira/browse/TIKA-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Gauss II resolved TIKA-1133. Resolution: Fixed Fix Version/s: 1.4 Resolved in r1491680. Ability to Allow Empty and Duplicate Tika Values for XML Elements - Key: TIKA-1133 URL: https://issues.apache.org/jira/browse/TIKA-1133 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.3 Reporter: Ray Gauss II Assignee: Ray Gauss II Fix For: 1.4 In some cases it is beneficial to allow empty and duplicate Tika metadata values for multi-valued XML elements like RDF bags. Consider an example where the original source metadata is structured something like:
{code}
<Person>
    <FirstName>John</FirstName>
    <LastName>Smith</LastName>
</Person>
<Person>
    <FirstName>Jane</FirstName>
    <LastName>Doe</LastName>
</Person>
<Person>
    <FirstName>Bob</FirstName>
</Person>
<Person>
    <FirstName>Kate</FirstName>
    <LastName>Smith</LastName>
</Person>
{code}
and since Tika stores only flat metadata we transform that before invoking a parser to something like:
{code}
<custom:FirstName>
    <rdf:Bag>
        <rdf:li>John</rdf:li>
        <rdf:li>Jane</rdf:li>
        <rdf:li>Bob</rdf:li>
        <rdf:li>Kate</rdf:li>
    </rdf:Bag>
</custom:FirstName>
<custom:LastName>
    <rdf:Bag>
        <rdf:li>Smith</rdf:li>
        <rdf:li>Doe</rdf:li>
        <rdf:li></rdf:li>
        <rdf:li>Smith</rdf:li>
    </rdf:Bag>
</custom:LastName>
{code}
The current behavior ignores empties and duplicates and we don't know if Bob or Kate ever had last names. Empties or duplicates in other positions result in an incorrect mapping of data. We should allow the option to create an {{ElementMetadataHandler}} which allows empty and/or duplicate values. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
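The alignment problem this issue describes is easy to demonstrate with plain lists: when empty values are dropped from one bag, positional correlation with the other bag is lost. A small illustration using the names from the example above (only the empty-value case is simulated here; the class and method names are made up):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BagAlignment {
    // Simulate a handler that either drops or keeps empty values.
    static List<String> collect(List<String> values, boolean allowEmpty) {
        List<String> out = new ArrayList<>();
        for (String v : values) {
            if (!allowEmpty && v.isEmpty()) continue; // old behavior: skip empties
            out.add(v);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList("John", "Jane", "Bob", "Kate");
        List<String> last = Arrays.asList("Smith", "Doe", "", "Smith");
        // Dropping Bob's empty last name shifts Kate's last name onto Bob:
        System.out.println(collect(last, false)); // positions no longer line up with first
        // Keeping empties preserves the index-wise pairing with first:
        System.out.println(collect(last, true));
    }
}
```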
Re: MP4Parser triggers .... something between an exception and endDocument() from the ContentHandler's point of view?
I think the Parser interface Javadoc would make sense as a place to document, but I don't know if there is an existing policy. We'll certainly need to consider things like DelegatingParsers which may be using other parsers to do portions of the work. Not the principle comment you were looking for, but my 2 cents. Ray On Jun 7, 2013, at 7:30 AM, Christian Reuschling reuschl...@dfki.uni-kl.de wrote: it would be very interesting if somebody has a principle comment on this thread... On 29.05.2013 14:42, Nick Burch wrote: On Wed, 29 May 2013, Christian Reuschling wrote: Nevertheless, in this case an Exception (like in all other parsers) or a tika body with length zero, which is indicated at least by handler.endDocument() would be the appropriate way, isn't it? - From the ContentHandlers point of view, there is nothing in between. I'm not sure if we do have a properly documented policy on what a parser should do if it receives a file it can't handle. For ones that are invalid (eg corrupt), I believe an exception is the expected result. The case when the file seems valid, but can't be handled by the parser, not sure Does anyone know if we have a policy on this, and/or where we should document it? Nick - -- __ Christian Reuschling, Dipl.-Ing.(BA) Software Engineer Knowledge Management Department German Research Center for Artificial Intelligence DFKI GmbH Trippstadter Straße 122, D-67663 Kaiserslautern, Germany Phone: +49.631.20575-1250 mailto:reuschl...@dfki.de http://www.dfki.uni-kl.de/~reuschling/ - Legal Company Information Required by German Law-- Geschäftsführung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender) Dr. Walter Olthoff Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A.
Aukes Amtsgericht Kaiserslautern, HRB 2313
[jira] [Assigned] (TIKA-1115) ExifHandler throws NullPointerException
[ https://issues.apache.org/jira/browse/TIKA-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Gauss II reassigned TIKA-1115: -- Assignee: Ray Gauss II ExifHandler throws NullPointerException --- Key: TIKA-1115 URL: https://issues.apache.org/jira/browse/TIKA-1115 Project: Tika Issue Type: Bug Components: metadata Affects Versions: 1.3 Environment: verified on Mac OSX and Ubuntu 12.04 Reporter: Lee Graber Assignee: Ray Gauss II Labels: ImageMetadataExtractor Attachments: 654000main_transit-hubble-orig_full.jpg Original Estimate: 2h Remaining Estimate: 2h Notice that in the second if block, there is no check for null on the retrieved datetime. I have hit this with a file which apparently has null for this value. Seems like the fix is trivial:
public void handleDateTags(Directory directory, Metadata metadata) throws MetadataException {
    // Date/Time Original overrides value from ExifDirectory.TAG_DATETIME
    Date original = null;
    if (directory.containsTag(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL)) {
        original = directory.getDate(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL);
        // Unless we have GPS time we don't know the time zone so date must be set
        // as ISO 8601 datetime without timezone suffix (no Z or +/-)
        if (original != null) {
            String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.format(original); // Same time zone as Metadata Extractor uses
            metadata.set(TikaCoreProperties.CREATED, datetimeNoTimeZone);
            metadata.set(Metadata.ORIGINAL_DATE, datetimeNoTimeZone);
        }
    }
    if (directory.containsTag(ExifIFD0Directory.TAG_DATETIME)) {
        Date datetime = directory.getDate(ExifIFD0Directory.TAG_DATETIME);
        String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.format(datetime);
        metadata.set(TikaCoreProperties.MODIFIED, datetimeNoTimeZone);
        // If Date/Time Original does not exist this might be creation date
        if (metadata.get(TikaCoreProperties.CREATED) == null) {
            metadata.set(TikaCoreProperties.CREATED, datetimeNoTimeZone);
        }
    }
}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
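The missing guard mirrors the one already present for the original date in the code above. A standalone sketch of the pattern (SimpleDateFormat stands in for Tika's DATE_UNSPECIFIED_TZ constant, which is an assumption about its type; the class and method names are made up):

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class DateGuard {
    static final SimpleDateFormat FMT = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");

    // Return the formatted date, or null when the directory carried a null date.
    // Without the guard, format(null) throws the NullPointerException reported here.
    static String formatOrNull(Date datetime) {
        if (datetime == null) {
            return null;
        }
        return FMT.format(datetime);
    }

    public static void main(String[] args) {
        System.out.println(formatOrNull(null));
        System.out.println(formatOrNull(new Date(0L)));
    }
}
```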
[jira] [Commented] (TIKA-1115) ExifHandler throws NullPointerException
[ https://issues.apache.org/jira/browse/TIKA-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13646709#comment-13646709 ] Ray Gauss II commented on TIKA-1115: Hi Lee, Do we have permission to include the problem file at a greatly reduced size, say 64px wide, as a test file? ExifHandler throws NullPointerException --- Key: TIKA-1115 URL: https://issues.apache.org/jira/browse/TIKA-1115 Project: Tika Issue Type: Bug Components: metadata Affects Versions: 1.3 Environment: verified on Mac OSX and Ubuntu 12.04 Reporter: Lee Graber Assignee: Ray Gauss II Labels: ImageMetadataExtractor Attachments: 654000main_transit-hubble-orig_full.jpg Original Estimate: 2h Remaining Estimate: 2h Notice that in the second if block, there is no check for null on the retrieved datetime. I have hit this with a file which apparently has null for this value. Seems like the fix is trivial:
public void handleDateTags(Directory directory, Metadata metadata) throws MetadataException {
    // Date/Time Original overrides value from ExifDirectory.TAG_DATETIME
    Date original = null;
    if (directory.containsTag(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL)) {
        original = directory.getDate(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL);
        // Unless we have GPS time we don't know the time zone so date must be set
        // as ISO 8601 datetime without timezone suffix (no Z or +/-)
        if (original != null) {
            String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.format(original); // Same time zone as Metadata Extractor uses
            metadata.set(TikaCoreProperties.CREATED, datetimeNoTimeZone);
            metadata.set(Metadata.ORIGINAL_DATE, datetimeNoTimeZone);
        }
    }
    if (directory.containsTag(ExifIFD0Directory.TAG_DATETIME)) {
        Date datetime = directory.getDate(ExifIFD0Directory.TAG_DATETIME);
        String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.format(datetime);
        metadata.set(TikaCoreProperties.MODIFIED, datetimeNoTimeZone);
        // If Date/Time Original does not exist this might be creation date
        if (metadata.get(TikaCoreProperties.CREATED) == null) {
            metadata.set(TikaCoreProperties.CREATED, datetimeNoTimeZone);
        }
    }
}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (TIKA-1115) ExifHandler throws NullPointerException
[ https://issues.apache.org/jira/browse/TIKA-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Gauss II resolved TIKA-1115. Resolution: Fixed Fix Version/s: 1.4 Resolved in r1478111 ExifHandler throws NullPointerException --- Key: TIKA-1115 URL: https://issues.apache.org/jira/browse/TIKA-1115 Project: Tika Issue Type: Bug Components: metadata Affects Versions: 1.3 Environment: verified on Mac OSX and Ubuntu 12.04 Reporter: Lee Graber Assignee: Ray Gauss II Labels: ImageMetadataExtractor Fix For: 1.4 Attachments: 654000main_transit-hubble-orig_full.jpg Original Estimate: 2h Remaining Estimate: 2h Notice that in the second if block, there is no check for null on the retrieved datetime. I have hit this with a file which apparently has null for this value. Seems like the fix is trivial:
public void handleDateTags(Directory directory, Metadata metadata) throws MetadataException {
    // Date/Time Original overrides value from ExifDirectory.TAG_DATETIME
    Date original = null;
    if (directory.containsTag(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL)) {
        original = directory.getDate(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL);
        // Unless we have GPS time we don't know the time zone so date must be set
        // as ISO 8601 datetime without timezone suffix (no Z or +/-)
        if (original != null) {
            String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.format(original); // Same time zone as Metadata Extractor uses
            metadata.set(TikaCoreProperties.CREATED, datetimeNoTimeZone);
            metadata.set(Metadata.ORIGINAL_DATE, datetimeNoTimeZone);
        }
    }
    if (directory.containsTag(ExifIFD0Directory.TAG_DATETIME)) {
        Date datetime = directory.getDate(ExifIFD0Directory.TAG_DATETIME);
        String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.format(datetime);
        metadata.set(TikaCoreProperties.MODIFIED, datetimeNoTimeZone);
        // If Date/Time Original does not exist this might be creation date
        if (metadata.get(TikaCoreProperties.CREATED) == null) {
            metadata.set(TikaCoreProperties.CREATED, datetimeNoTimeZone);
        }
    }
}
--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: Build failed in Jenkins: Tika-trunk #994
Looks like a possible build server problem. Does anyone have access to manually trigger another build? Regards, Ray On May 1, 2013, at 5:01 PM, Apache Jenkins Server jenk...@builds.apache.org wrote: See https://builds.apache.org/job/Tika-trunk/994/changes
Re: Build failed in Jenkins: Tika-trunk #994
Subject: Jenkins build is back to normal : Tika-trunk #995 Yay, thanks! On May 1, 2013, at 5:24 PM, Michael McCandless luc...@mikemccandless.com wrote: I just kicked off another build ... (it's queued). Mike McCandless http://blog.mikemccandless.com On Wed, May 1, 2013 at 5:12 PM, Ray Gauss II ray.ga...@alfresco.com wrote: Looks like a possible build server problem. Does anyone have access to manually trigger another build? Regards, Ray On May 1, 2013, at 5:01 PM, Apache Jenkins Server jenk...@builds.apache.org wrote: See https://builds.apache.org/job/Tika-trunk/994/changes
[jira] [Commented] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document
[ https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13584194#comment-13584194 ]

Ray Gauss II commented on TIKA-1074:

bq. But it's a little weird throw TikaExc in response to an interrupt (ie, code above will be trying to catch an IE) ... I think it's cleaner to set the interrupt bit and let the next place that waits see the interrupt bit and throw IE?

That's what I found in my investigation for TIKA-775 / TIKA-1059 as well.

Extraction should continue if an exception is hit visiting an embedded document
-------------------------------------------------------------------------------
Key: TIKA-1074
URL: https://issues.apache.org/jira/browse/TIKA-1074
Project: Tika
Issue Type: Improvement
Components: parser
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 1.4
Attachments: TIKA-1074.patch, TIKA-1074.patch

Spinoff from TIKA-1072. In that issue, a problematic document (still not sure if the document is corrupt, or whether it's possibly a POI bug) caused an exception when visiting the embedded documents. If I change Tika to suppress that exception, the rest of the document extracts fine. So somehow I think we should be more robust here, and maybe log the exception, or save/record the exception(s) somewhere so after parsing the app could decide what to do about them ...
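The "set the interrupt bit" approach endorsed in the comment above is standard Java practice: rather than swallowing {{InterruptedException}} or immediately translating it, re-assert the thread's interrupt status so the next blocking call (or any caller that checks) still observes the interrupt. A standalone sketch of the pattern, not the actual Tika code:

```java
class InterruptRestore {
    // If interrupted during a blocking call, restore the interrupt bit
    // instead of swallowing the exception, so the next wait point (or any
    // caller checking the flag) still sees the interrupt and can throw
    // InterruptedException itself.
    static void doWork() {
        try {
            Thread.sleep(10);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // re-assert interrupt status
        }
    }

    public static void main(String[] args) {
        Thread.currentThread().interrupt();  // simulate an interrupt arriving
        doWork();  // sleep() throws immediately; catch block restores the flag
        System.out.println(Thread.currentThread().isInterrupted());  // prints "true"
    }
}
```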
[jira] [Commented] (TIKA-1068) Metadata-extractor throws NoSuchMethodError for jpg image with xmp header data
[ https://issues.apache.org/jira/browse/TIKA-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13566693#comment-13566693 ]

Ray Gauss II commented on TIKA-1068:

I can't reproduce this using tika-app from either the download distribution or compiled from source. We're using the 2.6.2 metadata-extractor jar from the Maven central repository [1]. I'm not sure how your build is structured, but perhaps you're including a 2.6.2 metadata-extractor jar you've downloaded from elsewhere? If so, can you try replacing that with the one on Maven central?

[1] http://search.maven.org/#artifactdetails%7Ccom.drewnoakes%7Cmetadata-extractor%7C2.6.2%7Cjar

Metadata-extractor throws NoSuchMethodError for jpg image with xmp header data
------------------------------------------------------------------------------
Key: TIKA-1068
URL: https://issues.apache.org/jira/browse/TIKA-1068
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.3
Reporter: Magnus Lövgren
Priority: Critical
Attachments: vinter080501-66.jpg

Using Tika 1.3, parsing of jpg files throws NoSuchMethodError when the jpg contains xmp data. No Error was thrown in Tika 1.2. The metadata-extractor was updated in Tika 1.3 (to com.drewnoakes:metadata-extractor:2.6.2), see TIKA-811 (duplicated by TIKA-996). That jar is badly compiled (as mentioned by Emmanuel Hugonnet in a comment on TIKA-915) and causes the NoSuchMethodError, so the metadata-extractor 2.6.2 jar needs to be replaced! The problem seems fixed in metadata-extractor 2.7.0, but that isn't released yet.
Discussions available at:
http://code.google.com/p/metadata-extractor/issues/detail?id=39
http://code.google.com/p/metadata-extractor/issues/detail?id=55

Code to reproduce problem:

{code}
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-core</artifactId>
  <version>1.3</version>
</dependency>
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-xmp</artifactId>
  <version>1.3</version>
</dependency>
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.3</version>
</dependency>
{code}

{code}
InputStream inputStream = ... // vinter080501-66.jpg file (attached)
ContentHandler contentHandler = new BodyContentHandler(200);
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
Parser parser = new AutoDetectParser();
// Throws NoSuchMethodError
parser.parse(inputStream, contentHandler, metadata, context);
{code}

{noformat}
java.lang.NoSuchMethodError: com.adobe.xmp.properties.XMPPropertyInfo.getValue()Ljava/lang/Object;
    at com.drew.metadata.xmp.XmpReader.extract(Unknown Source)
    at com.drew.imaging.jpeg.JpegMetadataReader.extractMetadataFromJpegSegmentReader(Unknown Source)
    at com.drew.imaging.jpeg.JpegMetadataReader.readMetadata(Unknown Source)
    at org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:91)
    at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
{noformat}
Re: [VOTE] Apache Tika 1.3 Release Candidate #1
Built on OS X, updated tika-exiftool to depend on 1.3, which compiled and passed tests.

+1 for release!

Cheers,
Ray

On Jan 18, 2013, at 11:30 PM, Dave Meikle loo...@gmail.com wrote:

Hi Guys,

A candidate for the Tika 1.3 release is available at:
http://people.apache.org/~dmeikle/apache-tika-1.3-rc1/

The release candidate is a zip archive of the sources in:
http://svn.apache.org/repos/asf/tika/tags/tika-1.3/

The SHA1 checksum of the archive is a80e45d1976e655381d6e93b50b9c7b118e9d6fc.

A staged M2 repository can also be found on repository.apache.org here:
https://repository.apache.org/content/repositories/orgapachetika-147/

Please vote on releasing this package as Apache Tika 1.3. The vote is open for the next 72 hours and passes if a majority of at least three +1 Tika PMC votes are cast.

[ ] +1 Release this package as Apache Tika 1.3
[ ] -1 Do not release this package because...

Here is my +1 for the release.

Cheers,
Dave
[jira] [Created] (TIKA-1059) Better Handling of InterruptedException in ExternalParser and ExternalEmbedder
Ray Gauss II created TIKA-1059:

Summary: Better Handling of InterruptedException in ExternalParser and ExternalEmbedder
Key: TIKA-1059
URL: https://issues.apache.org/jira/browse/TIKA-1059
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.3
Reporter: Ray Gauss II
Fix For: 1.4

The {{ExternalParser}} and {{ExternalEmbedder}} classes currently catch {{InterruptedException}} and ignore it. The methods should either call {{interrupt()}} on the current thread or re-throw the exception, possibly wrapped in a {{TikaException}}. See TIKA-775 for a previous discussion.
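The issue's two suggested remedies can also be combined: restore the interrupt bit and re-throw the exception wrapped in a checked exception. The sketch below shows that shape; it is illustrative only, not Tika's actual {{ExternalParser}} code. {{ToolInterruptedException}} is a stand-in for {{TikaException}}, and the {{Callable}} stands in for the blocking {{Process.waitFor()}} call:

```java
import java.util.concurrent.Callable;

class ExternalToolRunner {
    // Stand-in for org.apache.tika.exception.TikaException in this sketch
    static class ToolInterruptedException extends Exception {
        ToolInterruptedException(String msg, Throwable cause) { super(msg, cause); }
    }

    // Instead of catching and ignoring InterruptedException (the behavior
    // TIKA-1059 flags), restore the interrupt bit and re-throw it wrapped.
    static int await(Callable<Integer> blockingWait) throws ToolInterruptedException {
        try {
            return blockingWait.call();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // preserve interrupt status for callers
            throw new ToolInterruptedException("Interrupted waiting for external tool", e);
        } catch (Exception e) {
            throw new ToolInterruptedException("External tool failed", e);
        }
    }

    public static void main(String[] args) {
        try {
            await(() -> { throw new InterruptedException(); });
        } catch (ToolInterruptedException e) {
            // The cause is preserved and the interrupt bit is still set
            System.out.println(e.getCause() instanceof InterruptedException);  // prints "true"
            System.out.println(Thread.currentThread().isInterrupted());       // prints "true"
        }
    }
}
```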
[jira] [Resolved] (TIKA-775) Embed Capabilities
[ https://issues.apache.org/jira/browse/TIKA-775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II resolved TIKA-775.
Resolution: Fixed
Fix Version/s: (was: 1.4) 1.3
Assignee: Ray Gauss II

Embed Capabilities
------------------
Key: TIKA-775
URL: https://issues.apache.org/jira/browse/TIKA-775
Project: Tika
Issue Type: Improvement
Components: general, metadata
Affects Versions: 1.0
Environment: The default ExternalEmbedder requires that sed be installed.
Reporter: Ray Gauss II
Assignee: Ray Gauss II
Labels: embed, patch
Fix For: 1.3
Attachments: embed_20121029.diff, embed.diff, tika-core-embed-patch.txt, tika-parsers-embed-patch.txt

This patch defines and implements the concept of embedding Tika metadata into a file stream, the reverse of extraction. In the tika-core project, an interface defining an Embedder and a generic sed-based ExternalEmbedder implementation, meant to be extended or configured, are added. These classes are essentially a reverse flow of the existing Parser and ExternalParser classes. In the tika-parsers project, an ExternalEmbedderTest unit test is added which uses the default ExternalEmbedder (calls sed) to embed a value placed in Metadata.DESCRIPTION and then verifies the operation by parsing the resulting stream.
[jira] [Updated] (TIKA-1059) Better Handling of InterruptedException in ExternalParser and ExternalEmbedder
[ https://issues.apache.org/jira/browse/TIKA-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II updated TIKA-1059:
Issue Type: Improvement (was: Bug)

Better Handling of InterruptedException in ExternalParser and ExternalEmbedder
------------------------------------------------------------------------------
Key: TIKA-1059
URL: https://issues.apache.org/jira/browse/TIKA-1059
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.3
Reporter: Ray Gauss II
Fix For: 1.4

The {{ExternalParser}} and {{ExternalEmbedder}} classes currently catch {{InterruptedException}} and ignore it. The methods should either call {{interrupt()}} on the current thread or re-throw the exception, possibly wrapped in a {{TikaException}}. See TIKA-775 for a previous discussion.
[jira] [Assigned] (TIKA-1056) unify ImageMetadataExtractor interface
[ https://issues.apache.org/jira/browse/TIKA-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II reassigned TIKA-1056:
Assignee: Ray Gauss II

unify ImageMetadataExtractor interface
--------------------------------------
Key: TIKA-1056
URL: https://issues.apache.org/jira/browse/TIKA-1056
Project: Tika
Issue Type: Wish
Reporter: Maciej Lizewski
Assignee: Ray Gauss II
Priority: Trivial

There are several methods in this class that are targeted at different image types but have different visibility:

{code}
public void parseJpeg(File file);
protected void parseTiff(InputStream stream);
{code}

Both simply extract all possible metadata from an image file or stream. It would be nice if parseTiff could also be public, so it would be easier to create custom parsers located in external jars that use this functionality.
[jira] [Resolved] (TIKA-1056) unify ImageMetadataExtractor interface
[ https://issues.apache.org/jira/browse/TIKA-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II resolved TIKA-1056.
Resolution: Fixed
Fix Version/s: 1.3

Resolved in r1434117.

unify ImageMetadataExtractor interface
--------------------------------------
Key: TIKA-1056
URL: https://issues.apache.org/jira/browse/TIKA-1056
Project: Tika
Issue Type: Wish
Reporter: Maciej Lizewski
Assignee: Ray Gauss II
Priority: Trivial
Fix For: 1.3

There are several methods in this class that are targeted at different image types but have different visibility:

{code}
public void parseJpeg(File file);
protected void parseTiff(InputStream stream);
{code}

Both simply extract all possible metadata from an image file or stream. It would be nice if parseTiff could also be public, so it would be easier to create custom parsers located in external jars that use this functionality.
[jira] [Resolved] (TIKA-962) Backwards Compatibility for Metadata.LAST_AUTHOR is Broken
[ https://issues.apache.org/jira/browse/TIKA-962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II resolved TIKA-962.
Resolution: Fixed

This has been fixed, but I didn't resolve for 1.3 as I thought it might be worthy of a fix release.

Backwards Compatibility for Metadata.LAST_AUTHOR is Broken
----------------------------------------------------------
Key: TIKA-962
URL: https://issues.apache.org/jira/browse/TIKA-962
Project: Tika
Issue Type: Bug
Components: metadata
Affects Versions: 1.2
Reporter: Ray Gauss II
Assignee: Ray Gauss II
Priority: Critical
Fix For: 1.3

As a result of changes in TIKA-930, support for the deprecated Metadata.LAST_AUTHOR property has been dropped. The new TikaCoreProperties.MODIFIED should be a composite property containing Metadata.LAST_AUTHOR. Should we consider a fix release for this?
[jira] [Resolved] (TIKA-963) Backwards Compatibility for Metadata.DATE is Incorrect
[ https://issues.apache.org/jira/browse/TIKA-963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II resolved TIKA-963.
Resolution: Fixed

This has been fixed, but I didn't resolve for 1.3 as I thought it might be worthy of a fix release.

Backwards Compatibility for Metadata.DATE is Incorrect
------------------------------------------------------
Key: TIKA-963
URL: https://issues.apache.org/jira/browse/TIKA-963
Project: Tika
Issue Type: Bug
Components: metadata
Affects Versions: 1.2
Reporter: Ray Gauss II
Assignee: Ray Gauss II
Priority: Critical
Fix For: 1.3

Metadata.DATE was always somewhat ambiguous, but during the consolidation in TIKA-930 it was incorrectly assumed that most parsers used it as a creation date. Metadata.DATE needs to instead be part of the TikaCoreProperties.MODIFIED composite property.
Re: [DISCUSS] Release Candidate for 1.3?
The code for TIKA-775 [1] is on trunk, but the issue was re-opened with some concerns. Some of those have been addressed and some are still open discussions, though I think they are minor enough that we could create separate issues if need be and resolve TIKA-775 as fixed.

[1] https://issues.apache.org/jira/browse/TIKA-775

On Jan 8, 2013, at 4:56 PM, Dave Meikle loo...@gmail.com wrote:

Hi All,

We have got some new features and bugs fixed, with a couple of outstanding binary compatibility ones (TIKA-962, TIKA-963) fixed on trunk, so I was wondering if it was time for a 1.3 release? Also, happy to do the Release Management for it.

Cheers,
Dave
[jira] [Resolved] (TIKA-895) Empty title element makes Tika-generated HTML documents not open
[ https://issues.apache.org/jira/browse/TIKA-895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II resolved TIKA-895.
Resolution: Duplicate
Assignee: Ray Gauss II

Empty title element makes Tika-generated HTML documents not open
----------------------------------------------------------------
Key: TIKA-895
URL: https://issues.apache.org/jira/browse/TIKA-895
Project: Tika
Issue Type: Bug
Components: metadata
Affects Versions: 1.1
Environment: Windows 7
Reporter: Benoit MAGGI
Assignee: Ray Gauss II
Priority: Trivial
Labels: newbie

I try to transform an empty docx to an html file, e.g.:

{noformat}
java -jar tika-app-1.1.jar -x example.docx t.html
{noformat}

The html file can't be opened with Firefox, Internet Explorer, or Chrome. The main point is that {{<title/>}} seems to be forbidden by the HTML specification (can't get the point on HTML5):

bq. http://www.w3.org/TR/html401/struct/global.html#h-7.4.2
bq. 7.4.2 The TITLE element
bq. {{<!-- The TITLE element is not considered part of the flow of text. It should be displayed, for example as the page header or window title. Exactly one title is required per document. -->}}
bq. {{<!ELEMENT TITLE - - (#PCDATA) -(%head.misc;) -- document title -->}} (see http://www.w3.org/TR/html401/sgml/dtd.html#head.misc)
bq. {{<!ATTLIST TITLE %i18n>}} (see http://www.w3.org/TR/html401/sgml/dtd.html#i18n)
bq. *Start tag: required, End tag: required*

For information, there was the same bug with xls: https://issues.apache.org/jira/browse/TIKA-725

The simple solution should be to provide an empty title by default.
[jira] [Reopened] (TIKA-725) Empty title element makes Tika-generated HTML documents not open in Chromium
[ https://issues.apache.org/jira/browse/TIKA-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II reopened TIKA-725:
Assignee: Ray Gauss II (was: Jukka Zitting)

Confirmed that the problem remains when a {{TransformerHandler}} is used, such as those obtained from {{SAXTransformerFactory}} in {{TikaCLI}} and {{TikaGUI}}. I've investigated and developed a workaround.

Empty title element makes Tika-generated HTML documents not open in Chromium
----------------------------------------------------------------------------
Key: TIKA-725
URL: https://issues.apache.org/jira/browse/TIKA-725
Project: Tika
Issue Type: Bug
Components: general
Affects Versions: 0.9
Environment: Chromium 12 on Ubuntu Linux
Reporter: Henri Bergius
Assignee: Ray Gauss II
Priority: Minor
Labels: html
Fix For: 0.10

Currently when converting Excel sheets (both XLS and XLSX), Tika generates an empty title element as {{<title/>}} in the document HEAD section. This causes Chromium not to display the document contents. Switching it to {{<title></title>}} fixes this.
[jira] [Resolved] (TIKA-914) Invalid self-closing title tag when parsing an RTF file
[ https://issues.apache.org/jira/browse/TIKA-914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II resolved TIKA-914.
Resolution: Duplicate
Assignee: Ray Gauss II

Invalid self-closing title tag when parsing an RTF file
-------------------------------------------------------
Key: TIKA-914
URL: https://issues.apache.org/jira/browse/TIKA-914
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.1
Environment: Reproduced on Linux and Windows
Reporter: Nicolas Guillaumin
Assignee: Ray Gauss II
Priority: Minor
Labels: rtf
Attachments: test.rtf

When parsing an RTF file with an empty TITLE metadata, the resulting HTML contains a self-closing title tag:

{code}
$ java -jar tika-app-1.1.jar -h test.rtf
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Content-Length" content="830468"/>
<meta name="Content-Type" content="application/rtf"/>
<meta name="resourceName" content="test.rtf"/>
<title/>
</head>
[...]
{code}

I believe self-closing tags are not valid in XHTML, according to http://www.w3.org/TR/xhtml1/#C_3 (however, there's no XHTML doctype generated here, just a namespace...). Anyway, this causes some browsers like Chrome to fail parsing the HTML, resulting in a blank page being displayed. The expected output would be a non-self-closing empty tag: {{<title></title>}}
[jira] [Resolved] (TIKA-725) Empty title element makes Tika-generated HTML documents not open in Chromium
[ https://issues.apache.org/jira/browse/TIKA-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II resolved TIKA-725.
Resolution: Fixed
Fix Version/s: 1.3

When a {{TransformerHandler}} is used, the actual writing of the final elements is delegated to an XML serializer such as {{ToHTMLStream}}, which extends {{ToStream}}. When {{ToStream.characters}} is called with zero length it returns immediately and does not close the start tag of the current element, and {{ToStream.endElement}} checks whether the start tag is open to determine whether to close the element as {{<title/>}} or {{<title></title>}}.

It seems the code brought over from the Xalan project to the JDK was locked down quite a bit during the transition. When using Xalan directly, an alternate XML serializer can be specified via XSLT or other means [1], but in the JDK that functionality seems to have been removed, as {{TransletOutputHandlerFactory.getSerializationHandler}} has ToHTMLStream hard-coded. Additionally, ToHTMLStream is declared final, and the majority of the classes one would normally extend to use a different {{TransletOutputHandlerFactory}} are internal, so a proper solution would likely involve depending on Xalan directly or duplicating a whole lot of code, neither of which is ideal.

As a workaround, an {{ExpandedTitleContentHandler}} content handler decorator was added which checks for the previous fix for this issue, a call to {{characters(new char[0], 0, 0)}} for the title element, and if present changes the length to 1 and then catches the expected {{ArrayIndexOutOfBoundsException}} thrown by {{ToStream.characters}}. The result is that the title start tag is closed, since the check for zero length passes, and no character writing is attempted.

{{TikaCLI}} was modified to wrap the transformer handler returned by {{SAXTransformerFactory}} for the {{html}} output method, so only handling of the {{title}} tag for HTML output will be affected by the change.
In the event that this approach has adverse effects for those using XML serializers other than those present in the JDK, the change to {{TikaCLI}} can be reverted or made an option. Those calling Tika programmatically will need to wrap their transformer handlers in an {{ExpandedTitleContentHandler}} as well, i.e.:

{code}
SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, indent);
handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, encoding);
handler.setResult(new StreamResult(output));
return new ExpandedTitleContentHandler(handler);
{code}

Resolved in r1423538.

[1] http://xml.apache.org/xalan-j/usagepatterns.html

Empty title element makes Tika-generated HTML documents not open in Chromium
----------------------------------------------------------------------------
Key: TIKA-725
URL: https://issues.apache.org/jira/browse/TIKA-725
Project: Tika
Issue Type: Bug
Components: general
Affects Versions: 0.9
Environment: Chromium 12 on Ubuntu Linux
Reporter: Henri Bergius
Assignee: Ray Gauss II
Priority: Minor
Labels: html
Fix For: 1.3, 0.10

Currently when converting Excel sheets (both XLS and XLSX), Tika generates an empty title element as {{<title/>}} in the document HEAD section. This causes Chromium not to display the document contents. Switching it to {{<title></title>}} fixes this.