Re: renaming master?

2020-06-17 Thread Ray Gauss II
Hi all,

Apologies for not being able to be very involved over the past few years, but 
still trying to follow along and hoping to get time to contribute in the future.

Another option might be ‘stable’?

- Ray

> On Jun 16, 2020, at 1:31 PM, Tim Allison  wrote:
> 
> All,
> 
>  As you may have seen, there's a movement to rename the "master" branch to
> "main" or "trunk" (at least in the U.S.)[1][2].  Github is doing this, and
> I personally think this makes sense.
> 
>  Are there any objections if we change "master"?  If we do change it, is
> there a preference for "main", "trunk" or something else?
> 
>  My personal preference would be for trunk, but I'm open.
> 
> Best,
> 
> Tim
> 
> [1]
> https://www.zdnet.com/article/github-to-replace-master-with-alternative-term-to-avoid-slavery-references/
> [2] https://www.bbc.com/news/technology-53050955


[jira] [Commented] (TIKA-2056) Installing exiftool causes ForkParserIntegration test errors

2016-08-25 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15436705#comment-15436705
 ] 

Ray Gauss II commented on TIKA-2056:


My guess is that when Exiftool is available on the command line, the existing 
[external parser is 
enabled|https://github.com/apache/tika/blob/master/tika-core/src/main/resources/org/apache/tika/parser/external/tika-external-parsers.xml]
 as part of the {{CompositeExternalParser}}, which gets included in the 
{{AutoDetectParser}}, and something in that chain is failing serialization.

Perhaps because 
[ExternalParser.LineConsumer|https://github.com/apache/tika/blob/master/tika-core/src/main/java/org/apache/tika/parser/external/ExternalParser.java#L59]
 is not Serializable?
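The failure mode is easy to reproduce in isolation. Below is a minimal, self-contained sketch (the classes are hypothetical stand-ins, not Tika's actual LineConsumer or parser chain) showing how one non-Serializable member anywhere in the object graph makes {{ObjectOutputStream}} throw, which matches the stack trace in the quoted report:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class SerializationDemo {
    // Stand-in for something like ExternalParser.LineConsumer: note, no Serializable.
    static class LineConsumerLike {}

    // Stand-in for a parser in the AutoDetectParser chain: declares Serializable,
    // but holds a field whose type is not, which poisons the whole object graph.
    static class ParserLike implements Serializable {
        LineConsumerLike consumer = new LineConsumerLike();
    }

    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false; // the same failure the ForkParser test reports
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(canSerialize(new ParserLike())); // prints false
    }
}
```

Marking such a field {{transient}}, or making the offending class implement {{Serializable}}, would let the write succeed.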

> Installing exiftool causes ForkParserIntegration test errors
> 
>
> Key: TIKA-2056
> URL: https://issues.apache.org/jira/browse/TIKA-2056
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.14
>Reporter: Chris A. Mattmann
>
> [~rgauss] maybe you can help me with this. For some reason when I was trying 
> your PR, I got all sorts of weird errors that I thought had to do with your 
> PR, but in fact, had to do with Fork Parser Integration test. [~kkrugler] 
> I've seen you've contributed to the Fork parser tests so tagging you on this 
> too. Any reason you guys can think of that exiftool causes the Fork parser 
> integration tests to fail?
> Here's the log msg (that I thought was due to the Sentiment parser, but is in 
> fact not!):
> {noformat}
> [INFO] Changes detected - recompiling the module!
> [INFO] Compiling 124 source files to 
> /Users/mattmann/tmp/tika1.14/tika-parsers/target/test-classes
> [INFO] 
> /Users/mattmann/tmp/tika1.14/tika-parsers/src/test/java/org/apache/tika/parser/odf/ODFParserTest.java:
>  Some input files use or override a deprecated API.
> [INFO] 
> /Users/mattmann/tmp/tika1.14/tika-parsers/src/test/java/org/apache/tika/parser/odf/ODFParserTest.java:
>  Recompile with -Xlint:deprecation for details.
> [INFO] 
> [INFO] --- maven-surefire-plugin:2.18.1:test (default-test) @ tika-parsers ---
> [INFO] Surefire report directory: 
> /Users/mattmann/tmp/tika1.14/tika-parsers/target/surefire-reports
> ---
>  T E S T S
> ---
> Running org.apache.tika.parser.fork.ForkParserIntegrationTest
> Tests run: 5, Failures: 1, Errors: 3, Skipped: 0, Time elapsed: 2.46 sec <<< 
> FAILURE! - in org.apache.tika.parser.fork.ForkParserIntegrationTest
> testForkedTextParsing(org.apache.tika.parser.fork.ForkParserIntegrationTest)  
> Time elapsed: 0.185 sec  <<< ERROR!
> org.apache.tika.exception.TikaException: Unable to serialize AutoDetectParser 
> to pass to the Forked Parser
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> at java.util.ArrayList.writeObject(ArrayList.java:762)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
> at java.util.ArrayList.writeObject(ArrayList.java:762)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method

[jira] [Commented] (TIKA-774) ExifTool Parser

2016-03-23 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209162#comment-15209162
 ] 

Ray Gauss II commented on TIKA-774:
---

bq. we should add a static check for whether exiftool is available and adjust 
"handled" mimes at that point.

I think we'll find other areas to improve as well; I just wanted to get the 
ball rolling again on the contribution and review, as we had to close the 
source on the stand-alone project mentioned above.

bq. I should have a chance to look more closely early next week, but I doubt 
there's reason to wait for my feedback.

We'd value your feedback, and since it's been over 4 years, we can wait a few 
more weeks. :)

bq. Is this a replacement for the one I hacked together?

There's the possibility for the two to coexist, perhaps requiring this parser 
to be explicitly called programmatically.

At a high level the biggest differences are:
# As mentioned in TIKA-1639, there's an extensive mapping from ExifTool's 
namespace to proper Tika properties (currently done programmatically)
# It includes the ability to embed, i.e. write metadata back into binary 
files. (TIKA-776)

> ExifTool Parser
> ---
>
> Key: TIKA-774
> URL: https://issues.apache.org/jira/browse/TIKA-774
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.0
> Environment: Requires exiftool be installed 
> (http://www.sno.phy.queensu.ca/~phil/exiftool/)
>Reporter: Ray Gauss II
>  Labels: features, new-parser, newbie, patch
> Fix For: 1.13
>
> Attachments: testJPEG_IPTC_EXT.jpg, 
> tika-core-exiftool-parser-patch.txt, tika-parsers-exiftool-parser-patch.txt
>
>
> Adds an external parser that calls ExifTool to extract extended metadata 
> fields from images and other content types.
> In the core project:
> An ExifTool interface is added which contains Property objects that define 
> the metadata fields available.
> An additional Property constructor for internalTextBag type.
> In the parsers project:
> An ExiftoolMetadataExtractor is added which does the work of calling ExifTool 
> on the command line and mapping the response to tika metadata fields.  This 
> extractor could be called instead of or in addition to the existing 
> ImageMetadataExtractor and JempboxExtractor under TiffParser and/or 
> JpegParser but those have not been changed at this time.
> An ExiftoolParser is added which calls only the ExiftoolMetadataExtractor.
> An ExiftoolTikaMapper is added which is responsible for mapping the ExifTool 
> metadata fields to existing tika and Drew Noakes metadata fields if enabled.
> An ElementRdfBagMetadataHandler is added for extracting multi-valued RDF Bag 
> implementations in XML files.
> An ExifToolParserTest is added which tests several expected XMP and IPTC 
> metadata values in testJPEG_IPTC_EXT.jpg.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1906) ExternalParser No Longer Supports Commands in Array Format

2016-03-23 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II updated TIKA-1906:
---
Fix Version/s: 1.13, 2.0

> ExternalParser No Longer Supports Commands in Array Format
> --
>
> Key: TIKA-1906
> URL: https://issues.apache.org/jira/browse/TIKA-1906
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.9
>    Reporter: Ray Gauss II
>    Assignee: Ray Gauss II
> Fix For: 2.0, 1.13
>
>
> After the changes in TIKA-1638 the ExternalParser now ignores commands 
> specified as a string array and assumes commands will be in a single string 
> with a space delimiter.
> Both formats should be supported.
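For illustration, here is a sketch of accepting both formats (the helper name is made up; this is not the actual Tika fix):

```java
import java.util.Arrays;

public class CommandFormats {
    // Sketch of supporting both formats: a pre-split array is returned as-is,
    // while a single space-delimited command string is split into tokens.
    static String[] normalize(String... command) {
        if (command.length == 1 && command[0].trim().contains(" ")) {
            return command[0].trim().split("\\s+");
        }
        return command;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(normalize("exiftool -a file.jpg")));
        System.out.println(Arrays.toString(normalize("exiftool", "-a", "file.jpg")));
        // both print [exiftool, -a, file.jpg]
    }
}
```

Note that naive whitespace splitting breaks arguments containing spaces, which is one reason the pre-split array form is worth keeping.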





[jira] [Resolved] (TIKA-1906) ExternalParser No Longer Supports Commands in Array Format

2016-03-23 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-1906.

Resolution: Fixed






[jira] [Comment Edited] (TIKA-1906) ExternalParser No Longer Supports Commands in Array Format

2016-03-22 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15206138#comment-15206138
 ] 

Ray Gauss II edited comment on TIKA-1906 at 3/22/16 2:37 PM:
-

bq. agreed, sorry must have missed that as I thought I fixed it for both per 
TIKA-1638.

No worries.

I guess I'll leave this open until the tika-2.x build is happy again.


was (Author: rgauss):
bq. agreed, sorry must have missed that as I thought I fixed it for both per 
TIKA-1638.

No worries.

I guess I'll leave this open until the tika-2.x is happy again.






[jira] [Commented] (TIKA-1906) ExternalParser No Longer Supports Commands in Array Format

2016-03-22 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15206138#comment-15206138
 ] 

Ray Gauss II commented on TIKA-1906:


bq. agreed, sorry must have missed that as I thought I fixed it for both per 
TIKA-1638.

No worries.

I guess I'll leave this open until the tika-2.x is happy again.






[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-03-15 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15196030#comment-15196030
 ] 

Ray Gauss II commented on TIKA-1607:


bq. It might be more easily configurable to use the ParsingEmbeddedDocExtractor 
as is and let users write their own XMP parsers, no?

Yes, and we could do that in addition to the above, but if I'm understanding 
correctly, that alone would still force users to write 'Tika-based' XMP 
parsers rather than giving them access to the raw XMP-encoded bytes you refer 
to in your last sentence, which I agree might be helpful in some cases.

So the idea for the second part would be to get the user those bytes in a way 
that hopefully doesn't require sweeping changes to the parsers (I'm thinking of 
this with an eye towards all types of embedded resources, not just XMP).

The {{EmbeddedDocumentExtractor}} interface's {{parseEmbedded}} method 
currently takes a {{Metadata}} object which is only associated with the 
embedded resource (not the same metadata object associated with the 'container' 
file) and is populated with the embedded resource's filename, type, size, etc.

Option 1. We might be able to do something like:
{code}
/**
 * Extension of {@link EmbeddedDocumentExtractor} which stores the embedded
 * resources during parsing for retrieval.
 */
public interface StoringEmbeddedDocumentExtractor extends EmbeddedDocumentExtractor {

    /**
     * Gets the map of known embedded resources, or null if no resources
     * were stored during parsing.
     *
     * @return the embedded resources
     */
    Map<Metadata, byte[]> getEmbeddedResources();

}
{code}

then modify ParsingEmbeddedDocumentExtractor to implement it with an option 
which 'turns it on'?

Option 2. Provide a separate implementation of StoringEmbeddedDocumentExtractor 
that users could set in the context?

Option 3. Just pull {{FileEmbeddedDocumentExtractor}} out of {{TikaCLI}} and 
make them use temp files?

Option 4. Maybe the effort is better spent on said sweeping parser changes to 
include some {{EmbeddedResources}} object to be optionally populated along with 
the {{Metadata}} in the {{Parser.parse}} method?

Other options?  Maybe they don't need the RAW XMP?

I'm also aware that we've strayed a bit from the original issue here of 
structured metadata.  Should we create a separate issue?
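To make Option 2 concrete, here is a self-contained sketch (all names are hypothetical; Tika's real {{EmbeddedDocumentExtractor}} has a different signature) of an implementation that buffers embedded resources in memory for later retrieval:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class StoringExtractorSketch {
    // Hypothetical stand-in for Tika's Metadata object: just a resource name here.
    static final class ResourceMeta {
        final String name;
        ResourceMeta(String name) { this.name = name; }
    }

    // In-memory extractor in the spirit of Option 2: each parseEmbedded call
    // buffers the resource, and the caller retrieves everything afterwards.
    static class InMemoryExtractor {
        private final Map<ResourceMeta, byte[]> resources = new LinkedHashMap<>();

        public void parseEmbedded(InputStream stream, ResourceMeta metadata) {
            try {
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                stream.transferTo(buffer); // whole resource held in memory: the risk noted above
                resources.put(metadata, buffer.toByteArray());
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        public Map<ResourceMeta, byte[]> getEmbeddedResources() {
            return resources;
        }
    }

    public static void main(String[] args) {
        InMemoryExtractor extractor = new InMemoryExtractor();
        extractor.parseEmbedded(new ByteArrayInputStream("<xmp/>".getBytes()),
                new ResourceMeta("packet-1.xmp"));
        System.out.println(extractor.getEmbeddedResources().size()); // prints 1
    }
}
```

An advanced user would set such an extractor in the ParseContext before parsing, then read the map back afterwards, accepting the memory cost.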

> Introduce new arbitrary object key/values data structure for persistence of 
> Tika Metadata
> -
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
>  Issue Type: Improvement
>  Components: core, metadata
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: 1.13
>
> Attachments: TIKA-1607_bytes_dom_values.patch, 
> TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, 
> TIKA-1607v3.patch
>
>
> I am currently working on implementing more comprehensive extraction and 
> enhancement of the Tika support for phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection<HashMap>, e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) 
> (etc)] 
> {code}
> There are obvious backwards-compatibility issues with this approach... 
> additionally, it is a fundamental change to the core Metadata API. I hope 
> that the <String, Object> mapping, however, is flexible enough to allow me 
> to model Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis





[jira] [Comment Edited] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-03-15 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193845#comment-15193845
 ] 

Ray Gauss II edited comment on TIKA-1607 at 3/15/16 1:57 PM:
-

Have we already considered treating the XMP packets more like embedded 
resources and making it easier for the advanced users described above to get at 
those resources, perhaps providing an {{EmbeddedDocumentExtractor}} 
implementation they could use without resorting to extracting them to files?


was (Author: rgauss):
Have we already considered treating the XMP packets more like embedded 
resources and making it easier for the advanced users described above to get at 
those resources, perhaps providing an {{EmbeddedResourceHandler}} 
implementation they could use without resorting to extracting them to files?






[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-03-15 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195326#comment-15195326
 ] 

Ray Gauss II commented on TIKA-1607:


Sorry, I meant {{EmbeddedDocumentExtractor}} (edited comment).

We can currently dump stuff to files in some parsers with the {{--extract}} CLI 
option which sticks a {{FileEmbeddedDocumentExtractor}} in the context.

The current default for PDF is the {{ParsingEmbeddedDocumentExtractor}}.

Perhaps we could add an option to ParsingEmbeddedDocumentExtractor which, when 
enabled, would also save the embedded resources in memory for an advanced user 
to do whatever they need, knowing the risk and resources required for that 
option?

Or provide some other in-memory implementation that advanced users could 
explicitly set in the context?






[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-03-14 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193845#comment-15193845
 ] 

Ray Gauss II commented on TIKA-1607:


Have we already considered treating the XMP packets more like embedded 
resources and making it easier for the advanced users described above to get at 
those resources, perhaps providing an {{EmbeddedResourceHandler}} 
implementation they could use without resorting to extracting them to files?






[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-02-25 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167135#comment-15167135
 ] 

Ray Gauss II commented on TIKA-1607:


I know there can be multiple XMP packets in a single file, but do we have many 
other examples where we'd need multiple DOMs associated with a single file?

I'm trying to understand if the metadata is really the right place for this.






[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-02-19 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15154205#comment-15154205
 ] 

Ray Gauss II commented on TIKA-1607:


In my experience people gravitate towards 'other' buckets, i.e.: "I didn't know 
(bother to read) what the designated ones were so I just used 'other'".

{{getBytes}} feels like 'other'.

While people could still do really stupid things with {{getDOM}} if they wanted 
to, {{getBytes}} seems to encourage a developer to go ahead and try to use each 
frame of a 120fps 8K video as a 'metadata' value.  An extreme and unlikely 
example of course, but you get the gist.






[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2016-02-16 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149231#comment-15149231
 ] 

Ray Gauss II commented on TIKA-1607:


Are we opening a can of worms by encouraging the use of a byte array directly 
with no restrictions on length, etc.?




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules

2016-02-03 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130386#comment-15130386
 ] 

Ray Gauss II commented on TIKA-1824:


bq. Thank you, Bob Paulin! Again, this is fantastic.

Indeed, thanks!

bq. Perhaps add "parser(s?) to the artifactId, e.g. tika-parser-cad-module

Now that the change is in there it seems a bit redundant to have both "parser" 
and "module" in every artifact ID.  {{tika-parser-*}} follows least-to-most-specific 
precedence, so perhaps we could just drop "module"?

I had some concerns over the apparent duplication of dependencies / versions 
but it looks like that will be addressed in TIKA-1847.

> Tika 2.0 -  Create Initial Parser Modules
> -
>
> Key: TIKA-1824
> URL: https://issues.apache.org/jira/browse/TIKA-1824
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 2.0
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>
> Create initial break down of parser modules.





[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-09-15 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746719#comment-14746719
 ] 

Ray Gauss II commented on TIKA-1607:


Hi [~talli...@mitre.org], apologies for the delay on responding here.

1. POJOs
bq. We might have better documentation of POJOs and compile-time guarantees 
about methods and typed values.

Agreed, but the DOM persistence doesn't preclude us from also using Java 
'helper' classes that know how to more easily get and set values for particular 
schemas that we'd like to focus on.

bq. Schemas/xsds can enforce plenty, I know, but would we want to build an xsd 
and maintain it?

I'd vote for sticking as true to a specification's original schema as possible 
when there is one but whether we'd want to build and maintain for those that 
don't is a good question.

2. Passthrough
bq. why couldn't we literally pass that through via the String version of the 
xml?

I think we could, but we'd first have to 'merge' with the metadata being 
modeled by the parsers and could then allow access to the full DOM {{Document}} 
object which clients could easily serialize to a string if need be.

3. Serialization to JSON
There seem to be several libraries available that can help with XML to JSON, 
though I don't think this would belong in core.

4. Multilingual fields
Great question.  XMP uses RDF and xml:lang:
{noformat}
<rdf:Alt>
  <rdf:li xml:lang="x-default">quick brown fox</rdf:li>
  <rdf:li xml:lang="it">rapido fox marrone</rdf:li>
</rdf:Alt>
{noformat}
that's one possibility.

bq. I'm wondering if we want to add structure only where structured data 
doesn't exist within the document and let the client parse what they'd like out 
of structured metadata that is in the document?

This also relates to passthrough above but one thing to keep in mind is that 
the metadata we're parsing could be coming from several different parts of the 
binary.  For example, EXIF doesn't necessarily also live in XMP (though most 
apps also write it there these days) and there can be more than one XMP packet 
present in a file.  It would be nice to bring these different sources into a 
unified persistence structure, even if for simpler metadata everything lives at 
the top level.

bq. how do we transfer as much normalized/structured metadata as possible in as 
simple a way to the end user.

This also gets back to passthrough and the possibility of access to the full 
DOM {{Document}} object.

Thanks for keeping the discussion going.  We obviously need to take great care 
in changing such a fundamental area of the code.



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-08-21 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706706#comment-14706706
 ] 

Ray Gauss II commented on TIKA-1607:


Yes, by shoehorn I meant that the index is embedded in the key (in this case 
sub-group name) and that all parsers and consuming client apps must know to 
utilize that syntax rather than either a separate, explicit index field or a 
well defined structure like that of the DOM approach.

Perhaps we should flesh out a solid requirements list (possibly using the 
[comment 
above|https://issues.apache.org/jira/browse/TIKA-1607?focusedCommentId=14660441page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14660441]
 as a starting point).



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-08-20 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704880#comment-14704880
 ] 

Ray Gauss II commented on TIKA-1607:


I did see that, but I was after full URI namespaces, i.e. 
{{http://purl.org/dc/elements/1.1/}}, not just prefixes.

The OODT approach looks like you'd have to shoehorn the index into the group 
name, much like the tika-ffmpeg workaround, rather than a more strictly defined 
structure.

OODT might support deeper structures in the inner {{Group}} class, but the 
public methods appear to only support a single level?  For example, how could 
one get to something like the value of the city of the 3rd contact's 2nd 
address, i.e. {{p1:contact[2]/p1:address[1]/p1:city}}?

We could mimic XPath syntax but the DOM approach allows us to use 
{{javax.xml.xpath.XPath}} processing.  From the [test mentioned 
above|https://github.com/rgauss/tika/blob/trunk/tika-core/src/test/java/org/apache/tika/metadata/TestMetadata.java#L394]:
{code:java}
String expression = "/tika:metadata/vcard:tel[1]/vcard:uri";
assertEquals(telUri, metadata.getValueByXPath(expression));
{code}

The DOM approach would also allow us to leverage things like attributes to 
further describe a particular metadata value in the future if need be.

We might also be able to pass through entire metadata structures that Tika 
hasn't explicitly modeled.

It's certainly a larger change, but I think it gives us a lot more options.



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-08-19 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14703924#comment-14703924
 ] 

Ray Gauss II commented on TIKA-1607:


I've put together the start of the DOM metadata store option on [GitHub as 
well|https://github.com/apache/tika/compare/trunk...rgauss:trunk].

The crux of the change is using a {{org.w3c.dom.Document}} object instead of a 
{{Map<String, String[]>}} as the metadata store and Property objects based on 
{{QName}}s instead of Strings.

A few things to note:
* This does bring in commons-lang for XML escaping, we could change if need be
* It seems mostly backwards compatible. tika-xmp is failing at the moment, but 
I think it's just a matter of applying the same techniques there
* String-based accessors weren't deprecated, but could be if targeting Tika 2.0
* There are several TODOs that would still need to be addressed

The [test 
added|https://github.com/rgauss/tika/blob/trunk/tika-core/src/test/java/org/apache/tika/metadata/TestMetadata.java#L394]
 demonstrates creating a DOM structure, adding it to the metadata, then pulling 
it out both programmatically and via XPath expression (sticking to the 
telephone number example).

That programmatic creation of the DOM structure is a bit cumbersome and we 
could certainly employ Java classes specific to each standard as a convenience 
(somewhat similar to [~talli...@mitre.org]'s proposal), but I do like the 
generic nature of the DOM store.

The {{toString}} method of the metadata object after building that example is 
properly structured and namespaced XML:
{code:xml}
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<tika:metadata xmlns:tika="http://tika.apache.org/">
  <vcard:tel xmlns:vcard="urn:ietf:params:xml:ns:vcard-4.0">
    <vcard:parameters>
      <vcard:type>
        <vcard:text>work</vcard:text>
      </vcard:type>
    </vcard:parameters>
    <vcard:uri>tel:+1-800-555-1234</vcard:uri>
  </vcard:tel>
</tika:metadata>
{code}
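As an aside, XML of this shape can be produced with plain JDK DOM calls ({{createElementNS}} plus a {{Transformer}} for serialization). The sketch below is a simplified standalone illustration, not the code in the fork; the class name and {{build()}} helper are hypothetical:

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class DomStoreDemo {
    static final String TIKA = "http://tika.apache.org/";
    static final String VCARD = "urn:ietf:params:xml:ns:vcard-4.0";

    // Build a namespaced vcard:tel structure programmatically, then serialize it.
    static String build() throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder().newDocument();

        Element root = doc.createElementNS(TIKA, "tika:metadata");
        doc.appendChild(root);

        Element tel = doc.createElementNS(VCARD, "vcard:tel");
        root.appendChild(tel);

        Element uri = doc.createElementNS(VCARD, "vcard:uri");
        uri.setTextContent("tel:+1-800-555-1234");
        tel.appendChild(uri);

        // An identity transform emits the xmlns declarations for us.
        StringWriter out = new StringWriter();
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(build());
    }
}
```

This is the "cumbersome" programmatic creation mentioned below; schema-specific helper classes could hide this boilerplate from parser authors.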

There's obviously lots of room for improvement and discussion but I wanted to 
put it out there before the momentum on this slows.



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-08-19 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704108#comment-14704108
 ] 

Ray Gauss II commented on TIKA-1607:


[~chrismattmann], I did.

It seemed more similar to the XPath-like workaround I described with the notion 
of groups in the store, rather than the full-fledged DOM store proposed in the 
GitHub fork, i.e. I didn't see where anything was namespaced.



[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata

2015-08-06 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14660441#comment-14660441
 ] 

Ray Gauss II commented on TIKA-1607:


To clarify, the work mentioned above that uses an XPath-like syntax is only a 
workaround for mapping structured metadata into the current 'flat' metadata 
model in Tika.

I fully support moving towards a structured metadata store in a 2.0 timeframe. 
(maybe that's now?)

This is simply restating some of what's already been said, but there are many 
aspects to consider during that refactoring:
* Moving towards properly namespacing metadata (even if, for now, our 
serialization of it only contains a prefix)
* Backwards compatibility for simple string key/values
* Enabling easy serialization to XML and JSON
* Enabling easy discovery of at least top level elements
* Lightweight dependencies in tika-core
* Possible representation of binary data
* Not re-inventing the wheel

Given the above, perhaps we'd want to consider using Java DOM 
({{org.w3c.dom.*}}) classes programmatically as a metadata store, appending and 
getting child nodes, etc. rather than hard coding POJOs for each metadata 
standard we want to support.

I'll try to find some time to put together an example patch for that approach 
in the next few days.



[jira] [Commented] (TIKA-1607) Introduce new HashMap<String, Object> data structure for persistence of Tika Metadata

2015-04-21 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14505054#comment-14505054
 ] 

Ray Gauss II commented on TIKA-1607:


We've had a few discussions on structured metadata over the years, some of 
which was captured in the [MetadataRoadmap Wiki 
page|http://wiki.apache.org/tika/MetadataRoadmap].

I'd agree that we should strive to maintain backwards compatibility for simple 
values.

I think we should also consider serialization of the metadata store, not just 
in the {{Serializable}} interface sense, but perhaps being able to easily 
marshal the entire metadata store into JSON and XML.

As [~gagravarr] points out, work has been done to express structured metadata 
via the existing metadata store.  In that email thread you'll find reference to 
the external [tika-ffmpeg project|https://github.com/AlfrescoLabs/tika-ffmpeg].



[jira] [Commented] (TIKA-1594) Webp parsing support

2015-04-07 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14484463#comment-14484463
 ] 

Ray Gauss II commented on TIKA-1594:


I'd recommend that for now we trim since {{Metadata.IMAGE_*}} properties are 
defined as {{Property.internalInteger}}.

In the future I think we should consider changing to (or perhaps adding) more 
generally useful dimension properties, like {{Dimensions}} from the [additional 
properties of 
XMP|http://www.adobe.com/content/dam/Adobe/en/devnet/xmp/pdfs/XMPSpecificationPart2.pdf]
 (section 1.2.2.2) which includes a {{unit}} field.

 Webp parsing support
 

 Key: TIKA-1594
 URL: https://issues.apache.org/jira/browse/TIKA-1594
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.7
Reporter: Jan Kronquist

 webp content type is correctly detected, but parsing is not supported. 
 I noticed that metadata-extractor 2.8.0 supports webp:
 https://github.com/drewnoakes/metadata-extractor/issues/85
 However, Tika does currently not work with this version (I tried manually 
 overriding the dependency). 





[jira] [Commented] (TIKA-634) Command Line Parser for Metadata Extraction

2015-03-01 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342547#comment-14342547
 ] 

Ray Gauss II commented on TIKA-634:
---

Also see the [tika-ffmpeg project|https://github.com/AlfrescoLabs/tika-ffmpeg].

There we recently had to patch {{ExternalParser}} for some stream parsing 
concurrency problems which should be raised in a separate issue here shortly.

 Command Line Parser for Metadata Extraction
 ---

 Key: TIKA-634
 URL: https://issues.apache.org/jira/browse/TIKA-634
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 0.9
Reporter: Nick Burch
Assignee: Nick Burch
Priority: Minor

 As discussed on the mailing list:
 http://mail-archives.apache.org/mod_mbox/tika-dev/201104.mbox/%3calpine.deb.2.00.1104052028380.29...@urchin.earth.li%3E
 This issue is to track improvements in the ExternalParser support to handle 
 metadata extraction, and probably easier configuration of an external parser 
 too.





[jira] [Commented] (TIKA-1510) FFMpeg installed but not parsing video files

2015-01-12 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273520#comment-14273520
 ] 

Ray Gauss II commented on TIKA-1510:


Yes.

The only reason I haven't myself is that I've been trying to find some time to 
refactor the vorbis stuff per the previous 
[conversation|http://mail-archives.apache.org/mod_mbox/tika-dev/201408.mbox/%3calpine.deb.2.02.1408221155450.8...@urchin.earth.li%3E]
 with [~gagravarr].

 FFMpeg installed but not parsing video files
 

 Key: TIKA-1510
 URL: https://issues.apache.org/jira/browse/TIKA-1510
 Project: Tika
  Issue Type: Bug
  Components: parser
 Environment: FFMPEG, Mac OS X 10.9 with HomeBrew
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.7


 I have FFMPEG installed with homebrew:
 {noformat}
 # brew install ffmpeg
 {noformat}
 I've got some AVI files and have tried to parse them with Tika:
 {noformat}
 [chipotle:~/Desktop/drone-vids] mattmann% tika -m SPOT11_01\ 17.AVI
 Content-Length: 334917340
 Content-Type: video/x-msvideo
 X-Parsed-By: org.apache.tika.parser.EmptyParser
 resourceName: SPOT11_01 17.AVI
 {noformat}
 I took a look at the ExternalParser, which is configured for using ffmpeg if 
 it's installed. It seems it only works on:
 {code:xml}
 <mime-types>
   <mime-type>video/avi</mime-type>
   <mime-type>video/mpeg</mime-type>
 </mime-types>
 {code}
 I'll add video/x-msvideo and see if that fixes it. I also stumbled upon the 
 work by [~rgauss] at Github - Ray I noticed there is no parser in that work:
 https://github.com/AlfrescoLabs/tika-ffmpeg
 But there seems to be metadata extraction code, etc. Ray should I do 
 something with this?





[jira] [Commented] (TIKA-1510) FFMpeg installed but not parsing video files

2015-01-11 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273049#comment-14273049
 ] 

Ray Gauss II commented on TIKA-1510:


In that project there is a 
[{{TikaIntrinsicAVFfmpegParserFactory}}|https://github.com/AlfrescoLabs/tika-ffmpeg/blob/master/src/main/java/org/apache/tika/parser/ffmpeg/TikaIntrinsicAVFfmpegParserFactory.java]
 which is used to set up an {{ExternalParser}}.

See the 
[{{TikaIntrinsicAVFfmpegParserTest}}|https://github.com/AlfrescoLabs/tika-ffmpeg/blob/master/src/test/java/org/apache/tika/parser/ffmpeg/TikaIntrinsicAVFfmpegParserTest.java]
 for an example of its use.



[jira] [Commented] (TIKA-93) OCR support

2014-09-15 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14134822#comment-14134822
 ] 

Ray Gauss II commented on TIKA-93:
--

You could use 
[{{org.junit.Assume}}|http://stackoverflow.com/questions/1689242/conditionally-ignoring-tests-in-junit-4]
 so the tests will be skipped rather than reported as passing.

Perhaps we should consider the Maven Failsafe Plugin as well?
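For instance, with JUnit 4 an availability check can gate the test so it reports as skipped. A sketch only: the {{tesseractInstalled()}} helper and the {{tesseract -v}} probe are hypothetical, and JUnit 4 must be on the classpath:

```java
import static org.junit.Assume.assumeTrue;
import org.junit.Test;

public class OcrAvailabilityTest {

    // Hypothetical check; one option is to probe for the binary on the path.
    private static boolean tesseractInstalled() {
        try {
            Process p = new ProcessBuilder("tesseract", "-v").start();
            return p.waitFor() == 0;
        } catch (Exception e) {
            return false;
        }
    }

    @Test
    public void testOcrExtraction() throws Exception {
        // assumeTrue marks the test as skipped (not passed) when OCR is absent.
        assumeTrue("tesseract not on path", tesseractInstalled());
        // ... the real OCR assertions would follow here ...
    }
}
```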

 OCR support
 ---

 Key: TIKA-93
 URL: https://issues.apache.org/jira/browse/TIKA-93
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Jukka Zitting
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.7

 Attachments: Petr_tika-config.xml, TIKA-93.patch, TIKA-93.patch, 
 TIKA-93.patch, TIKA-93.patch, TesseractOCRParser.patch, 
 TesseractOCRParser.patch, TesseractOCR_Tyler.patch, 
 TesseractOCR_Tyler_v2.patch, TesseractOCR_Tyler_v3.patch, testOCR.docx, 
 testOCR.pdf, testOCR.pptx


 I don't know of any decent open source pure Java OCR libraries, but there are 
 command line OCR tools like Tesseract 
 (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to 
 extract text content (where available) from image files.





[jira] [Commented] (TIKA-93) OCR support

2014-08-19 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102175#comment-14102175
 ] 

Ray Gauss II commented on TIKA-93:
--

Can you create a config object and pass that in the {{ParseContext}}, similar 
to what 
[{{PDFParser}}|https://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java]
 does with a 
[{{PDFParserConfig}}|https://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java]
 entry?
{code}
//config from context, or default if not set via context
PDFParserConfig localConfig = context.get(PDFParserConfig.class, defaultConfig);
{code}
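The lookup pattern quoted above can be sketched in a self-contained form; `ParseContextSketch` below is a simplified stand-in for `org.apache.tika.parser.ParseContext` (the class name and internals here are illustrative, though the real class does offer a `get(Class, defaultValue)` method as shown in the quote):

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for org.apache.tika.parser.ParseContext, to show
// the "config from context, or default if not set" pattern quoted above.
public class ParseContextSketch {
    private final Map<String, Object> context = new HashMap<String, Object>();

    public <T> void set(Class<T> key, T value) {
        context.put(key.getName(), value);
    }

    // Returns the instance the caller set, or the supplied default when unset.
    public <T> T get(Class<T> key, T defaultValue) {
        Object value = context.get(key.getName());
        return value != null ? key.cast(value) : defaultValue;
    }

    public static void main(String[] args) {
        ParseContextSketch context = new ParseContextSketch();
        // No config in the context yet: the parser's default wins.
        System.out.println(context.get(String.class, "default-config"));
        // Caller supplies a config via the context: it overrides the default.
        context.set(String.class, "caller-config");
        System.out.println(context.get(String.class, "default-config"));
    }
}
```

A parser extended this way stays configurable without any new constructor arguments: callers who don't care never touch the context.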

 OCR support
 ---

 Key: TIKA-93
 URL: https://issues.apache.org/jira/browse/TIKA-93
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Jukka Zitting
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.7

 Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, 
 TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch, 
 TesseractOCR_Tyler.patch, TesseractOCR_Tyler_v2.patch, testOCR.docx, 
 testOCR.pdf, testOCR.pptx


 I don't know of any decent open source pure Java OCR libraries, but there are 
 command line OCR tools like Tesseract 
 (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to 
 extract text content (where available) from image files.





[jira] [Commented] (TIKA-93) OCR support

2014-08-19 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102193#comment-14102193
 ] 

Ray Gauss II commented on TIKA-93:
--

Apologies, jumped in late and only glanced at the comment thread.

 OCR support
 ---

 Key: TIKA-93
 URL: https://issues.apache.org/jira/browse/TIKA-93
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Jukka Zitting
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.7

 Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, 
 TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch, 
 TesseractOCR_Tyler.patch, TesseractOCR_Tyler_v2.patch, testOCR.docx, 
 testOCR.pdf, testOCR.pptx


 I don't know of any decent open source pure Java OCR libraries, but there are 
 command line OCR tools like Tesseract 
 (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to 
 extract text content (where available) from image files.





[jira] [Commented] (TIKA-1328) Translate Metadata and Content

2014-06-10 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14026783#comment-14026783
 ] 

Ray Gauss II commented on TIKA-1328:


Leaning towards the whitelist approach, perhaps we could add an 
{{isTranslatable}} field / method and corresponding constructor to the 
{{Property}} class (with a default of false) and update the properties we want 
to support translation on?
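One way to read the whitelist proposal is a hypothetical `isTranslatable` flag on `Property`, defaulting to false; the sketch below is illustrative only (`org.apache.tika.metadata.Property` has no such field, and the factory names are simplified):

```java
// Sketch of the proposed whitelist flag; TranslatableProperty is a
// hypothetical, heavily simplified stand-in for org.apache.tika.metadata.Property.
public class TranslatableProperty {
    private final String name;
    private final boolean translatable;

    private TranslatableProperty(String name, boolean translatable) {
        this.name = name;
        this.translatable = translatable;
    }

    // Existing-style factory: defaults to not translatable.
    public static TranslatableProperty internalText(String name) {
        return internalText(name, false);
    }

    // New overload for the handful of properties translation should touch.
    public static TranslatableProperty internalText(String name, boolean translatable) {
        return new TranslatableProperty(name, translatable);
    }

    public String getName() { return name; }
    public boolean isTranslatable() { return translatable; }

    public static void main(String[] args) {
        TranslatableProperty title = internalText("dc:title", true);
        TranslatableProperty created = internalText("dcterms:created");
        System.out.println(title.getName() + " translatable? " + title.isTranslatable());
        System.out.println(created.getName() + " translatable? " + created.isTranslatable());
    }
}
```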

 Translate Metadata and Content
 --

 Key: TIKA-1328
 URL: https://issues.apache.org/jira/browse/TIKA-1328
 Project: Tika
  Issue Type: New Feature
Reporter: Tyler Palsulich
 Fix For: 1.7


 Right now, Translation is only done on Strings. Ideally, users would be able 
 to turn on translation while parsing. I can think of a couple options:
 - Make a TranslateAutoDetectParser. Automatically detect the file type, parse 
 it, then translate the content.
 - Make a Context switch. When true, translate the content regardless of the 
 parser used. I'm not sure the best way to go about this method, but I prefer 
 it over another Parser.
 Regardless, we need a black or white list for translation. I think black list 
 would be the way to go -- which fields should not be translated (dates, 
 versions, ...) Any ideas? Also, somewhat unrelated, does anyone know of any 
 other open source translation libraries? If we were really lucky, it wouldn't 
 depend on an online service.





[jira] [Commented] (TIKA-1320) extract text from jpeg in solr tika

2014-06-04 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14017613#comment-14017613
 ] 

Ray Gauss II commented on TIKA-1320:


I'm not sure we have enough context in the description of this issue to help 
much here.

As [~thaichat04] points out, OCR is one way of obtaining text from an image, 
but there are also several forms of embedded metadata that can be extracted.

Is there specific text you're looking to extract?

 extract text from jpeg in solr tika
 ---

 Key: TIKA-1320
 URL: https://issues.apache.org/jira/browse/TIKA-1320
 Project: Tika
  Issue Type: New Feature
Reporter: muruganv
  Labels: features
   Original Estimate: 24h
  Remaining Estimate: 24h

 How to extract text from jpeg or image format or tiff in solr tika





[jira] [Commented] (TIKA-1294) Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs

2014-05-29 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012393#comment-14012393
 ] 

Ray Gauss II commented on TIKA-1294:


Hi [~talli...@apache.org],

The changes look good, thanks!

One minor point on conventions: I think enums are typically uppercase?

 Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs
 ---

 Key: TIKA-1294
 URL: https://issues.apache.org/jira/browse/TIKA-1294
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Trivial
 Fix For: 1.6

 Attachments: TIKA-1294.patch, TIKA-1294v1.patch


 TIKA-1268 added the capability to extract embedded images as regular embedded 
 resources...a great feature!
 However, for some use cases, it might not be desirable to extract those types 
 of embedded resources.  I see two ways of allowing the client to choose 
 whether or not to extract those images:
 1) set a value in the metadata for the extracted images that identifies them 
 as embedded PDXObjectImages vs regular image attachments.  The client can 
 then choose not to process embedded resources with a given metadata value.
 2) allow the client to set a parameter in the PDFConfig object.
 My initial proposal is to go with option 2, and I'll attach a patch shortly.





Re: [DISCUSS] Centralizing JSON handling of Metadata

2014-05-28 Thread Ray Gauss II
Hi Tim,

1) Sounds good to me.

2) I do think we want core as lean as possible, so my vote would be for a 
separate project/module, similar to what was done with tika-xmp.  Perhaps 
something like tika-serialization-json to indicate other formats may follow in 
the same precedence?

3) Similar to above, perhaps org.apache.tika.metadata.serialization.json?

Just curious, any particular reason for GSON over Jackson?

Regards,

Ray


On May 28, 2014 at 1:32:41 PM, Allison, Timothy B. (talli...@mitre.org) wrote:
 All,
  
 Nick recommended I put the question to the dev list for discussion. It might 
 be useful  
 to centralize our json handling of Metadata. We are now currently using 
 different libraries  
 and doing different things in CLI and in tika-server.
  
 1) Do we want to centralize json handling of Metadata?
  
 2) If so, where? Core? I share Nick's hesitance to add a dependency to core. 
 OTOH, GSON  
 is only 186k, but this would add potential for jar conflicts with folks 
 integrating Tika,  
 and it doesn't feel like a core function to me...it is a handy decorator for 
 applications.  
  
 3) Wherever it goes, what package do we want to put it in? I like Nick's 
 recommendations,  
 with a slight preference for the second (oat.utils.json).
  
 Thank you!
  
 Best,
  
 Tim
  
 -Original Message-
 From: Nick Burch (JIRA) [mailto:j...@apache.org]
 Sent: Wednesday, May 28, 2014 12:41 PM
 To: dev@tika.apache.org
 Subject: [jira] [Commented] (TIKA-1311) Centralize JSON handling of Metadata
  
  
 [ 
 https://issues.apache.org/jira/browse/TIKA-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011287#comment-14011287
   
 ]
  
 Nick Burch commented on TIKA-1311:
 --
  
 If we put it into core, we'd need to add another dependency (to GSON) which 
 isn't ideal,  
 so we might want to run the plan past the dev list first to see what people 
 think (core tends  
 to try to have a very minimal set of deps, unlike the other modules)
  
 Package wise, org.apache.tika.metadata.json is what I'd lean towards, 
 otherwise  
 utils.json
  
  Centralize JSON handling of Metadata
  
 
  Key: TIKA-1311
  URL: https://issues.apache.org/jira/browse/TIKA-1311
  Project: Tika
  Issue Type: Task
  Reporter: Tim Allison
  Priority: Minor
 
  When json was initially added to TIKA CLI (TIKA-213), there was a 
  recommendation to  
 centralize JSON handling of Metadata, potentially putting it in core. On a 
 recent bug  
 fix (TIKA-1291), the same recommendation was repeated especially noting that 
 we now  
 handle JSON/Metadata differently in CLI and server.
  Let's centralize JSON handling in core and use GSON. We should add a 
  serializer and a  
 deserializer so that users don't have to reinvent that wheel.
  
  
  
  



RE: [DISCUSS] Centralizing JSON handling of Metadata

2014-05-28 Thread Ray Gauss II
I’ve used Jackson a bit but I don’t have a strong preference either.

I’m generally a fan of splitting things up into very small projects to keep the 
dependency hierarchy as clean as possible.  In this example, if we decided to 
do a direct serialization to, say, a Mongo DBObject in the future the json 
project wouldn’t need to bring in Mongo dependencies.  Apache Camel does a good 
job of segmenting things [1].

However, that sort of modularization is probably a broader discussion than what 
we need for this particular issue, so between those two I’d vote for 
tika-serialization.

Regards,

Ray


[1] 
https://git-wip-us.apache.org/repos/asf?p=camel.git;a=tree;f=components;h=1132bd1bb98a446aec97d5c7bc4d032276a65d83;hb=HEAD


On May 28, 2014 at 8:42:03 PM, Allison, Timothy B. (talli...@mitre.org) wrote:
 Thank you, Ray!
  
 In almost reverse order, I've been using Jackson for this already, but I used 
 GSON in TIKA-1291  
 because that's what CLI was already using. In GSON's favor, the jar is a bit 
 smaller, but  
 I have no real preference or reason to pick one over the other. I'm not a 
 json-blackbelt  
 (or, I guess that would be blckbelt), so I'm happy to go with either.
  
 A new compilation unit makes sense. I'm wondering if we want to be that 
 specific? tika-serialization?  
 Or, maybe just tika-utils?
  
 Package name looks good to me.
  
 Thanks, again!
  
 Best,
  
 Tim
  
 -Original Message-
 From: Ray Gauss II [mailto:ray.ga...@alfresco.com]
 Sent: Wednesday, May 28, 2014 3:07 PM
 To: dev@tika.apache.org; Allison, Timothy B.
 Subject: Re: [DISCUSS] Centralizing JSON handling of Metadata
  
 Hi Tim,
  
 1) Sounds good to me.
  
 2) I do think we want core as lean as possible, so my vote would be for a 
 separate project/module,  
 similar to what was done with tika-xmp. Perhaps something like 
 tika-serialization-json  
 to indicate other formats may follow in the same precedence?
  
 3) Similar to above, perhaps org.apache.tika.metadata.serialization.json?
  
 Just curious, any particular reason for GSON over Jackson?
  
 Regards,
  
 Ray
  
  
 On May 28, 2014 at 1:32:41 PM, Allison, Timothy B. (talli...@mitre.org) wrote:
  All,
 
  Nick recommended I put the question to the dev list for discussion. It 
  might be useful  
  to centralize our json handling of Metadata. We are now currently using 
  different libraries  
  and doing different things in CLI and in tika-server.
 
  1) Do we want to centralize json handling of Metadata?
 
  2) If so, where? Core? I share Nick's hesitance to add a dependency to 
  core. OTOH, GSON  
  is only 186k, but this would add potential for jar conflicts with folks 
  integrating  
 Tika,
  and it doesn't feel like a core function to me...it is a handy decorator 
  for applications.  
 
  3) Wherever it goes, what package do we want to put it in? I like Nick's 
  recommendations,  
  with a slight preference for the second (oat.utils.json).
 
  Thank you!
 
  Best,
 
  Tim
 
  -Original Message-
  From: Nick Burch (JIRA) [mailto:j...@apache.org]
  Sent: Wednesday, May 28, 2014 12:41 PM
  To: dev@tika.apache.org
  Subject: [jira] [Commented] (TIKA-1311) Centralize JSON handling of Metadata
 
 
  [ 
  https://issues.apache.org/jira/browse/TIKA-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011287#comment-14011287

  ]
 
  Nick Burch commented on TIKA-1311:
  --
 
  If we put it into core, we'd need to add another dependency (to GSON) which 
  isn't ideal,  
  so we might want to run the plan past the dev list first to see what people 
  think (core tends  
  to try to have a very minimal set of deps, unlike the other modules)
 
  Package wise, org.apache.tika.metadata.json is what I'd lean towards, 
  otherwise  
  utils.json
 
   Centralize JSON handling of Metadata
   
  
   Key: TIKA-1311
   URL: https://issues.apache.org/jira/browse/TIKA-1311
   Project: Tika
   Issue Type: Task
   Reporter: Tim Allison
   Priority: Minor
  
   When json was initially added to TIKA CLI (TIKA-213), there was a 
   recommendation to  
  centralize JSON handling of Metadata, potentially putting it in core. On a 
  recent bug  
  fix (TIKA-1291), the same recommendation was repeated especially noting 
  that we now  
  handle JSON/Metadata differently in CLI and server.
   Let's centralize JSON handling in core and use GSON. We should add a 
   serializer and  
 a
  deserializer so that users don't have to reinvent that wheel.
 
 
 
 
  
  



[jira] [Commented] (TIKA-1278) Expose PDF Avg Char and Spacing Tolerance Config Params

2014-05-15 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995298#comment-13995298
 ] 

Ray Gauss II commented on TIKA-1278:


Hi [~tallison],

I thought about adding to {{PDFParser.properties}} but decided against it since 
PDFBox could change the default values or change the properties' scale or use, 
and if we weren't aware of that change we'd be inadvertently overriding those 
defaults.

Similarly with {{PDFParserConfig.configure}}, PDFBox's defaults seem to work 
well for most people.

We can certainly reconsider setting those defaults and/or adding other config 
if there are particular parameters people would find useful.

 Expose PDF Avg Char and Spacing Tolerance Config Params
 ---

 Key: TIKA-1278
 URL: https://issues.apache.org/jira/browse/TIKA-1278
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
Reporter: Ray Gauss II
Assignee: Ray Gauss II
 Fix For: 1.6


 {{PDFParserConfig}} should allow for override of PDFBox's 
 {{averageCharTolerance}} and {{spacingTolerance}} settings as noted by a TODO 
 comment in {{PDF2XHTML}}.
 Additionally, {{PDF2XHTML}}'s use of {{PDFParserConfig}} should be changed 
 slightly to allow for extension of that config class and its configuration 
 behavior.





[jira] [Commented] (TIKA-1295) Make some Dublin Core items multi-valued

2014-05-15 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995945#comment-13995945
 ] 

Ray Gauss II commented on TIKA-1295:


+1 for the data model more accurately reflecting the standard and for 
multilingual fields, but with a simple text bag how would you know which value 
corresponds to which language?

I think this is another example that highlights the need for a more structured 
underlying metadata store as mentioned in section IV of the [metadata 
roadmap|http://wiki.apache.org/tika/MetadataRoadmap].

 Make some Dublin Core items multi-valued
 

 Key: TIKA-1295
 URL: https://issues.apache.org/jira/browse/TIKA-1295
 Project: Tika
  Issue Type: Bug
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Minor
 Fix For: 1.6


 According to: http://www.pdfa.org/2011/08/pdfa-metadata-xmp-rdf-dublin-core, 
 dc:title, dc:description and dc:rights should allow multiple values because 
 of language alternatives.  Unless anyone objects in the next few days, I'll 
 switch those to Property.toInternalTextBag() from Property.toInternalText().  
 I'll also modify PDFParser to extract dc:rights.





[jira] [Commented] (TIKA-1294) Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs

2014-05-14 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995474#comment-13995474
 ] 

Ray Gauss II commented on TIKA-1294:


We ran into this exact issue recently and there is another method to achieve 
the same result without changing Tika code.

In {{ParsingEmbeddedDocumentExtractor.shouldParseEmbedded}} the 
{{ParseContext}} is checked for a {{DocumentSelector}}.

Since that extractor seems to be the only place that type is checked for 
(perhaps {{EmbeddedDocumentSelector}} would be a more appropriate name?) you 
can create one that suits your needs and set it as the document selector value 
in the {{ParseContext}}.

In our case we created a simple {{MediaTypeDisablingDocumentSelector}} that 
holds a list of {{disabledMediaTypes}}.

See 
[{{TikaGUI}}|http://svn.apache.org/repos/asf/tika/trunk/tika-app/src/main/java/org/apache/tika/gui/TikaGUI.java]
 and its {{ImageDocumentSelector}} as a general example of document selector 
use.
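A self-contained sketch of the `MediaTypeDisablingDocumentSelector` idea described above; the `DocumentSelector` interface and `Metadata` are reduced here to a plain string map, so everything below is illustrative rather than the tika-core API:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class MediaTypeDisablingSelectorSketch {
    // Reduced stand-in for org.apache.tika.extractor.DocumentSelector,
    // with Metadata simplified to a String map for this sketch.
    interface DocumentSelector {
        boolean select(Map<String, String> metadata);
    }

    static class MediaTypeDisablingDocumentSelector implements DocumentSelector {
        private final Set<String> disabledMediaTypes = new HashSet<String>();

        public void disable(String mediaType) {
            disabledMediaTypes.add(mediaType);
        }

        // Returning false means "do not parse this embedded document".
        public boolean select(Map<String, String> metadata) {
            String type = metadata.get("Content-Type");
            return type == null || !disabledMediaTypes.contains(type);
        }
    }

    public static void main(String[] args) {
        MediaTypeDisablingDocumentSelector selector = new MediaTypeDisablingDocumentSelector();
        selector.disable("image/jpeg");

        Map<String, String> embedded = new HashMap<String, String>();
        embedded.put("Content-Type", "image/jpeg");
        System.out.println("parse embedded jpeg? " + selector.select(embedded));
    }
}
```

In real Tika code the selector would be registered via `ParseContext.set(DocumentSelector.class, ...)` so that `ParsingEmbeddedDocumentExtractor` picks it up.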

 Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs
 ---

 Key: TIKA-1294
 URL: https://issues.apache.org/jira/browse/TIKA-1294
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Trivial
 Attachments: TIKA-1294.patch


 TIKA-1268 added the capability to extract embedded images as regular embedded 
 resources...a great feature!
 However, for some use cases, it might not be desirable to extract those types 
 of embedded resources.  I see two ways of allowing the client to choose 
 whether or not to extract those images:
 1) set a value in the metadata for the extracted images that identifies them 
 as embedded PDXObjectImages vs regular image attachments.  The client can 
 then choose not to process embedded resources with a given metadata value.
 2) allow the client to set a parameter in the PDFConfig object.
 My initial proposal is to go with option 2, and I'll attach a patch shortly.





[jira] [Commented] (TIKA-1294) Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs

2014-05-14 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13997500#comment-13997500
 ] 

Ray Gauss II commented on TIKA-1294:


I saw similar problematic resource consumption as well, which was the reason 
for figuring out how to disable this stuff :)

Perhaps a generic indication of why this embedded object is being parsed would 
be useful to have in the metadata object passed to the 
{{EmbeddedDocumentExtractor}}, something like an {{EmbeddedObjectContext}} enum 
with {{INLINE}} and {{ATTACHMENT}} options, which the 
{{EmbeddedDocumentExtractor}} (and in most cases that means the 
{{DocumentSelector}}) could use to determine whether to parse on a per-object 
basis? 
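The proposed hint could be as small as an enum; the name and constants below are taken from the comment above and are not an existing Tika type:

```java
// Hypothetical context hint to pass to an EmbeddedDocumentExtractor:
// INLINE for objects rendered within the page (e.g. a PDXObjectImage),
// ATTACHMENT for files attached to the document.
public enum EmbeddedObjectContext {
    INLINE,
    ATTACHMENT;

    public static void main(String[] args) {
        for (EmbeddedObjectContext context : values()) {
            System.out.println(context);
        }
    }
}
```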

 Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs
 ---

 Key: TIKA-1294
 URL: https://issues.apache.org/jira/browse/TIKA-1294
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Trivial
 Attachments: TIKA-1294.patch


 TIKA-1268 added the capability to extract embedded images as regular embedded 
 resources...a great feature!
 However, for some use cases, it might not be desirable to extract those types 
 of embedded resources.  I see two ways of allowing the client to choose 
 whether or not to extract those images:
 1) set a value in the metadata for the extracted images that identifies them 
 as embedded PDXObjectImages vs regular image attachments.  The client can 
 then choose not to process embedded resources with a given metadata value.
 2) allow the client to set a parameter in the PDFConfig object.
 My initial proposal is to go with option 2, and I'll attach a patch shortly.





[jira] [Commented] (TIKA-1295) Make some Dublin Core items multi-valued

2014-05-14 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13997478#comment-13997478
 ] 

Ray Gauss II commented on TIKA-1295:


bq. I see that there is an ALT PropertyType. Are there any plans to implement 
that (or did I miss the implementation somewhere)

Not sure. On first glance I don't see it anywhere, nor any use of 
{{ValueType.LOCALE}}.

I think we'd need a design discussion on how best to implement multilingual 
properties, likely through some suffixing of property keys if we don't change 
the underlying metadata structure, or perhaps that discussion has already taken 
place?
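The key-suffixing option mentioned above could look like this minimal sketch; the `;lang=` suffix scheme is invented purely for illustration, and nothing like it exists in `Metadata` today:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: stores language alternatives of a property under
// suffixed keys, e.g. "dc:title" plus "dc:title;lang=fr".
public class MultilingualMetadataSketch {
    private final Map<String, String> values = new HashMap<String, String>();

    private static String key(String property, String lang) {
        return lang == null ? property : property + ";lang=" + lang;
    }

    public void set(String property, String lang, String value) {
        values.put(key(property, lang), value);
    }

    public String get(String property, String lang) {
        return values.get(key(property, lang));
    }

    public static void main(String[] args) {
        MultilingualMetadataSketch metadata = new MultilingualMetadataSketch();
        metadata.set("dc:title", null, "The Title");
        metadata.set("dc:title", "fr", "Le Titre");
        System.out.println(metadata.get("dc:title", "fr"));
    }
}
```

Unlike a plain text bag, each value here stays tied to its language, at the cost of a key convention every consumer must know.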

 Make some Dublin Core items multi-valued
 

 Key: TIKA-1295
 URL: https://issues.apache.org/jira/browse/TIKA-1295
 Project: Tika
  Issue Type: Bug
Reporter: Tim Allison
Assignee: Tim Allison
Priority: Minor
 Fix For: 1.6


 According to: http://www.pdfa.org/2011/08/pdfa-metadata-xmp-rdf-dublin-core, 
 dc:title, dc:description and dc:rights should allow multiple values because 
 of language alternatives.  Unless anyone objects in the next few days, I'll 
 switch those to Property.toInternalTextBag() from Property.toInternalText().  
 I'll also modify PDFParser to extract dc:rights.





[jira] [Commented] (TIKA-1294) Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs

2014-05-13 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995960#comment-13995960
 ] 

Ray Gauss II commented on TIKA-1294:


bq. Can your MediaTypeDisablingDocumentSelector tell the difference between a 
jpeg that was attached to a PDF (basic attachment) and one that was derived 
from a PDXObjectImage?

If by basic attachment you mean those defined in 
{{PDEmbeddedFilesNameTreeNode}}, then not exactly.

Both {{PDF2XHTML.extractImages}} and {{PDF2XHTML.extractEmbeddedDocuments}} end 
up using the same {{getEmbeddedDocumentExtractor}} (a 
{{ParsingEmbeddedDocumentExtractor}} by default) and use the same 
{{DocumentSelector}} in the calls to 
{{extractor.shouldParseEmbedded(metadata)}}, but neither sets any special 
metadata keys indicating 'attached' vs 'embedded' so document selectors aren't 
able to explicitly distinguish.

However, the {{PDXObjectImage}} resources *only* get the media type set in the 
metadata object while the {{PDEmbeddedFilesNameTreeNode}} resources get media 
type, name, and length set, so you could potentially check for their presence 
to distinguish.

 Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs
 ---

 Key: TIKA-1294
 URL: https://issues.apache.org/jira/browse/TIKA-1294
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Trivial
 Attachments: TIKA-1294.patch


 TIKA-1268 added the capability to extract embedded images as regular embedded 
 resources...a great feature!
 However, for some use cases, it might not be desirable to extract those types 
 of embedded resources.  I see two ways of allowing the client to choose 
 whether or not to extract those images:
 1) set a value in the metadata for the extracted images that identifies them 
 as embedded PDXObjectImages vs regular image attachments.  The client can 
 then choose not to process embedded resources with a given metadata value.
 2) allow the client to set a parameter in the PDFConfig object.
 My initial proposal is to go with option 2, and I'll attach a patch shortly.





[jira] [Comment Edited] (TIKA-1278) Expose PDF Avg Char and Spacing Tolerance Config Params

2014-05-12 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995298#comment-13995298
 ] 

Ray Gauss II edited comment on TIKA-1278 at 5/12/14 5:39 PM:
-

Hi [~talli...@apache.org],

I thought about adding to {{PDFParser.properties}} but decided against it since 
PDFBox could change the default values or change the properties' scale or use, 
and if we weren't aware of that change we'd be inadvertently overriding those 
defaults.

Similarly with {{PDFParserConfig.configure}}, PDFBox's defaults seem to work 
well for most people.

We can certainly reconsider setting those defaults and/or adding other config 
if there are particular parameters people would find useful.


was (Author: rgauss):
Hi [~tallison],

I thought about adding to {{PDFParser.properties}} but decided against it since 
PDFBox could change the default values or change the properties' scale or use, 
and if we weren't aware of that change we'd be inadvertently overriding those 
defaults.

Similarly with {{PDFParserConfig.configure}}, PDFBox's defaults seem to work 
well for most people.

We can certainly reconsider setting those defaults and/or adding other config 
if there are particular parameters people would find useful.

 Expose PDF Avg Char and Spacing Tolerance Config Params
 ---

 Key: TIKA-1278
 URL: https://issues.apache.org/jira/browse/TIKA-1278
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
Reporter: Ray Gauss II
Assignee: Ray Gauss II
 Fix For: 1.6


 {{PDFParserConfig}} should allow for override of PDFBox's 
 {{averageCharTolerance}} and {{spacingTolerance}} settings as noted by a TODO 
 comment in {{PDF2XHTML}}.
 Additionally, {{PDF2XHTML}}'s use of {{PDFParserConfig}} should be changed 
 slightly to allow for extension of that config class and its configuration 
 behavior.





[jira] [Created] (TIKA-1278) Expose PDF Avg Char and Spacing Tolerance Config Params

2014-04-24 Thread Ray Gauss II (JIRA)
Ray Gauss II created TIKA-1278:
--

 Summary: Expose PDF Avg Char and Spacing Tolerance Config Params
 Key: TIKA-1278
 URL: https://issues.apache.org/jira/browse/TIKA-1278
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
Reporter: Ray Gauss II
Assignee: Ray Gauss II
 Fix For: 1.6


{{PDFParserConfig}} should allow for override of PDFBox's 
{{averageCharTolerance}} and {{spacingTolerance}} settings as noted by a TODO 
comment in {{PDF2XHTML}}.

Additionally, {{PDF2XHTML}}'s use of {{PDFParserConfig}} should be changed 
slightly to allow for extension of that config class and it's configuration 
behavior.





[jira] [Updated] (TIKA-1278) Expose PDF Avg Char and Spacing Tolerance Config Params

2014-04-24 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II updated TIKA-1278:
---

Description: 
{{PDFParserConfig}} should allow for override of PDFBox's 
{{averageCharTolerance}} and {{spacingTolerance}} settings as noted by a TODO 
comment in {{PDF2XHTML}}.

Additionally, {{PDF2XHTML}}'s use of {{PDFParserConfig}} should be changed 
slightly to allow for extension of that config class and its configuration 
behavior.

  was:
{{PDFParserConfig}} should allow for override of PDFBox's 
{{averageCharTolerance}} and {{spacingTolerance}} settings as noted by a TODO 
comment in {{PDF2XHTML}}.

Additionally, {{PDF2XHTML}}'s use of {{PDFParserConfig}} should be changed 
slightly to allow for extension of that config class and it's configuration 
behavior.


 Expose PDF Avg Char and Spacing Tolerance Config Params
 ---

 Key: TIKA-1278
 URL: https://issues.apache.org/jira/browse/TIKA-1278
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
Reporter: Ray Gauss II
Assignee: Ray Gauss II
 Fix For: 1.6


 {{PDFParserConfig}} should allow for override of PDFBox's 
 {{averageCharTolerance}} and {{spacingTolerance}} settings as noted by a TODO 
 comment in {{PDF2XHTML}}.
 Additionally, {{PDF2XHTML}}'s use of {{PDFParserConfig}} should be changed 
 slightly to allow for extension of that config class and its configuration 
 behavior.





[jira] [Resolved] (TIKA-1278) Expose PDF Avg Char and Spacing Tolerance Config Params

2014-04-24 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-1278.


Resolution: Fixed

Resolved in r1589722.

 Expose PDF Avg Char and Spacing Tolerance Config Params
 ---

 Key: TIKA-1278
 URL: https://issues.apache.org/jira/browse/TIKA-1278
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
Reporter: Ray Gauss II
Assignee: Ray Gauss II
 Fix For: 1.6


 {{PDFParserConfig}} should allow for override of PDFBox's 
 {{averageCharTolerance}} and {{spacingTolerance}} settings as noted by a TODO 
 comment in {{PDF2XHTML}}.
 Additionally, {{PDF2XHTML}}'s use of {{PDFParserConfig}} should be changed 
 slightly to allow for extension of that config class and its configuration 
 behavior.





[jira] [Comment Edited] (TIKA-1278) Expose PDF Avg Char and Spacing Tolerance Config Params

2014-04-24 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13979700#comment-13979700
 ] 

Ray Gauss II edited comment on TIKA-1278 at 4/24/14 1:31 PM:
-

Resolved in r1589722.

The setting of {{PDF2XHTML}} params was also moved from {{PDF2XHTML.process}} 
to a new {{PDFParserConfig.configure}} method which should allow developers to 
extend {{PDFParserConfig}} for custom behavior.


was (Author: rgauss):
Resolved in r1589722.

 Expose PDF Avg Char and Spacing Tolerance Config Params
 ---

 Key: TIKA-1278
 URL: https://issues.apache.org/jira/browse/TIKA-1278
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
Reporter: Ray Gauss II
Assignee: Ray Gauss II
 Fix For: 1.6


 {{PDFParserConfig}} should allow for override of PDFBox's 
 {{averageCharTolerance}} and {{spacingTolerance}} settings as noted by a TODO 
 comment in {{PDF2XHTML}}.
 Additionally, {{PDF2XHTML}}'s use of {{PDFParserConfig}} should be changed 
 slightly to allow for extension of that config class and its configuration 
 behavior.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Reopened] (TIKA-1279) Missing return lines at output of SourceCodeParser

2014-04-24 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II reopened TIKA-1279:


  Assignee: Hong-Thai Nguyen

[~thaichat04], I believe we still have to support Java 6 and 
{{System.lineSeparator()}} appears to have been added in Java 7.

I think {{System.getProperty("line.separator")}} would be equivalent.
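For clarity, a minimal sketch of the Java 6 compatible replacement being suggested (illustrative only, not Tika code):

```java
public class LineSeparatorCompat {
    public static void main(String[] args) {
        // Java 7+ offers System.lineSeparator(); on Java 6 the
        // equivalent is the "line.separator" system property.
        String sep = System.getProperty("line.separator");
        // "\n" on Unix-like systems, "\r\n" on Windows.
        System.out.println(sep.equals("\n") || sep.equals("\r\n"));
    }
}
```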

 Missing return lines at output of SourceCodeParser
 --

 Key: TIKA-1279
 URL: https://issues.apache.org/jira/browse/TIKA-1279
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5
Reporter: Hong-Thai Nguyen
Assignee: Hong-Thai Nguyen
Priority: Trivial
 Fix For: 1.6


 xhtml output is on a single line.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (TIKA-1151) Maven Build Should Automatically Produce test-jar Artifacts

2014-03-24 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-1151.


Resolution: Fixed

Resolved in r1580887.

 Maven Build Should Automatically Produce test-jar Artifacts
 ---

 Key: TIKA-1151
 URL: https://issues.apache.org/jira/browse/TIKA-1151
 Project: Tika
  Issue Type: Improvement
  Components: packaging
Reporter: Ray Gauss II
Assignee: Ray Gauss II

 The Maven build should be updated to produce test jar artifacts for 
 appropriate sub-projects (see below) such that developers can extend test 
 classes by adding the {{test-jar}} artifact as a dependency, i.e.:
 {code}
 <dependency>
   <groupId>org.apache.tika</groupId>
   <artifactId>tika-parsers</artifactId>
   <version>1.6-SNAPSHOT</version>
   <type>test-jar</type>
   <scope>test</scope>
 </dependency>
 {code}
 The following sub-projects contain tests that developers might want to extend 
 and their corresponding {{pom.xml}} should have the [attached 
 tests|http://maven.apache.org/guides/mini/guide-attached-tests.html] added:
 - tika-app
 - tika-core
 - tika-parsers
 - tika-server
 - tika-xmp



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (TIKA-1151) Maven Build Should Automatically Produce test-jar Artifacts

2014-03-24 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II updated TIKA-1151:
---

Fix Version/s: 1.6

 Maven Build Should Automatically Produce test-jar Artifacts
 ---

 Key: TIKA-1151
 URL: https://issues.apache.org/jira/browse/TIKA-1151
 Project: Tika
  Issue Type: Improvement
  Components: packaging
Reporter: Ray Gauss II
Assignee: Ray Gauss II
 Fix For: 1.6


 The Maven build should be updated to produce test jar artifacts for 
 appropriate sub-projects (see below) such that developers can extend test 
 classes by adding the {{test-jar}} artifact as a dependency, i.e.:
 {code}
 <dependency>
   <groupId>org.apache.tika</groupId>
   <artifactId>tika-parsers</artifactId>
   <version>1.6-SNAPSHOT</version>
   <type>test-jar</type>
   <scope>test</scope>
 </dependency>
 {code}
 The following sub-projects contain tests that developers might want to extend 
 and their corresponding {{pom.xml}} should have the [attached 
 tests|http://maven.apache.org/guides/mini/guide-attached-tests.html] added:
 - tika-app
 - tika-core
 - tika-parsers
 - tika-server
 - tika-xmp



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (TIKA-1151) Maven Build Should Automatically Produce test-jar Artifacts

2014-02-20 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II updated TIKA-1151:
---

Description: 
The Maven build should be updated to produce test jar artifacts for appropriate 
sub-projects (see below) such that developers can extend test classes by adding 
the {{test-jar}} artifact as a dependency, i.e.:
{code}
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.5-SNAPSHOT</version>
  <type>test-jar</type>
  <scope>test</scope>
</dependency>
{code}

The following sub-projects contain tests that developers might want to extend 
and their corresponding {{pom.xml}} should have the [attached 
tests|http://maven.apache.org/guides/mini/guide-attached-tests.html] added:
- tika-app
- tika-core
- tika-parsers
- tika-server
- tika-xmp



  was:
The Maven build should be updated to produce test jar artifacts for appropriate 
sub-projects (see below) such that developers can extend test classes by adding 
the {{test-jar}} artifact as a dependency, i.e.:
{code}
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.5-SNAPSHOT</version>
  <type>test-jar</type>
  <scope>test</scope>
</dependency>
{code}

The following sub-projects contain tests that developers might want to extend 
and their corresponding {{pom.xml}} should have the [attached 
tests|http://maven.apache.org/guides/mini/guide-attached-tests.html] added:
- tika-app
- tika-bundle
- tika-core
- tika-parsers
- tika-server
- tika-xmp




 Maven Build Should Automatically Produce test-jar Artifacts
 ---

 Key: TIKA-1151
 URL: https://issues.apache.org/jira/browse/TIKA-1151
 Project: Tika
  Issue Type: Improvement
  Components: packaging
Reporter: Ray Gauss II
Assignee: Ray Gauss II

 The Maven build should be updated to produce test jar artifacts for 
 appropriate sub-projects (see below) such that developers can extend test 
 classes by adding the {{test-jar}} artifact as a dependency, i.e.:
 {code}
 <dependency>
   <groupId>org.apache.tika</groupId>
   <artifactId>tika-parsers</artifactId>
   <version>1.5-SNAPSHOT</version>
   <type>test-jar</type>
   <scope>test</scope>
 </dependency>
 {code}
 The following sub-projects contain tests that developers might want to extend 
 and their corresponding {{pom.xml}} should have the [attached 
 tests|http://maven.apache.org/guides/mini/guide-attached-tests.html] added:
 - tika-app
 - tika-core
 - tika-parsers
 - tika-server
 - tika-xmp



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (TIKA-1151) Maven Build Should Automatically Produce test-jar Artifacts

2014-02-20 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II updated TIKA-1151:
---

Description: 
The Maven build should be updated to produce test jar artifacts for appropriate 
sub-projects (see below) such that developers can extend test classes by adding 
the {{test-jar}} artifact as a dependency, i.e.:
{code}
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.6-SNAPSHOT</version>
  <type>test-jar</type>
  <scope>test</scope>
</dependency>
{code}

The following sub-projects contain tests that developers might want to extend 
and their corresponding {{pom.xml}} should have the [attached 
tests|http://maven.apache.org/guides/mini/guide-attached-tests.html] added:
- tika-app
- tika-core
- tika-parsers
- tika-server
- tika-xmp



  was:
The Maven build should be updated to produce test jar artifacts for appropriate 
sub-projects (see below) such that developers can extend test classes by adding 
the {{test-jar}} artifact as a dependency, i.e.:
{code}
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.5-SNAPSHOT</version>
  <type>test-jar</type>
  <scope>test</scope>
</dependency>
{code}

The following sub-projects contain tests that developers might want to extend 
and their corresponding {{pom.xml}} should have the [attached 
tests|http://maven.apache.org/guides/mini/guide-attached-tests.html] added:
- tika-app
- tika-core
- tika-parsers
- tika-server
- tika-xmp




 Maven Build Should Automatically Produce test-jar Artifacts
 ---

 Key: TIKA-1151
 URL: https://issues.apache.org/jira/browse/TIKA-1151
 Project: Tika
  Issue Type: Improvement
  Components: packaging
Reporter: Ray Gauss II
Assignee: Ray Gauss II

 The Maven build should be updated to produce test jar artifacts for 
 appropriate sub-projects (see below) such that developers can extend test 
 classes by adding the {{test-jar}} artifact as a dependency, i.e.:
 {code}
 <dependency>
   <groupId>org.apache.tika</groupId>
   <artifactId>tika-parsers</artifactId>
   <version>1.6-SNAPSHOT</version>
   <type>test-jar</type>
   <scope>test</scope>
 </dependency>
 {code}
 The following sub-projects contain tests that developers might want to extend 
 and their corresponding {{pom.xml}} should have the [attached 
 tests|http://maven.apache.org/guides/mini/guide-attached-tests.html] added:
 - tika-app
 - tika-core
 - tika-parsers
 - tika-server
 - tika-xmp



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (TIKA-1151) Maven Build Should Automatically Produce test-jar Artifacts

2014-02-20 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907100#comment-13907100
 ] 

Ray Gauss II commented on TIKA-1151:


This will create a few artifacts on the larger side, notably:
||Artifact||Size||
|tika-parsers-1.6-SNAPSHOT-tests.jar|33MB|
|tika-server-1.6-SNAPSHOT-tests.jar|6.8MB|

Not huge, but I thought I'd double check that no one has any issues with that 
before committing.

 Maven Build Should Automatically Produce test-jar Artifacts
 ---

 Key: TIKA-1151
 URL: https://issues.apache.org/jira/browse/TIKA-1151
 Project: Tika
  Issue Type: Improvement
  Components: packaging
Reporter: Ray Gauss II
Assignee: Ray Gauss II

 The Maven build should be updated to produce test jar artifacts for 
 appropriate sub-projects (see below) such that developers can extend test 
 classes by adding the {{test-jar}} artifact as a dependency, i.e.:
 {code}
 <dependency>
   <groupId>org.apache.tika</groupId>
   <artifactId>tika-parsers</artifactId>
   <version>1.6-SNAPSHOT</version>
   <type>test-jar</type>
   <scope>test</scope>
 </dependency>
 {code}
 The following sub-projects contain tests that developers might want to extend 
 and their corresponding {{pom.xml}} should have the [attached 
 tests|http://maven.apache.org/guides/mini/guide-attached-tests.html] added:
 - tika-app
 - tika-core
 - tika-parsers
 - tika-server
 - tika-xmp



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Re: Extract thumbnail from openxml office files

2014-01-08 Thread Ray Gauss II
Hi Hong-Thai,

It’s certainly worth investigating.  Several other formats can have embedded 
thumbnails as well so we could implement a generic thumbnail property.

We could probably store as something like a Base64 encoded string, but we’d 
likely want to place limits on the size and may need a thumbnail internet media 
type field as well to assist in decoding.
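
A rough sketch of the Base64-with-size-limit idea described above (the class name and size cap here are hypothetical, not a proposed Tika API):

```java
import java.util.Base64;

public class ThumbnailMetadataSketch {
    // Hypothetical cap on accepted thumbnail size, per the discussion above.
    static final int MAX_THUMBNAIL_BYTES = 64 * 1024;

    // Encode raw thumbnail bytes as a Base64 string suitable for a
    // string-valued metadata property, rejecting oversized images.
    static String encodeThumbnail(byte[] imageBytes) {
        if (imageBytes.length > MAX_THUMBNAIL_BYTES) {
            throw new IllegalArgumentException("thumbnail too large");
        }
        return Base64.getEncoder().encodeToString(imageBytes);
    }

    public static void main(String[] args) {
        // First four bytes of a PNG header, standing in for thumbnail data.
        byte[] fake = new byte[] {(byte) 0x89, 'P', 'N', 'G'};
        System.out.println(encodeThumbnail(fake));
    }
}
```

A companion media-type property (e.g. {{image/png}}) would tell consumers how to decode the bytes, as noted above.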

Unless others feel differently, I would say open a JIRA where we could start 
discussing the design of such a feature.

Thanks!

Ray


On January 8, 2014 at 5:36:32 AM, Hong-Thai Nguyen 
(hong-thai.ngu...@polyspot.com) wrote:
  
 Hi all,
 I want to extract thumbnail image included in Open XML office  
 files. Apparently, we can do it by openxml4j: 
 http://openxmldeveloper.org/blog/b/openxmldeveloper/archive/2006/11/21/openxmlandjava.aspx
   
 The question is : should we integrate thumbnail in default metadata  
 list of ooxml parsing result ?
  
  
 Thanks
  
 Hong-Thai
  
  



[jira] [Assigned] (TIKA-1177) Add Matroska (mkv, mka) format detection

2013-10-04 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II reassigned TIKA-1177:
--

Assignee: Ray Gauss II

 Add Matroska (mkv, mka) format detection
 

 Key: TIKA-1177
 URL: https://issues.apache.org/jira/browse/TIKA-1177
 Project: Tika
  Issue Type: Improvement
  Components: mime
Affects Versions: 1.4
Reporter: Boris Naguet
Assignee: Ray Gauss II
Priority: Minor

 There's no mimetype detection for Matroska format, although it's a popular 
 video format.
 Here is some code I added in my custom mimetypes to detect them:
 {code}
   <mime-type type="video/x-matroska">
     <glob pattern="*.mkv"/>
     <magic priority="40">
       <match value="0x1A45DFA3934282886d6174726f736b61"
              type="string" offset="0"/>
     </magic>
   </mime-type>
   <mime-type type="audio/x-matroska">
     <glob pattern="*.mka"/>
   </mime-type>
 {code}
 I found the signature for the mkv on: 
 http://www.garykessler.net/library/file_sigs.html
 I was not able to find it clearly for mka, but detection by filename is still 
 useful.
 Although, the full spec is available here:
 http://matroska.org/technical/specs/index.html
 Maybe it's a bit more complex than this constant magic, but it works on my 
 tests files.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Resolved] (TIKA-1177) Add Matroska (mkv, mka) format detection

2013-10-04 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-1177.


   Resolution: Fixed
Fix Version/s: 1.5

Unfortunately that magic doesn't seem to be required in all MKV files.  I tried 
several utilities to convert various sources to MKV and none contained that 
magic.

A magic value of {{0x1A45DFA3}} is present, but that's also present in WebM, 
which is extended from Matroska.

I've added Matroska mime-types based on just extension for now and also added 
the WebM mime-type.

We can open other issues, linked to this one, for data detection of MKV and 
WebM files if need be.

Resolved in r1529260.
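
For illustration, a minimal check of the shared EBML magic mentioned above (a sketch only, not Tika's detection code; it cannot distinguish Matroska from WebM, which is exactly the ambiguity described):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class EbmlMagicCheck {
    // EBML header magic shared by Matroska and WebM containers.
    private static final byte[] EBML_MAGIC =
            {(byte) 0x1A, (byte) 0x45, (byte) 0xDF, (byte) 0xA3};

    // Returns true if the stream begins with the EBML magic bytes.
    static boolean startsWithEbmlMagic(InputStream in) throws IOException {
        for (byte expected : EBML_MAGIC) {
            if (in.read() != (expected & 0xFF)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] mkvLike = {(byte) 0x1A, (byte) 0x45, (byte) 0xDF, (byte) 0xA3, 0x00};
        System.out.println(startsWithEbmlMagic(new ByteArrayInputStream(mkvLike)));
    }
}
```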

 Add Matroska (mkv, mka) format detection
 

 Key: TIKA-1177
 URL: https://issues.apache.org/jira/browse/TIKA-1177
 Project: Tika
  Issue Type: Improvement
  Components: mime
Affects Versions: 1.4
Reporter: Boris Naguet
Assignee: Ray Gauss II
Priority: Minor
 Fix For: 1.5


 There's no mimetype detection for Matroska format, although it's a popular 
 video format.
 Here is some code I added in my custom mimetypes to detect them:
 {code}
   <mime-type type="video/x-matroska">
     <glob pattern="*.mkv"/>
     <magic priority="40">
       <match value="0x1A45DFA3934282886d6174726f736b61"
              type="string" offset="0"/>
     </magic>
   </mime-type>
   <mime-type type="audio/x-matroska">
     <glob pattern="*.mka"/>
   </mime-type>
 {code}
 I found the signature for the mkv on: 
 http://www.garykessler.net/library/file_sigs.html
 I was not able to find it clearly for mka, but detection by filename is still 
 useful.
 Although, the full spec is available here:
 http://matroska.org/technical/specs/index.html
 Maybe it's a bit more complex than this constant magic, but it works on my 
 tests files.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Resolved] (TIKA-1179) A corrupt mp3 file can cause an infinite loop in Mp3Parser

2013-10-04 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-1179.


Resolution: Cannot Reproduce
  Assignee: Ray Gauss II

I've just confirmed the described behavior in Tika 1.4, however, it appears the 
file is parsed just fine in 1.5!

You can verify by downloading a 1.5 snapshot of {{tika-app}} ([current 
link|https://repository.apache.org/content/groups/snapshots/org/apache/tika/tika-app/1.5-SNAPSHOT/tika-app-1.5-20130927.201341-30.jar]),
 running the app, i.e.:
{code}
java -jar tika-app-1.5-20130927.201341-30.jar
{code}
and dropping {{corrupt.mp3}} onto the app window.
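
For background on this class of bug: infinite loops in skip-style code usually come from not handling a 0 or -1 return from {{InputStream.skip}}/{{read}}. A defensive skip helper (a generic sketch, not the actual change that fixed this in Tika) looks like:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class SafeSkip {
    // InputStream.skip may legitimately return 0 before EOF, so a naive
    // `while (skipped < n) skip(...)` loop can spin forever on some
    // streams. Falling back to read() either makes progress or detects EOF.
    static long skipFully(InputStream in, long n) throws IOException {
        long remaining = n;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped > 0) {
                remaining -= skipped;
            } else if (in.read() != -1) {
                remaining--;            // forced one byte of progress
            } else {
                break;                  // EOF reached, stop looping
            }
        }
        return n - remaining;           // bytes actually skipped
    }

    public static void main(String[] args) throws IOException {
        // Ask to skip more bytes than the stream holds; must terminate.
        InputStream in = new ByteArrayInputStream(new byte[10]);
        System.out.println(skipFully(in, 25));
    }
}
```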

 A corrupt mp3 file can cause an infinite loop in Mp3Parser
 --

 Key: TIKA-1179
 URL: https://issues.apache.org/jira/browse/TIKA-1179
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Marius Dumitru Florea
Assignee: Ray Gauss II
 Fix For: 1.5

 Attachments: corrupt.mp3


 I have a thread that indexes (among other things) files using Apache Solr. 
 This thread hangs (still running but with no progress) when trying to extract 
 meta data from the mp3 file attached to this issue. Here are a couple of 
 thread dumps taken at various moments:
 {noformat}
 XWiki Solr index thread daemon prio=10 tid=0x03b72800 nid=0x64b5 
 runnable [0x7f46f4617000]
java.lang.Thread.State: RUNNABLE
   at 
 org.apache.commons.io.input.AutoCloseInputStream.close(AutoCloseInputStream.java:63)
   at 
 org.apache.commons.io.input.AutoCloseInputStream.afterRead(AutoCloseInputStream.java:77)
   at 
 org.apache.commons.io.input.ProxyInputStream.read(ProxyInputStream.java:99)
   at java.io.BufferedInputStream.fill(Unknown Source)
   at java.io.BufferedInputStream.read1(Unknown Source)
   at java.io.BufferedInputStream.read(Unknown Source)
   - locked 0xcb7094e8 (a java.io.BufferedInputStream)
   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
   at java.io.FilterInputStream.read(Unknown Source)
   at org.apache.tika.io.TailStream.read(TailStream.java:117)
   at org.apache.tika.io.TailStream.skip(TailStream.java:140)
   at org.apache.tika.parser.mp3.MpegStream.skipStream(MpegStream.java:283)
   at org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:160)
   at 
 org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:193)
   at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
   at org.apache.tika.Tika.parseToString(Tika.java:380)
   ...
 {noformat}
 {noformat}
 XWiki Solr index thread daemon prio=10 tid=0x03b72800 nid=0x64b5 
 runnable [0x7f46f4618000]
java.lang.Thread.State: RUNNABLE
   at org.apache.tika.io.TailStream.skip(TailStream.java:133)
   at org.apache.tika.parser.mp3.MpegStream.skipStream(MpegStream.java:283)
   at org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:160)
   at 
 org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:193)
   at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
   at org.apache.tika.Tika.parseToString(Tika.java:380)
   ...
 {noformat}
 {noformat}
 XWiki Solr index thread daemon prio=10 tid=0x03b72800 nid=0x64b5 
 runnable [0x7f46f4617000]
java.lang.Thread.State: RUNNABLE
   at java.io.BufferedInputStream.read1(Unknown Source)
   at java.io.BufferedInputStream.read(Unknown Source)
   - locked 0xcb1be170 (a java.io.BufferedInputStream)
   at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
   at java.io.FilterInputStream.read(Unknown Source)
   at org.apache.tika.io.TailStream.read(TailStream.java:117)
   at org.apache.tika.io.TailStream.skip(TailStream.java:140)
   at org.apache.tika.parser.mp3.MpegStream.skipStream(MpegStream.java:283)
   at org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:160)
   at 
 org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:193)
   at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242

[jira] [Assigned] (TIKA-1170) Insufficiently specific magic for binary image/cgm files

2013-09-03 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II reassigned TIKA-1170:
--

Assignee: Ray Gauss II

 Insufficiently specific magic for binary image/cgm files
 

 Key: TIKA-1170
 URL: https://issues.apache.org/jira/browse/TIKA-1170
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.4
Reporter: Andrew Jackson
Assignee: Ray Gauss II
Priority: Minor
 Attachments: 0001-Added-CGM-test-file-test-and-improved-magic.patch, 
 plotutils-example.cgm


 I've been running Tika against a large corpus of web archives files, and I'm 
 seeing a number of false positives for image/cgm. The Tika magic is
 {code}
   <match value="BEGMF" type="string" offset="0"/>
   <match value="0x0020" mask="0xffe0" type="string" offset="0"/>
 {code}
 The issue seems to be that the second magic matcher is not very specific, 
 e.g. matching files that start 0x002a. To be fair, this is only c.700 false 
 matches out of 300 million resources, but it would be nice if this could be 
 tightened up. 
 Looking at the PRONOM signatures
 * 
 http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1048&strPageToDisplay=signatures
 * 
 http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1049&strPageToDisplay=signatures
 * 
 http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1050&strPageToDisplay=signatures
 * 
 http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1051&strPageToDisplay=signatures
 it seems we have a variable position marker that changes slightly for each 
 version. Therefore, a more robust signature should be:
 {code}
   <match value="BEGMF" type="string" offset="0"/>
   <match value="0x0020" mask="0xffe0" type="string" offset="0">
     <match value="0x10220001" type="string" offset="2:64"/>
     <match value="0x10220002" type="string" offset="2:64"/>
     <match value="0x10220003" type="string" offset="2:64"/>
     <match value="0x10220004" type="string" offset="2:64"/>
   </match>
 {code}
 Where I have assumed the filename part of the CGM file will be less than 64 
 characters long.
 Could this magic be considered for inclusion?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (TIKA-1170) Insufficiently specific magic for binary image/cgm files

2013-09-03 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-1170.


   Resolution: Fixed
Fix Version/s: 1.5

Added in r1519664.

Thanks!

 Insufficiently specific magic for binary image/cgm files
 

 Key: TIKA-1170
 URL: https://issues.apache.org/jira/browse/TIKA-1170
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.4
Reporter: Andrew Jackson
Assignee: Ray Gauss II
Priority: Minor
 Fix For: 1.5

 Attachments: 0001-Added-CGM-test-file-test-and-improved-magic.patch, 
 plotutils-example.cgm


 I've been running Tika against a large corpus of web archives files, and I'm 
 seeing a number of false positives for image/cgm. The Tika magic is
 {code}
   <match value="BEGMF" type="string" offset="0"/>
   <match value="0x0020" mask="0xffe0" type="string" offset="0"/>
 {code}
 The issue seems to be that the second magic matcher is not very specific, 
 e.g. matching files that start 0x002a. To be fair, this is only c.700 false 
 matches out of 300 million resources, but it would be nice if this could be 
 tightened up. 
 Looking at the PRONOM signatures
 * 
 http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1048&strPageToDisplay=signatures
 * 
 http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1049&strPageToDisplay=signatures
 * 
 http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1050&strPageToDisplay=signatures
 * 
 http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1051&strPageToDisplay=signatures
 it seems we have a variable position marker that changes slightly for each 
 version. Therefore, a more robust signature should be:
 {code}
   <match value="BEGMF" type="string" offset="0"/>
   <match value="0x0020" mask="0xffe0" type="string" offset="0">
     <match value="0x10220001" type="string" offset="2:64"/>
     <match value="0x10220002" type="string" offset="2:64"/>
     <match value="0x10220003" type="string" offset="2:64"/>
     <match value="0x10220004" type="string" offset="2:64"/>
   </match>
 {code}
 Where I have assumed the filename part of the CGM file will be less than 64 
 characters long.
 Could this magic be considered for inclusion?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (TIKA-1170) Insufficiently specific magic for binary image/cgm files

2013-09-03 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1375#comment-1375
 ] 

Ray Gauss II commented on TIKA-1170:


My mistake, that's an artifact of me manually applying the git patch.

It does, however, seem to indicate that we should have a unit test for the 
false positives.  Do you have a file which demonstrates that problem?

 Insufficiently specific magic for binary image/cgm files
 

 Key: TIKA-1170
 URL: https://issues.apache.org/jira/browse/TIKA-1170
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.4
Reporter: Andrew Jackson
Assignee: Ray Gauss II
Priority: Minor
 Fix For: 1.5

 Attachments: 0001-Added-CGM-test-file-test-and-improved-magic.patch, 
 plotutils-example.cgm


 I've been running Tika against a large corpus of web archives files, and I'm 
 seeing a number of false positives for image/cgm. The Tika magic is
 {code}
   <match value="BEGMF" type="string" offset="0"/>
   <match value="0x0020" mask="0xffe0" type="string" offset="0"/>
 {code}
 The issue seems to be that the second magic matcher is not very specific, 
 e.g. matching files that start 0x002a. To be fair, this is only c.700 false 
 matches out of 300 million resources, but it would be nice if this could be 
 tightened up. 
 Looking at the PRONOM signatures
 * 
 http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1048&strPageToDisplay=signatures
 * 
 http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1049&strPageToDisplay=signatures
 * 
 http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1050&strPageToDisplay=signatures
 * 
 http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1051&strPageToDisplay=signatures
 it seems we have a variable position marker that changes slightly for each 
 version. Therefore, a more robust signature should be:
 {code}
   <match value="BEGMF" type="string" offset="0"/>
   <match value="0x0020" mask="0xffe0" type="string" offset="0">
     <match value="0x10220001" type="string" offset="2:64"/>
     <match value="0x10220002" type="string" offset="2:64"/>
     <match value="0x10220003" type="string" offset="2:64"/>
     <match value="0x10220004" type="string" offset="2:64"/>
   </match>
 {code}
 Where I have assumed the filename part of the CGM file will be less than 64 
 characters long.
 Could this magic be considered for inclusion?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Reopened] (TIKA-1170) Insufficiently specific magic for binary image/cgm files

2013-09-03 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II reopened TIKA-1170:



 Insufficiently specific magic for binary image/cgm files
 

 Key: TIKA-1170
 URL: https://issues.apache.org/jira/browse/TIKA-1170
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.4
Reporter: Andrew Jackson
Assignee: Ray Gauss II
Priority: Minor
 Fix For: 1.5

 Attachments: 0001-Added-CGM-test-file-test-and-improved-magic.patch, 
 plotutils-example.cgm


 I've been running Tika against a large corpus of web archives files, and I'm 
 seeing a number of false positives for image/cgm. The Tika magic is
 {code}
   <match value="BEGMF" type="string" offset="0"/>
   <match value="0x0020" mask="0xffe0" type="string" offset="0"/>
 {code}
 The issue seems to be that the second magic matcher is not very specific, 
 e.g. matching files that start 0x002a. To be fair, this is only c.700 false 
 matches out of 300 million resources, but it would be nice if this could be 
 tightened up. 
 Looking at the PRONOM signatures
 * 
 http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1048&strPageToDisplay=signatures
 * 
 http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1049&strPageToDisplay=signatures
 * 
 http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1050&strPageToDisplay=signatures
 * 
 http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1051&strPageToDisplay=signatures
 it seems we have a variable position marker that changes slightly for each 
 version. Therefore, a more robust signature should be:
 {code}
   <match value="BEGMF" type="string" offset="0"/>
   <match value="0x0020" mask="0xffe0" type="string" offset="0">
     <match value="0x10220001" type="string" offset="2:64"/>
     <match value="0x10220002" type="string" offset="2:64"/>
     <match value="0x10220003" type="string" offset="2:64"/>
     <match value="0x10220004" type="string" offset="2:64"/>
   </match>
 {code}
 Where I have assumed the filename part of the CGM file will be less than 64 
 characters long.
 Could this magic be considered for inclusion?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (TIKA-1170) Insufficiently specific magic for binary image/cgm files

2013-09-03 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-1170.


Resolution: Fixed

Resolved in r1519792.

SVN did not like the html extension on the problem file.

Thanks again.

 Insufficiently specific magic for binary image/cgm files
 

 Key: TIKA-1170
 URL: https://issues.apache.org/jira/browse/TIKA-1170
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.4
Reporter: Andrew Jackson
Assignee: Ray Gauss II
Priority: Minor
 Fix For: 1.5

 Attachments: 0001-Added-CGM-test-file-test-and-improved-magic.patch, 
 0002-Added-example-malformed-HTML-file-that-was-being-mis.patch, 
 plotutils-example.cgm


 I've been running Tika against a large corpus of web archives files, and I'm 
 seeing a number of false positives for image/cgm. The Tika magic is
 {code}
   <match value="BEGMF" type="string" offset="0"/>
   <match value="0x0020" mask="0xffe0" type="string" offset="0"/>
 {code}
 The issue seems to be that the second magic matcher is not very specific, 
 e.g. matching files that start 0x002a. To be fair, this is only c.700 false 
 matches out of 300 million resources, but it would be nice if this could be 
 tightened up. 
 Looking at the PRONOM signatures
 * 
 http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1048&strPageToDisplay=signatures
 * 
 http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1049&strPageToDisplay=signatures
 * 
 http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1050&strPageToDisplay=signatures
 * 
 http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1051&strPageToDisplay=signatures
 it seems we have a variable position marker that changes slightly for each 
 version. Therefore, a more robust signature should be:
 {code}
   <match value="BEGMF" type="string" offset="0"/>
   <match value="0x0020" mask="0xffe0" type="string" offset="0">
     <match value="0x10220001" type="string" offset="2:64"/>
     <match value="0x10220002" type="string" offset="2:64"/>
     <match value="0x10220003" type="string" offset="2:64"/>
     <match value="0x10220004" type="string" offset="2:64"/>
   </match>
 {code}
 Where I have assumed the filename part of the CGM file will be less than 64 
 characters long.
 Could this magic be considered for inclusion?
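Not part of the original thread: a small Python sketch, with hypothetical helper names, of how a value/mask magic match behaves. It shows why the loose 0x0020/0xffe0 signature also accepts input starting with 0x002a, the false positive reported above:

```python
def masked_match(data: bytes, value: int, mask: int) -> bool:
    """True if the first 16-bit big-endian word of data equals value under mask."""
    if len(data) < 2:
        return False
    word = int.from_bytes(data[:2], "big")
    return (word & mask) == (value & mask)

def looks_like_binary_cgm(data: bytes) -> bool:
    # The loose signature discussed above: value 0x0020, mask 0xffe0 at offset 0,
    # alongside the clear-text "BEGMF" marker.
    return data.startswith(b"BEGMF") or masked_match(data, 0x0020, 0xFFE0)
```

Because the mask clears the low 5 bits, any first word from 0x0020 through 0x003f matches, which is what the proposed nested element-marker checks tighten up.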

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (TIKA-1170) Insufficiently specific magic for binary image/cgm files

2013-09-03 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13757000#comment-13757000
 ] 

Ray Gauss II commented on TIKA-1170:


Yes, but in this particular case I thought it might be better to explicitly 
change the file name so other developers don't fix the media type for that 
file in the future.

 Insufficiently specific magic for binary image/cgm files
 

 Key: TIKA-1170
 URL: https://issues.apache.org/jira/browse/TIKA-1170
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.4
Reporter: Andrew Jackson
Assignee: Ray Gauss II
Priority: Minor
 Fix For: 1.5

 Attachments: 0001-Added-CGM-test-file-test-and-improved-magic.patch, 
 0002-Added-example-malformed-HTML-file-that-was-being-mis.patch, 
 plotutils-example.cgm


 I've been running Tika against a large corpus of web archives files, and I'm 
 seeing a number of false positives for image/cgm. The Tika magic is
 {code}
   <match value="BEGMF" type="string" offset="0"/>
   <match value="0x0020" mask="0xffe0" type="string" offset="0"/>
 {code}
 The issue seems to be that the second magic matcher is not very specific, 
 e.g. matching files that start 0x002a. To be fair, this is only c.700 false 
 matches out of 300 million resources, but it would be nice if this could be 
 tightened up. 
 Looking at the PRONOM signatures
 * 
 http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReportid=1048strPageToDisplay=signatures
 * 
 http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReportid=1049strPageToDisplay=signatures
 * 
 http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReportid=1050strPageToDisplay=signatures
 * 
 http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReportid=1051strPageToDisplay=signatures
 it seems we have a variable position marker that changes slightly for each 
 version. Therefore, a more robust signature should be:
 {code}
   <match value="BEGMF" type="string" offset="0"/>
   <match value="0x0020" mask="0xffe0" type="string" offset="0">
     <match value="0x10220001" type="string" offset="2:64"/>
     <match value="0x10220002" type="string" offset="2:64"/>
     <match value="0x10220003" type="string" offset="2:64"/>
     <match value="0x10220004" type="string" offset="2:64"/>
   </match>
 {code}
 Where I have assumed the filename part of the CGM file will be less than 64 
 characters long.
 Could this magic be considered for inclusion?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Assigned] (TIKA-1166) FLVParser NullPointerException

2013-08-28 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II reassigned TIKA-1166:
--

Assignee: Ray Gauss II

 FLVParser NullPointerException
 --

 Key: TIKA-1166
 URL: https://issues.apache.org/jira/browse/TIKA-1166
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.1, 1.2, 1.3, 1.4
 Environment: All
Reporter: david rapin
Assignee: Ray Gauss II
  Labels: easyfix
 Attachments: data.mp4

   Original Estimate: 10m
  Remaining Estimate: 10m

 On certain video files, the FLV parser throws an NPE on line 242.
 The piece of code causing this is the following:
 https://github.com/apache/tika/blob/1.4/tika-parsers/src/main/java/org/apache/tika/parser/video/FLVParser.java#L242
 {noformat}241: for (Entry<String, Object> entry : 
 extractedMetadata.entrySet()) {
 242:   metadata.set(entry.getKey(), entry.getValue().toString());
 243: }
 {noformat} 
 Which should probably be replaced by something like this:
 {noformat}241: for (Entry<String, Object> entry : 
 extractedMetadata.entrySet()) {
 242:   if (entry.getValue() == null) continue;
 243:   metadata.set(entry.getKey(), entry.getValue().toString());
 244: }
 {noformat} 
 Exception trace :
 {noformat}[root@hermes backend]# java -jar bin/tika-app-1.1.jar -j ./data.mp4
 Exception in thread "main" org.apache.tika.exception.TikaException: 
 Unexpected RuntimeException from 
 org.apache.tika.parser.video.FLVParser@58d9660d
 at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
 at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
 at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:130)
 at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:397)
 at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:101)
 Caused by: java.lang.NullPointerException
 at org.apache.tika.parser.video.FLVParser.parse(FLVParser.java:242)
 at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 ... 5 more
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
 at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:130)
 at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:397)
 at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:101)
 Caused by: java.lang.NullPointerException
 at org.apache.tika.parser.video.FLVParser.parse(FLVParser.java:242)
 at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 ... 5 more
 {noformat} 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (TIKA-1166) FLVParser NullPointerException

2013-08-28 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-1166.


   Resolution: Fixed
Fix Version/s: 1.5

I briefly tried a few methods of trimming the problem file's size but none 
reproduced the issue in the resulting file.

Committed a check for null in r1518318.
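For illustration only (the committed fix is the Java change in FLVParser), a minimal Python analogue of that null check when copying extracted metadata:

```python
def copy_metadata(extracted, metadata):
    """Copy extracted key/value pairs into metadata, skipping null values
    that would otherwise fail when converted to strings."""
    for key, value in extracted.items():
        if value is None:
            continue  # the null guard the patch adds
        metadata[key] = str(value)
    return metadata
```

Without the guard, a single null-valued entry (as in the attached data.mp4) aborts the whole parse with an NPE.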

 FLVParser NullPointerException
 --

 Key: TIKA-1166
 URL: https://issues.apache.org/jira/browse/TIKA-1166
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.1, 1.2, 1.3, 1.4
 Environment: All
Reporter: david rapin
Assignee: Ray Gauss II
  Labels: easyfix
 Fix For: 1.5

 Attachments: data.mp4

   Original Estimate: 10m
  Remaining Estimate: 10m

 On certain video files, the FLV parser throws an NPE on line 242.
 The piece of code causing this is the following:
 https://github.com/apache/tika/blob/1.4/tika-parsers/src/main/java/org/apache/tika/parser/video/FLVParser.java#L242
 {noformat}241: for (Entry<String, Object> entry : 
 extractedMetadata.entrySet()) {
 242:   metadata.set(entry.getKey(), entry.getValue().toString());
 243: }
 {noformat} 
 Which should probably be replaced by something like this:
 {noformat}241: for (Entry<String, Object> entry : 
 extractedMetadata.entrySet()) {
 242:   if (entry.getValue() == null) continue;
 243:   metadata.set(entry.getKey(), entry.getValue().toString());
 244: }
 {noformat} 
 Exception trace :
 {noformat}[root@hermes backend]# java -jar bin/tika-app-1.1.jar -j ./data.mp4
 Exception in thread "main" org.apache.tika.exception.TikaException: 
 Unexpected RuntimeException from 
 org.apache.tika.parser.video.FLVParser@58d9660d
 at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
 at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
 at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:130)
 at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:397)
 at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:101)
 Caused by: java.lang.NullPointerException
 at org.apache.tika.parser.video.FLVParser.parse(FLVParser.java:242)
 at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 ... 5 more
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
 at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:130)
 at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:397)
 at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:101)
 Caused by: java.lang.NullPointerException
 at org.apache.tika.parser.video.FLVParser.parse(FLVParser.java:242)
 at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 ... 5 more
 {noformat} 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (TIKA-1166) FLVParser NullPointerException

2013-08-22 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13747529#comment-13747529
 ] 

Ray Gauss II commented on TIKA-1166:


Thanks.  Is there any chance you could get that down to under, say, 50k, while 
still demonstrating the failure so that we can include it in the dist and 
create a unit test against it?

 FLVParser NullPointerException
 --

 Key: TIKA-1166
 URL: https://issues.apache.org/jira/browse/TIKA-1166
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.1, 1.2, 1.3, 1.4
 Environment: All
Reporter: david rapin
  Labels: easyfix
 Attachments: data.mp4

   Original Estimate: 10m
  Remaining Estimate: 10m

 On certain video files, the FLV parser throws an NPE on line 242.
 The piece of code causing this is the following:
 https://github.com/apache/tika/blob/1.4/tika-parsers/src/main/java/org/apache/tika/parser/video/FLVParser.java#L242
 {noformat}241: for (Entry<String, Object> entry : 
 extractedMetadata.entrySet()) {
 242:   metadata.set(entry.getKey(), entry.getValue().toString());
 243: }
 {noformat} 
 Which should probably be replaced by something like this:
 {noformat}241: for (Entry<String, Object> entry : 
 extractedMetadata.entrySet()) {
 242:   if (entry.getValue() == null) continue;
 243:   metadata.set(entry.getKey(), entry.getValue().toString());
 244: }
 {noformat} 
 Exception trace :
 {noformat}[root@hermes backend]# java -jar bin/tika-app-1.1.jar -j ./data.mp4
 Exception in thread "main" org.apache.tika.exception.TikaException: 
 Unexpected RuntimeException from 
 org.apache.tika.parser.video.FLVParser@58d9660d
 at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
 at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
 at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:130)
 at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:397)
 at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:101)
 Caused by: java.lang.NullPointerException
 at org.apache.tika.parser.video.FLVParser.parse(FLVParser.java:242)
 at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 ... 5 more
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
 at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:130)
 at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:397)
 at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:101)
 Caused by: java.lang.NullPointerException
 at org.apache.tika.parser.video.FLVParser.parse(FLVParser.java:242)
 at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
 ... 5 more
 {noformat} 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (TIKA-1154) Tika hangs on format detection of malformed HTML file.

2013-07-26 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13720694#comment-13720694
 ] 

Ray Gauss II commented on TIKA-1154:


I've been pushing the metadata-extractor Maven release through Sonatype thus 
far, but Mr. Noakes has been granted access there [1].

If there's no response to your Google code issue I can push a 2.6.2.1 release 
that upgrades xercesImpl to 2.11.0 which, on first look, compiles and has no 
test failures.


[1] https://issues.sonatype.org/browse/OSSRH-3948

 Tika hangs on format detection of malformed HTML file.
 --

 Key: TIKA-1154
 URL: https://issues.apache.org/jira/browse/TIKA-1154
 Project: Tika
  Issue Type: Bug
  Components: mime
Affects Versions: 1.4
Reporter: Andrew Jackson
Priority: Minor
 Attachments: tika-breaker.html


 We are using Tika on large web archives, which also happen to contain some 
 malformed files. In particular, we found a HTML file with binary characters 
 in the DOCTYPE declaration. This hangs Tika, either embedded or from the 
 command line, during format detection.
 An example file is attached.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: Tika Core and Parsers Test Artifacts

2013-07-22 Thread Ray Gauss II
Hi Ken, 

Yes, by other tika projects I meant tika-app, tika-bundle, tika-xmp, etc., and 
yes, each sub-project would end up with its own test-jar.

It probably makes more sense to just add the plugin to each project 
individually.

Since there's been no opposition to the concept in general I'll create a JIRA 
issue where we can discuss the details.

Regards,

Ray


On Jul 21, 2013, at 3:25 PM, Ken Krugler kkrugler_li...@transpac.com wrote:

 Hi Ray,
 
 On Jul 18, 2013, at 6:37am, Ray Gauss II wrote:
 
 Hi Ken,
 
 They recommend <type>test-jar</type> instead of <classifier> now [1], but yes.
 
 Thanks for the reference.
 
 Perhaps the other tika projects could benefit from this as well and it could 
 just go into tika-parent's build plugins.
 
 By other tika projects do you mean things like tika-app?
 
 And if it's in the tika-parent's build plugins, does that mean each 
 sub-project would wind up with its own corresponding test-jar?
 
 Thanks,
 
 -- Ken
 
 [1] http://maven.apache.org/guides/mini/guide-attached-tests.html
 
 
 On Jul 18, 2013, at 9:19 AM, Ken Krugler kkrugler_li...@transpac.com wrote:
 
 Hi Ray,
 
 On Jul 18, 2013, at 5:14am, Ray Gauss II wrote:
 
 I don't recall if we've discussed this already (I did do a brief search 
 and didn't see anything).
 
 Is there any opposition to adding test-jar Maven artifacts for tika-core 
 and tika-parsers?
 
 Seems like it would be good to allow others to extend from tests there if 
 need be.
 
 +1
 
 I assume you're talking about adding a 
 tika-(core|parsers)-version-tests.jar, so that we'd pull it in via:
 
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.4</version>
    <classifier>tests</classifier>
    <scope>test</scope>
  </dependency>
 
 -- Ken
 
 --
 Ken Krugler
 +1 530-210-6378
 http://www.scaleunlimited.com
  custom big data solutions & training
  Hadoop, Cascading, Cassandra & Solr
 
 
 
 
 
 
 
 --
 Ken Krugler
 +1 530-210-6378
 http://www.scaleunlimited.com
  custom big data solutions & training
  Hadoop, Cascading, Cassandra & Solr
 
 
 
 
 



[jira] [Created] (TIKA-1151) Maven Build Should Automatically Produce test-jar Artifacts

2013-07-22 Thread Ray Gauss II (JIRA)
Ray Gauss II created TIKA-1151:
--

 Summary: Maven Build Should Automatically Produce test-jar 
Artifacts
 Key: TIKA-1151
 URL: https://issues.apache.org/jira/browse/TIKA-1151
 Project: Tika
  Issue Type: Improvement
  Components: packaging
Reporter: Ray Gauss II
Assignee: Ray Gauss II


The Maven build should be updated to produce test jar artifacts for appropriate 
sub-projects (see below) such that developers can extend test classes by adding 
the {{test-jar}} artifact as a dependency, i.e.:
{code}
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.5-SNAPSHOT</version>
  <type>test-jar</type>
  <scope>test</scope>
</dependency>
{code}

The following sub-projects contain tests that developers might want to extend 
and their corresponding {{pom.xml}} should have the [attached 
tests|http://maven.apache.org/guides/mini/guide-attached-tests.html] added:
- tika-app
- tika-bundle
- tika-core
- tika-parsers
- tika-server
- tika-xmp



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Tika Core and Parsers Test Artifacts

2013-07-18 Thread Ray Gauss II
I don't recall if we've discussed this already (I did do a brief search and 
didn't see anything).

Is there any opposition to adding test-jar Maven artifacts for tika-core and 
tika-parsers?

Seems like it would be good to allow others to extend from tests there if need 
be.




Re: Tika Core and Parsers Test Artifacts

2013-07-18 Thread Ray Gauss II
Hi Ken,

They recommend <type>test-jar</type> instead of <classifier> now [1], but yes.

Perhaps the other tika projects could benefit from this as well and it could 
just go into tika-parent's build plugins.

Regards,

Ray


[1] http://maven.apache.org/guides/mini/guide-attached-tests.html


On Jul 18, 2013, at 9:19 AM, Ken Krugler kkrugler_li...@transpac.com wrote:

 Hi Ray,
 
 On Jul 18, 2013, at 5:14am, Ray Gauss II wrote:
 
 I don't recall if we've discussed this already (I did do a brief search and 
 didn't see anything).
 
 Is there any opposition to adding test-jar Maven artifacts for tika-core and 
 tika-parsers?
 
 Seems like it would be good to allow others to extend from tests there if 
 need be.
 
 +1
 
 I assume you're talking about adding a 
 tika-(core|parsers)-version-tests.jar, so that we'd pull it in via:
 
 <dependency>
   <groupId>org.apache.tika</groupId>
   <artifactId>tika-parsers</artifactId>
   <version>1.4</version>
   <classifier>tests</classifier>
   <scope>test</scope>
 </dependency>
 
 -- Ken
 
 --
 Ken Krugler
 +1 530-210-6378
 http://www.scaleunlimited.com
  custom big data solutions & training
  Hadoop, Cascading, Cassandra & Solr
 
 
 
 
 



[jira] [Created] (TIKA-1147) Passing a File-Based TikaInputStream to ExternalEmbedder Delete

2013-07-17 Thread Ray Gauss II (JIRA)
Ray Gauss II created TIKA-1147:
--

 Summary: Passing a File-Based TikaInputStream to ExternalEmbedder 
Delete
 Key: TIKA-1147
 URL: https://issues.apache.org/jira/browse/TIKA-1147
 Project: Tika
  Issue Type: Bug
Reporter: Ray Gauss II




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (TIKA-1147) File-Based TikaInputStreams are Deleted by ExternalEmbedder.embed

2013-07-17 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II updated TIKA-1147:
---

  Component/s: metadata
  Description: 
When an application using Tika passes {{InputStream}} objects to 
{{ExternalEmbedder.embed}} the stream is usually read into a temporary file 
which is then deleted after embedding takes place.

However, if the application passes in a file-based {{TikaInputStream}} the 
embedder ends up dealing directly with the original source file, which is 
then deleted after embedding takes place.
 Priority: Critical  (was: Major)
Affects Version/s: 1.4
 Assignee: Ray Gauss II
  Summary: File-Based TikaInputStreams are Deleted by 
ExternalEmbedder.embed  (was: Passing a File-Based TikaInputStream to 
ExternalEmbedder Delete)

 File-Based TikaInputStreams are Deleted by ExternalEmbedder.embed
 -

 Key: TIKA-1147
 URL: https://issues.apache.org/jira/browse/TIKA-1147
 Project: Tika
  Issue Type: Bug
  Components: metadata
Affects Versions: 1.4
Reporter: Ray Gauss II
Assignee: Ray Gauss II
Priority: Critical

 When an application using Tika passes {{InputStream}} objects to 
 {{ExternalEmbedder.embed}} the stream is usually read into a temporary file 
 which is then deleted after embedding takes place.
 However, if the application passes in a file-based {{TikaInputStream}} the 
 embedder ends up dealing directly with the original source file, which 
 is then deleted after embedding takes place.
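An illustrative sketch (Python, with hypothetical names; the real code is the Java {{ExternalEmbedder}}) of the ownership rule behind the fix: only delete a spooled temp file we created ourselves, never a caller-supplied source file:

```python
import os
import tempfile

def embed_with_external_tool(stream, source_path=None):
    """Spool the stream to a temp file only when the caller did not hand us
    a backing file; afterwards delete only what we created ourselves."""
    created_here = source_path is None
    if created_here:
        fd, source_path = tempfile.mkstemp()
        with os.fdopen(fd, "wb") as f:
            f.write(stream.read())
    try:
        # ... invoke the external embedder against source_path here ...
        return source_path
    finally:
        if created_here:  # never unlink a caller-owned file
            os.unlink(source_path)
```

The bug described above corresponds to running the cleanup unconditionally, which destroys the caller's original file.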

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (TIKA-1147) File-Based TikaInputStreams are Deleted by ExternalEmbedder.embed

2013-07-17 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-1147.


   Resolution: Fixed
Fix Version/s: 1.5

Resolved in r1504302.

 File-Based TikaInputStreams are Deleted by ExternalEmbedder.embed
 -

 Key: TIKA-1147
 URL: https://issues.apache.org/jira/browse/TIKA-1147
 Project: Tika
  Issue Type: Bug
  Components: metadata
Affects Versions: 1.4
Reporter: Ray Gauss II
Assignee: Ray Gauss II
Priority: Critical
 Fix For: 1.5


 When an application using Tika passes {{InputStream}} objects to 
 {{ExternalEmbedder.embed}} the stream is usually read into a temporary file 
 which is then deleted after embedding takes place.
 However, if the application passes in a file-based {{TikaInputStream}} the 
 embedder ends up dealing directly with the original source file, which 
 is then deleted after embedding takes place.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: RFC822Parser build error on gump

2013-06-28 Thread Ray Gauss II
I know very little about gump, but looking at the log the build seems to have 
skipped the mime4j artifacts altogether.


On Jun 25, 2013, at 6:25 PM, Nick Burch apa...@gagravarr.org wrote:

 Hi All
 
 Anyone have any idea about this compiler error on the tika parsers project as 
 hit by gump?
 http://vmgump.apache.org/gump/public/tika/tika-parsers/gump_work/build_tika_tika-parsers.html
 
 Gump notifications will hopefully start again soon, which'd let us find out 
 about breaking changes from upstream Apache projects in advance, so it'd be 
 good to get the build working and ready!
 
 Nick



[jira] [Commented] (TIKA-1130) .docx text extract leaves out some portions of text

2013-06-13 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13682644#comment-13682644
 ] 

Ray Gauss II commented on TIKA-1130:


I've created a unit test that reproduces the issue with a stripped down version 
of the original file.

Shall I comment out the actual test and commit?

 .docx text extract leaves out some portions of text
 ---

 Key: TIKA-1130
 URL: https://issues.apache.org/jira/browse/TIKA-1130
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.2, 1.3
 Environment: OpenJDK x86_64
Reporter: Daniel Gibby
Priority: Critical
 Attachments: Resume 6.4.13.docx


 When parsing a Microsoft Word .docx 
 (application/vnd.openxmlformats-officedocument.wordprocessingml.document), 
 certain portions of text remain unextracted.
 I have attached a .docx file that can be tested against. The 'gray' portions 
 of text are what are not extracted, while the darker colored text extracts 
 fine.
 Looking at the document.xml portion of the .docx zip file shows the text is 
 all there.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (TIKA-1130) .docx text extract leaves out some portions of text

2013-06-13 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13682924#comment-13682924
 ] 

Ray Gauss II commented on TIKA-1130:


Test file and method committed in r1492909.

This was just added onto {{OOXMLParserTest}} and named with a {{disabled}} 
prefix rather than using {{@Ignore}}.  I think we should start moving towards 
that for new test classes though.

 .docx text extract leaves out some portions of text
 ---

 Key: TIKA-1130
 URL: https://issues.apache.org/jira/browse/TIKA-1130
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.2, 1.3
 Environment: OpenJDK x86_64
Reporter: Daniel Gibby
Priority: Critical
 Attachments: Resume 6.4.13.docx


 When parsing a Microsoft Word .docx 
 (application/vnd.openxmlformats-officedocument.wordprocessingml.document), 
 certain portions of text remain unextracted.
 I have attached a .docx file that can be tested against. The 'gray' portions 
 of text are what are not extracted, while the darker colored text extracts 
 fine.
 Looking at the document.xml portion of the .docx zip file shows the text is 
 all there.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (TIKA-1135) Incorrect Cardinality and Case in IPTC Metadata Definition

2013-06-11 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-1135.


Resolution: Fixed

Resolved in r1491935.

 Incorrect Cardinality and Case in IPTC Metadata Definition
 --

 Key: TIKA-1135
 URL: https://issues.apache.org/jira/browse/TIKA-1135
 Project: Tika
  Issue Type: Bug
  Components: metadata
Affects Versions: 1.3
Reporter: Ray Gauss II
Assignee: Ray Gauss II
Priority: Minor
 Fix For: 1.4


 Some of the fields defined in the {{IPTC}} interface have incorrect 
 cardinality and metadata key names with incorrect case.
 The change of key names should be done through composite properties which 
 include deprecated versions of the incorrect names as secondary properties 
 for backwards compatibility.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (TIKA-1135) Incorrect Cardinality and Case in IPTC Metadata Definition

2013-06-11 Thread Ray Gauss II (JIRA)
Ray Gauss II created TIKA-1135:
--

 Summary: Incorrect Cardinality and Case in IPTC Metadata Definition
 Key: TIKA-1135
 URL: https://issues.apache.org/jira/browse/TIKA-1135
 Project: Tika
  Issue Type: Bug
  Components: metadata
Affects Versions: 1.3
Reporter: Ray Gauss II
Assignee: Ray Gauss II
Priority: Minor
 Fix For: 1.4


Some of the fields defined in the {{IPTC}} interface have incorrect cardinality 
and metadata key names with incorrect case.

The change of key names should be done through composite properties which 
include deprecated versions of the incorrect names as secondary properties for 
backwards compatibility.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (TIKA-1133) Ability to Allow Empty and Duplicate Tika Values for XML Elements

2013-06-10 Thread Ray Gauss II (JIRA)
Ray Gauss II created TIKA-1133:
--

 Summary: Ability to Allow Empty and Duplicate Tika Values for XML 
Elements
 Key: TIKA-1133
 URL: https://issues.apache.org/jira/browse/TIKA-1133
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.3
Reporter: Ray Gauss II
Assignee: Ray Gauss II


In some cases it is beneficial to allow empty and duplicate Tika metadata 
values for multi-valued XML elements like RDF bags.

Consider an example where the original source metadata is structured something 
like:
{code}
<Person>
  <FirstName>John</FirstName>
  <LastName>Smith</LastName>
</Person>
<Person>
  <FirstName>Jane</FirstName>
  <LastName>Doe</LastName>
</Person>
<Person>
  <FirstName>Bob</FirstName>
</Person>
<Person>
  <FirstName>Kate</FirstName>
  <LastName>Smith</LastName>
</Person>
{code}

and since Tika stores only flat metadata we transform that before invoking a 
parser to something like:
{code}
 <custom:FirstName>
  <rdf:Bag>
   <rdf:li>John</rdf:li>
   <rdf:li>Jane</rdf:li>
   <rdf:li>Bob</rdf:li>
   <rdf:li>Kate</rdf:li>
  </rdf:Bag>
 </custom:FirstName>
 <custom:LastName>
  <rdf:Bag>
   <rdf:li>Smith</rdf:li>
   <rdf:li>Doe</rdf:li>
   <rdf:li></rdf:li>
   <rdf:li>Smith</rdf:li>
  </rdf:Bag>
 </custom:LastName>
{code}

The current behavior ignores empties and duplicates and we don't know if Bob or 
Kate ever had last names.  Empties or duplicates in other positions result in 
an incorrect mapping of data.

We should allow the option to create an {{ElementMetadataHandler}} which allows 
empty and/or duplicate values.
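A toy Python illustration (not Tika's API) of the misalignment: when parallel multi-valued fields are flattened, dropping an empty entry shifts every later value onto the wrong record:

```python
def pair_fields(first_names, last_names, drop_empties=False):
    """Pair up parallel flattened fields; dropping empty entries shifts
    every later value onto the wrong record."""
    if drop_empties:
        last_names = [v for v in last_names if v]  # old handler behavior
    return list(zip(first_names, last_names))

# The example above, flattened: Bob has no last name.
firsts = ["John", "Jane", "Bob", "Kate"]
lasts = ["Smith", "Doe", "", "Smith"]
```

With empties preserved, Bob correctly gets no surname; with empties dropped, Kate's surname lands on Bob and Kate disappears entirely.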

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (TIKA-1133) Ability to Allow Empty and Duplicate Tika Values for XML Elements

2013-06-10 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-1133.


   Resolution: Fixed
Fix Version/s: 1.4

Resolved in r1491680.

 Ability to Allow Empty and Duplicate Tika Values for XML Elements
 -

 Key: TIKA-1133
 URL: https://issues.apache.org/jira/browse/TIKA-1133
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.3
Reporter: Ray Gauss II
Assignee: Ray Gauss II
 Fix For: 1.4


 In some cases it is beneficial to allow empty and duplicate Tika metadata 
 values for multi-valued XML elements like RDF bags.
 Consider an example where the original source metadata is structured 
 something like:
 {code}
 <Person>
   <FirstName>John</FirstName>
   <LastName>Smith</LastName>
 </Person>
 <Person>
   <FirstName>Jane</FirstName>
   <LastName>Doe</LastName>
 </Person>
 <Person>
   <FirstName>Bob</FirstName>
 </Person>
 <Person>
   <FirstName>Kate</FirstName>
   <LastName>Smith</LastName>
 </Person>
 {code}
 and since Tika stores only flat metadata we transform that before invoking a 
 parser to something like:
 {code}
  <custom:FirstName>
   <rdf:Bag>
    <rdf:li>John</rdf:li>
    <rdf:li>Jane</rdf:li>
    <rdf:li>Bob</rdf:li>
    <rdf:li>Kate</rdf:li>
   </rdf:Bag>
  </custom:FirstName>
  <custom:LastName>
   <rdf:Bag>
    <rdf:li>Smith</rdf:li>
    <rdf:li>Doe</rdf:li>
    <rdf:li></rdf:li>
    <rdf:li>Smith</rdf:li>
   </rdf:Bag>
  </custom:LastName>
 {code}
 The current behavior ignores empties and duplicates and we don't know if Bob 
 or Kate ever had last names.  Empties or duplicates in other positions result 
 in an incorrect mapping of data.
 We should allow the option to create an {{ElementMetadataHandler}} which 
 allows empty and/or duplicate values.



Re: MP4Parser triggers .... something betwwen an exception and endDocument() from the Contenthandlers point of view?

2013-06-07 Thread Ray Gauss II
I think the Parser interface Javadoc would make sense as a place to document, 
but I don't know if there is an existing policy.

We'll certainly need to consider things like DelegatingParsers which may be 
using other parsers to do portions of the work.

Not the principle comment you were looking for, but my 2 cents.

Ray

On Jun 7, 2013, at 7:30 AM, Christian Reuschling reuschl...@dfki.uni-kl.de 
wrote:

 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1
 
 it would be very interesting if somebody has a principle comment on this 
 thread...
 
 
 On 29.05.2013 14:42, Nick Burch wrote:
 On Wed, 29 May 2013, Christian Reuschling wrote:
 Nevertheless, in this case an Exception (like in all other parsers) or a 
 tika body with
 length zero, which is indicated at least by handler.endDocument() would be 
 the appropriate
 way, isn't it? - From the ContentHandlers point of view, there is nothing 
 in between.
 
 I'm not sure if we do have a properly documented policy on what a parser 
 should do if it
 receives a file it can't handle. For ones that are invalid (eg corrupt), I 
 believe an exception
 is the expected result. The case when the file seems valid, but can't be 
 handled by the parser,
 not sure
 
 Does anyone know if we have a policy on this, and/or where we should 
 document it?
 
 Nick
 
 - -- 
 __
 Christian Reuschling, Dipl.-Ing.(BA)
 Software Engineer
 
 Knowledge Management Department
 German Research Center for Artificial Intelligence DFKI GmbH
 Trippstadter Straße 122, D-67663 Kaiserslautern, Germany
 
 Phone: +49.631.20575-1250
 mailto:reuschl...@dfki.de  http://www.dfki.uni-kl.de/~reuschling/
 
 - Legal Company Information Required by German 
 Law--
 Geschäftsführung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
  Dr. Walter Olthoff
 Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
 Amtsgericht Kaiserslautern, HRB 2313=
 __
 -BEGIN PGP SIGNATURE-
 Version: GnuPG v2.0.19 (GNU/Linux)
 Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
 
 iEYEARECAAYFAlGxxFkACgkQ6EqMXq+WZg91CgCffJoxohycTUP0F2ha9djqAQbp
 tRAAoIbAkUjqZujYM/BHINMmbhNswir9
 =a1xL
 -END PGP SIGNATURE-



[jira] [Assigned] (TIKA-1115) ExifHandler throws NullPointerException

2013-05-01 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II reassigned TIKA-1115:
--

Assignee: Ray Gauss II

 ExifHandler throws NullPointerException
 ---

 Key: TIKA-1115
 URL: https://issues.apache.org/jira/browse/TIKA-1115
 Project: Tika
  Issue Type: Bug
  Components: metadata
Affects Versions: 1.3
 Environment: verified on Mac OSX and Ubuntu 12.04
Reporter: Lee Graber
Assignee: Ray Gauss II
  Labels: ImageMetadataExtractor
 Attachments: 654000main_transit-hubble-orig_full.jpg

   Original Estimate: 2h
  Remaining Estimate: 2h

 Notice that in the second if block, there is no check for null on the 
 retrieved datetime. I have hit this with a file which apparently has null for 
 this value. Seems like the fix is trivial.
 {code}
 public void handleDateTags(Directory directory, Metadata metadata)
         throws MetadataException {
     // Date/Time Original overrides value from ExifDirectory.TAG_DATETIME
     Date original = null;
     if (directory.containsTag(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL)) {
         original = directory.getDate(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL);
         // Unless we have GPS time we don't know the time zone so date must be set
         // as ISO 8601 datetime without timezone suffix (no Z or +/-)
         if (original != null) {
             String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.format(original); // Same time zone as Metadata Extractor uses
             metadata.set(TikaCoreProperties.CREATED, datetimeNoTimeZone);
             metadata.set(Metadata.ORIGINAL_DATE, datetimeNoTimeZone);
         }
     }
     if (directory.containsTag(ExifIFD0Directory.TAG_DATETIME)) {
         Date datetime = directory.getDate(ExifIFD0Directory.TAG_DATETIME);
         String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.format(datetime);
         metadata.set(TikaCoreProperties.MODIFIED, datetimeNoTimeZone);
         // If Date/Time Original does not exist this might be creation date
         if (metadata.get(TikaCoreProperties.CREATED) == null) {
             metadata.set(TikaCoreProperties.CREATED, datetimeNoTimeZone);
         }
     }
 }
 {code}
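The trivial fix amounts to a null guard before formatting. A self-contained sketch of the pattern, using only JDK types (the `formatOrNull` helper is hypothetical; `DATE_UNSPECIFIED_TZ` here mirrors the idea of an ISO 8601 format without a timezone suffix, pinned to UTC for determinism):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class ExifDateGuardDemo {
    // ISO 8601 without a timezone suffix (no 'Z' or +/- offset), as in the
    // issue above. Pinned to UTC here so the output is deterministic.
    static final SimpleDateFormat DATE_UNSPECIFIED_TZ;
    static {
        DATE_UNSPECIFIED_TZ = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
        DATE_UNSPECIFIED_TZ.setTimeZone(TimeZone.getTimeZone("UTC"));
    }

    // Hypothetical helper: a directory can contain the tag yet return a null
    // Date, so format only after a null check instead of throwing an NPE.
    static String formatOrNull(Date date) {
        return (date == null) ? null : DATE_UNSPECIFIED_TZ.format(date);
    }

    public static void main(String[] args) {
        System.out.println(formatOrNull(new Date(0L))); // 1970-01-01T00:00:00
        System.out.println(formatOrNull(null));         // null, no exception
    }
}
```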



[jira] [Commented] (TIKA-1115) ExifHandler throws NullPointerException

2013-05-01 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13646709#comment-13646709
 ] 

Ray Gauss II commented on TIKA-1115:


Hi Lee,

Do we have permission to include the problem file at a greatly reduced size, 
say 64px wide, as a test file?

 ExifHandler throws NullPointerException
 ---

 Key: TIKA-1115
 URL: https://issues.apache.org/jira/browse/TIKA-1115
 Project: Tika
  Issue Type: Bug
  Components: metadata
Affects Versions: 1.3
 Environment: verified on Mac OSX and Ubuntu 12.04
Reporter: Lee Graber
Assignee: Ray Gauss II
  Labels: ImageMetadataExtractor
 Attachments: 654000main_transit-hubble-orig_full.jpg

   Original Estimate: 2h
  Remaining Estimate: 2h

 Notice that in the second if block, there is no check for null on the 
 retrieved datetime. I have hit this with a file which apparently has null for 
 this value. Seems like the fix is trivial.
 {code}
 public void handleDateTags(Directory directory, Metadata metadata)
         throws MetadataException {
     // Date/Time Original overrides value from ExifDirectory.TAG_DATETIME
     Date original = null;
     if (directory.containsTag(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL)) {
         original = directory.getDate(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL);
         // Unless we have GPS time we don't know the time zone so date must be set
         // as ISO 8601 datetime without timezone suffix (no Z or +/-)
         if (original != null) {
             String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.format(original); // Same time zone as Metadata Extractor uses
             metadata.set(TikaCoreProperties.CREATED, datetimeNoTimeZone);
             metadata.set(Metadata.ORIGINAL_DATE, datetimeNoTimeZone);
         }
     }
     if (directory.containsTag(ExifIFD0Directory.TAG_DATETIME)) {
         Date datetime = directory.getDate(ExifIFD0Directory.TAG_DATETIME);
         String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.format(datetime);
         metadata.set(TikaCoreProperties.MODIFIED, datetimeNoTimeZone);
         // If Date/Time Original does not exist this might be creation date
         if (metadata.get(TikaCoreProperties.CREATED) == null) {
             metadata.set(TikaCoreProperties.CREATED, datetimeNoTimeZone);
         }
     }
 }
 {code}



[jira] [Resolved] (TIKA-1115) ExifHandler throws NullPointerException

2013-05-01 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-1115.


   Resolution: Fixed
Fix Version/s: 1.4

Resolved in r1478111

 ExifHandler throws NullPointerException
 ---

 Key: TIKA-1115
 URL: https://issues.apache.org/jira/browse/TIKA-1115
 Project: Tika
  Issue Type: Bug
  Components: metadata
Affects Versions: 1.3
 Environment: verified on Mac OSX and Ubuntu 12.04
Reporter: Lee Graber
Assignee: Ray Gauss II
  Labels: ImageMetadataExtractor
 Fix For: 1.4

 Attachments: 654000main_transit-hubble-orig_full.jpg

   Original Estimate: 2h
  Remaining Estimate: 2h

 Notice that in the second if block, there is no check for null on the 
 retrieved datetime. I have hit this with a file which apparently has null for 
 this value. Seems like the fix is trivial.
 {code}
 public void handleDateTags(Directory directory, Metadata metadata)
         throws MetadataException {
     // Date/Time Original overrides value from ExifDirectory.TAG_DATETIME
     Date original = null;
     if (directory.containsTag(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL)) {
         original = directory.getDate(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL);
         // Unless we have GPS time we don't know the time zone so date must be set
         // as ISO 8601 datetime without timezone suffix (no Z or +/-)
         if (original != null) {
             String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.format(original); // Same time zone as Metadata Extractor uses
             metadata.set(TikaCoreProperties.CREATED, datetimeNoTimeZone);
             metadata.set(Metadata.ORIGINAL_DATE, datetimeNoTimeZone);
         }
     }
     if (directory.containsTag(ExifIFD0Directory.TAG_DATETIME)) {
         Date datetime = directory.getDate(ExifIFD0Directory.TAG_DATETIME);
         String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.format(datetime);
         metadata.set(TikaCoreProperties.MODIFIED, datetimeNoTimeZone);
         // If Date/Time Original does not exist this might be creation date
         if (metadata.get(TikaCoreProperties.CREATED) == null) {
             metadata.set(TikaCoreProperties.CREATED, datetimeNoTimeZone);
         }
     }
 }
 {code}



Re: Build failed in Jenkins: Tika-trunk #994

2013-05-01 Thread Ray Gauss II
Looks like a possible build server problem.  Does anyone have access to 
manually trigger another build?

Regards,

Ray

On May 1, 2013, at 5:01 PM, Apache Jenkins Server  jenk...@builds.apache.org 
wrote:

 See https://builds.apache.org/job/Tika-trunk/994/changes



Re: Build failed in Jenkins: Tika-trunk #994

2013-05-01 Thread Ray Gauss II
 Subject: Jenkins build is back to normal : Tika-trunk #995

Yay, thanks!


On May 1, 2013, at 5:24 PM, Michael McCandless luc...@mikemccandless.com 
wrote:

 I just kicked off another build ... (it's queued).
 
 Mike McCandless
 
 http://blog.mikemccandless.com
 
 
 On Wed, May 1, 2013 at 5:12 PM, Ray Gauss II ray.ga...@alfresco.com wrote:
 Looks like a possible build server problem.  Does anyone have access to 
 manually trigger another build?
 
 Regards,
 
 Ray
 
 On May 1, 2013, at 5:01 PM, Apache Jenkins Server  
 jenk...@builds.apache.org wrote:
 
 See https://builds.apache.org/job/Tika-trunk/994/changes
 



[jira] [Commented] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document

2013-02-22 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13584194#comment-13584194
 ] 

Ray Gauss II commented on TIKA-1074:


bq. But it's a little weird throw TikaExc in response to an interrupt (ie, code 
above will be trying to catch an IE) ... I think it's cleaner to set the 
interrupt bit and let the next place that waits see the interrupt bit and throw 
IE?

That's what I found in my investigation for TIKA-775 / TIKA-1059 as well.

 Extraction should continue if an exception is hit visiting an embedded 
 document
 ---

 Key: TIKA-1074
 URL: https://issues.apache.org/jira/browse/TIKA-1074
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 1.4

 Attachments: TIKA-1074.patch, TIKA-1074.patch


 Spinoff from TIKA-1072.
 In that issue, a problematic document (still not sure if document is corrupt, 
 or possible POI bug) caused an exception when visiting the embedded documents.
 If I change Tika to suppress that exception, the rest of the document 
 extracts fine.
 So somehow I think we should be more robust here, and maybe log the 
 exception, or save/record the exception(s) somewhere so after parsing the app 
 could decide what to do about them ...



[jira] [Commented] (TIKA-1068) Metadata-extractor throws NoSuchMethodError for jpg image with xmp header data

2013-01-30 Thread Ray Gauss II (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13566693#comment-13566693
 ] 

Ray Gauss II commented on TIKA-1068:


I can't reproduce this using tika-app from either the download distribution or 
compiled from source.

We're using the 2.6.2 metadata-extractor jar from Maven central repository [1].

I'm not sure how your build is structured but perhaps you're including a 2.6.2 
metadata-extractor jar you've downloaded from elsewhere?  If so, can you try 
replacing that with the one on Maven central? 


[1] 
http://search.maven.org/#artifactdetails%7Ccom.drewnoakes%7Cmetadata-extractor%7C2.6.2%7Cjar

 Metadata-extractor throws NoSuchMethodError for jpg image with xmp header data
 --

 Key: TIKA-1068
 URL: https://issues.apache.org/jira/browse/TIKA-1068
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
Reporter: Magnus Lövgren
Priority: Critical
 Attachments: vinter080501-66.jpg


 Using Tika 1.3, parsing of jpg files throws NoSuchMethodError when the jpg 
 contains xmp data. No Error was thrown in Tika 1.2.
 The metadata-extractor was updated in Tika 1.3 (to 
 com.drewnoakes:metadata-extractor:2.6.2), See TIKA-811 (duplicated by 
 TIKA-996). That jar is badly compiled (as mentioned by Emmanuel Hugonnet as 
 comment on TIKA-915) and causes the NoSuchMethodError!
 = the metadata-extractor 2.6.2 jar needs to be replaced! Problem seems fixed 
 in metadata-extractor 2.7.0, but that isn't released yet.
 Discussions available at:
 http://code.google.com/p/metadata-extractor/issues/detail?id=39
 http://code.google.com/p/metadata-extractor/issues/detail?id=55
 Code to reproduce problem:
 =
 <dependency>
   <groupId>org.apache.tika</groupId>
   <artifactId>tika-core</artifactId>
   <version>1.3</version>
 </dependency>
 <dependency>
   <groupId>org.apache.tika</groupId>
   <artifactId>tika-xmp</artifactId>
   <version>1.3</version>
 </dependency>
 <dependency>
   <groupId>org.apache.tika</groupId>
   <artifactId>tika-parsers</artifactId>
   <version>1.3</version>
 </dependency>
 InputStream inputStream = ... // vinter080501-66.jpg file (attached)
 ContentHandler contentHandler = new BodyContentHandler(200);
 Metadata metadata = new Metadata();
 ParseContext context = new ParseContext();
 Parser parser = new AutoDetectParser();
 parser.parse(inputStream, contentHandler, metadata, context); // Throws NoSuchMethodError
 = java.lang.NoSuchMethodError: com.adobe.xmp.properties.XMPPropertyInfo.getValue()Ljava/lang/Object;
   at com.drew.metadata.xmp.XmpReader.extract(Unknown Source)
   at com.drew.imaging.jpeg.JpegMetadataReader.extractMetadataFromJpegSegmentReader(Unknown Source)
   at com.drew.imaging.jpeg.JpegMetadataReader.readMetadata(Unknown Source)
   at org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:91)
   at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)



Re: [VOTE] Apache Tika 1.3 Release Candidate #1

2013-01-20 Thread Ray Gauss II
Built on OS X, updated tika-exiftool to depend on 1.3 which compiled and passed 
tests.

+1 for release!

Cheers,

Ray


On Jan 18, 2013, at 11:30 PM, Dave Meikle loo...@gmail.com wrote:

 Hi Guys,
 
 A candidate for the Tika 1.3 release is available at:
 
http://people.apache.org/~dmeikle/apache-tika-1.3-rc1/
 
 The release candidate is a zip archive of the sources in:
 
http://svn.apache.org/repos/asf/tika/tags/tika-1.3/
 
 The SHA1 checksum of the archive is a80e45d1976e655381d6e93b50b9c7b118e9d6fc.
 
 A staged M2 repository can also be found on repository.apache.org here:
 
 https://repository.apache.org/content/repositories/orgapachetika-147/
 
 Please vote on releasing this package as Apache Tika 1.3.
 The vote is open for the next 72 hours and passes if a majority of at
 least three +1 Tika PMC votes are cast.
 
[ ] +1 Release this package as Apache Tika 1.3
[ ] -1 Do not release this package because...
 
 Here is my +1 for the release.
 
 Cheers,
 Dave



[jira] [Created] (TIKA-1059) Better Handling of InterruptedException in ExternalParser and ExternalEmbedder

2013-01-18 Thread Ray Gauss II (JIRA)
Ray Gauss II created TIKA-1059:
--

 Summary: Better Handling of InterruptedException in ExternalParser 
and ExternalEmbedder
 Key: TIKA-1059
 URL: https://issues.apache.org/jira/browse/TIKA-1059
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
Reporter: Ray Gauss II
 Fix For: 1.4


The {{ExternalParser}} and {{ExternalEmbedder}} classes currently catch 
{{InterruptedException}} and ignore it.

The methods should either call {{interrupt()}} on the current thread or 
re-throw the exception, possibly wrapped in a {{TikaException}}.

See TIKA-775 for a previous discussion.
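The pattern being asked for can be sketched in a self-contained way (illustrative only, not the actual Tika code): restore the thread's interrupt status rather than swallowing the exception, so callers further up the stack still see the interrupt.

```java
public class InterruptDemo {
    // Hypothetical helper showing the recommended handling: catch
    // InterruptedException, re-set the interrupt flag, and return.
    static boolean waitBriefly() {
        try {
            Thread.sleep(10);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return false;
        }
    }

    public static void main(String[] args) {
        Thread.currentThread().interrupt();       // simulate a pending interrupt
        boolean completed = waitBriefly();        // sleep() throws immediately
        System.out.println(completed);            // false
        System.out.println(Thread.interrupted()); // true: the flag survived
    }
}
```

The alternative discussed in the issue, wrapping in a `TikaException`, trades this transparency for a checked exception the Parser API already declares.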



[jira] [Resolved] (TIKA-775) Embed Capabilities

2013-01-18 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-775.
---

   Resolution: Fixed
Fix Version/s: (was: 1.4)
   1.3
 Assignee: Ray Gauss II

 Embed Capabilities
 --

 Key: TIKA-775
 URL: https://issues.apache.org/jira/browse/TIKA-775
 Project: Tika
  Issue Type: Improvement
  Components: general, metadata
Affects Versions: 1.0
 Environment: The default ExternalEmbedder requires that sed be 
 installed.
Reporter: Ray Gauss II
Assignee: Ray Gauss II
  Labels: embed, patch
 Fix For: 1.3

 Attachments: embed_20121029.diff, embed.diff, 
 tika-core-embed-patch.txt, tika-parsers-embed-patch.txt


 This patch defines and implements the concept of embedding tika metadata into 
 a file stream, the reverse of extraction.
 In the tika-core project an interface defining an Embedder and a generic sed 
 ExternalEmbedder implementation meant to be extended or configured are added. 
  These classes are essentially a reverse flow of the existing Parser and 
 ExternalParser classes.
 In the tika-parsers project an ExternalEmbedderTest unit test is added which 
 uses the default ExternalEmbedder (calls sed) to embed a value placed in 
 Metadata.DESCRIPTION then verify the operation by parsing the resulting 
 stream.



[jira] [Updated] (TIKA-1059) Better Handling of InterruptedException in ExternalParser and ExternalEmbedder

2013-01-18 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II updated TIKA-1059:
---

Issue Type: Improvement  (was: Bug)

 Better Handling of InterruptedException in ExternalParser and ExternalEmbedder
 --

 Key: TIKA-1059
 URL: https://issues.apache.org/jira/browse/TIKA-1059
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.3
Reporter: Ray Gauss II
 Fix For: 1.4


 The {{ExternalParser}} and {{ExternalEmbedder}} classes currently catch 
 {{InterruptedException}} and ignore it.
 The methods should either call {{interrupt()}} on the current thread or 
 re-throw the exception, possibly wrapped in a {{TikaException}}.
 See TIKA-775 for a previous discussion.



[jira] [Assigned] (TIKA-1056) unify ImageMetadataExtractor interface

2013-01-16 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II reassigned TIKA-1056:
--

Assignee: Ray Gauss II

 unify ImageMetadataExtractor interface
 --

 Key: TIKA-1056
 URL: https://issues.apache.org/jira/browse/TIKA-1056
 Project: Tika
  Issue Type: Wish
Reporter: Maciej Lizewski
Assignee: Ray Gauss II
Priority: Trivial

 there are several methods in this class that are targeted for different image 
 type but with different visibility:
 public void parseJpeg(File file);
 protected void parseTiff(InputStream stream);
 both simply extract all possible metadata from image file or stream. Would be 
 nice if parseTiff could also be public so it will be easier to create 
 custom parsers located in external jars that use this functionality.



[jira] [Resolved] (TIKA-1056) unify ImageMetadataExtractor interface

2013-01-16 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-1056.


   Resolution: Fixed
Fix Version/s: 1.3

Resolved in r1434117.

 unify ImageMetadataExtractor interface
 --

 Key: TIKA-1056
 URL: https://issues.apache.org/jira/browse/TIKA-1056
 Project: Tika
  Issue Type: Wish
Reporter: Maciej Lizewski
Assignee: Ray Gauss II
Priority: Trivial
 Fix For: 1.3


 there are several methods in this class that are targeted for different image 
 type but with different visibility:
 public void parseJpeg(File file);
 protected void parseTiff(InputStream stream);
 both simply extract all possible metadata from image file or stream. Would be 
 nice if parseTiff could also be public so it will be easier to create 
 custom parsers located in external jars that use this functionality.



[jira] [Resolved] (TIKA-962) Backwards Compatibility for Metadata.LAST_AUTHOR is Broken

2013-01-08 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-962.
---

Resolution: Fixed

This has been fixed, but I didn't resolve for 1.3 as I thought it might be 
worthy of a fix release.

 Backwards Compatibility for Metadata.LAST_AUTHOR is Broken
 --

 Key: TIKA-962
 URL: https://issues.apache.org/jira/browse/TIKA-962
 Project: Tika
  Issue Type: Bug
  Components: metadata
Affects Versions: 1.2
Reporter: Ray Gauss II
Assignee: Ray Gauss II
Priority: Critical
 Fix For: 1.3


 As a result of changes in TIKA-930, support for the deprecated 
 Metadata.LAST_AUTHOR property has been dropped.
 The new TikaCoreProperties.MODIFIED should be a composite property containing 
 Metadata.LAST_AUTHOR.
 Should we consider a fix release for this?



[jira] [Resolved] (TIKA-963) Backwards Compatibility for Metadata.DATE is Incorrect

2013-01-08 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-963.
---

Resolution: Fixed

This has been fixed, but I didn't resolve for 1.3 as I thought it might be 
worthy of a fix release.

 Backwards Compatibility for Metadata.DATE is Incorrect
 --

 Key: TIKA-963
 URL: https://issues.apache.org/jira/browse/TIKA-963
 Project: Tika
  Issue Type: Bug
  Components: metadata
Affects Versions: 1.2
Reporter: Ray Gauss II
Assignee: Ray Gauss II
Priority: Critical
 Fix For: 1.3


 Metadata.DATE was always somewhat ambiguous, but during the consolidation in 
 TIKA-930 it was incorrectly assumed that most parsers used it as a creation 
 date.
 Metadata.DATE needs to instead be part of the TikaCoreProperties.MODIFIED 
 composite property.



Re: [DISCUSS] Release Candidate for 1.3?

2013-01-08 Thread Ray Gauss II
The code for TIKA-775 [1] is on trunk but it was re-opened with some concerns, 
some of which were addressed and some of which are still open discussions, 
though I think minor enough to create separate issues if need be and resolve 
TIKA-775 as fixed.

[1] https://issues.apache.org/jira/browse/TIKA-775


On Jan 8, 2013, at 4:56 PM, Dave Meikle loo...@gmail.com wrote:

 Hi All,
 
 We have got some new features and bugs fixed with a couple of outstanding 
 binary compatibility ones (TIKA-962, TIKA-963) fixed on trunk, so I was 
 wondering if it was time for a 1.3 release?
 
 Also, happy to do the Release Management for it.
 
 Cheers,
 Dave



[jira] [Resolved] (TIKA-895) Empty title element makes Tika-generated HTML documents not open

2012-12-18 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-895.
---

Resolution: Duplicate
  Assignee: Ray Gauss II

 Empty title element makes Tika-generated HTML documents not open
 

 Key: TIKA-895
 URL: https://issues.apache.org/jira/browse/TIKA-895
 Project: Tika
  Issue Type: Bug
  Components: metadata
Affects Versions: 1.1
 Environment: Windows 7 
Reporter: Benoit MAGGI
Assignee: Ray Gauss II
Priority: Trivial
  Labels: newbie

 I try to transform an empty docx to an html file.
 Ex : java -jar tika-app-1.1.jar -x example.docx > t.html
 The html file can't be opened with Firefox, Internet Explorer and Chrome.
 The main point is that <title/> seems to be forbidden by the html specification 
 (can't get the point on html5)
 bq. http://www.w3.org/TR/html401/struct/global.html#h-7.4.2 
 bq. 7.4.2 The TITLE element 
 bq. <!-- The TITLE element is not considered part of the flow of text.
 bq.    It should be displayed, for example as the page header or
 bq.    window title. Exactly one title is required per document.
 bq. -->
 bq. <!ELEMENT TITLE - - (#PCDATA) -(%head.misc;) -- document title -->
 bq. (TITLE: http://www.w3.org/TR/html401/struct/global.html#edef-TITLE , 
 bq. %head.misc;: http://www.w3.org/TR/html401/sgml/dtd.html#head.misc )
 bq. <!ATTLIST TITLE %i18n;> (%i18n;: http://www.w3.org/TR/html401/sgml/dtd.html#i18n )
 bq. *Start tag: required, End tag: required*
 For information there was the same bug with xls
 https://issues.apache.org/jira/browse/TIKA-725
 The simple solution should be to provide an empty title by default



[jira] [Reopened] (TIKA-725) Empty title element makes Tika-generated HTML documents not open in Chromium

2012-12-18 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II reopened TIKA-725:
---

  Assignee: Ray Gauss II  (was: Jukka Zitting)

Confirmed that the problem remains when a {{TransformerHandler}} is used, such 
as those obtained from {{SAXTransformerFactory}} in {{TikaCLI}} and {{TikaGUI}}.

I've investigated and developed a workaround.

 Empty title element makes Tika-generated HTML documents not open in Chromium
 

 Key: TIKA-725
 URL: https://issues.apache.org/jira/browse/TIKA-725
 Project: Tika
  Issue Type: Bug
  Components: general
Affects Versions: 0.9
 Environment: Chromium 12 on Ubuntu Linux
Reporter: Henri Bergius
Assignee: Ray Gauss II
Priority: Minor
  Labels: html
 Fix For: 0.10


 Currently when converting Excel sheets (both XLS and XLSX), Tika generates an 
 empty title element as <title/> into the document HEAD section. This causes 
 Chromium not to display the document contents.
 Switching it to <title></title> fixes this.



[jira] [Resolved] (TIKA-914) Invalid self-closing title tag when parsing an RTF file

2012-12-18 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-914.
---

Resolution: Duplicate
  Assignee: Ray Gauss II

 Invalid self-closing title tag when parsing an RTF file
 ---

 Key: TIKA-914
 URL: https://issues.apache.org/jira/browse/TIKA-914
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.1
 Environment: Reproduced on Linux and Windows
Reporter: Nicolas Guillaumin
Assignee: Ray Gauss II
Priority: Minor
  Labels: rtf
 Attachments: test.rtf


 When parsing an RTF file with an empty TITLE metadata, the resulting HTML 
 contains a self-closing title tag:
 {code}
 $ java -jar tika-app-1.1.jar -h test.rtf
 <html xmlns="http://www.w3.org/1999/xhtml">
 <head>
 <meta name="Content-Length" content="830468"/>
 <meta name="Content-Type" content="application/rtf"/>
 <meta name="resourceName" content="test.rtf"/>
 <title/>
 </head>
 [...]
 {code}
 I believe self-closing tags are not valid in XHTML, according to 
 http://www.w3.org/TR/xhtml1/#C_3 (However there's no XHTML doctype generated 
 here, just a namespace...). Anyway this causes some browsers like Chrome to 
 fail parsing the HTML, resulting in a blank page displayed.
 The expected output would be a non self-closing empty tag: {{<title></title>}}



[jira] [Resolved] (TIKA-725) Empty title element makes Tika-generated HTML documents not open in Chromium

2012-12-18 Thread Ray Gauss II (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ray Gauss II resolved TIKA-725.
---

   Resolution: Fixed
Fix Version/s: 1.3


When a {{TransformerHandler}} is used the actual writing of the final elements 
is delegated to an XML serializer such as {{ToHTMLStream}} which extends 
{{ToStream}}.

When {{ToStream.characters}} is called with zero length it returns immediately 
and does not close the start tag of the current element, and 
{{ToStream.endElement}} checks whether the start tag is open to determine 
whether to close as {{<title/>}} or {{<title></title>}}.

It seems the code brought over from the xalan project to the JDK was locked 
down quite a bit during the transition.  When using xalan directly an alternate 
XML serializer can be specified via XSLT or other means [1], but in the JDK 
that functionality seems to have been removed as 
{{TransletOutputHandlerFactory.getSerializationHandler}} has ToHTMLStream 
hard-coded.

Additionally, ToHTMLStream is declared as final and the majority of the classes 
which one would normally extend to use a different 
{{TransletOutputHandlerFactory}} are internal, so a proper solution would 
likely involve depending on xalan directly or duplicating a whole lot of code, 
neither of which is ideal.

As a workaround, an {{ExpandedTitleContentHandler}} content handler decorator 
was added which checks for the previous fix for this issue (a call to 
{{characters(new char[0], 0, 0)}} for the title element) and, if present, changes 
the length to 1, then catches the expected {{ArrayIndexOutOfBoundsException}} 
thrown by {{ToStream.characters}}.

The result is that the title start tag is closed since the check for zero 
length passes and no character writing is attempted.
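A condensed sketch of that trick (the names here are illustrative, not Tika's; the real decorator is {{org.apache.tika.sax.ExpandedTitleContentHandler}}), wired to a plain identity transformer rather than Tika's HTML pipeline. It relies on the behavior described above: the serializer closes the open start tag before reading the character array, so the out-of-bounds error can be swallowed after the tag is closed.

```java
import java.io.StringWriter;

import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

public class TitleExpandingDemo {

    /** Illustrative decorator; forwards all SAX events to the wrapped handler. */
    static class TitleExpandingHandler extends XMLFilterImpl {
        private boolean inTitle;

        TitleExpandingHandler(ContentHandler delegate) {
            setContentHandler(delegate);
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                Attributes atts) throws SAXException {
            inTitle = "title".equalsIgnoreCase(localName);
            super.startElement(uri, localName, qName, atts);
        }

        @Override
        public void endElement(String uri, String localName, String qName)
                throws SAXException {
            inTitle = false;
            super.endElement(uri, localName, qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (inTitle && length == 0) {
                try {
                    // Report a length of 1 so the serializer closes the
                    // start tag, then swallow the expected error when it
                    // tries to read the missing character.
                    super.characters(new char[0], 0, 1);
                } catch (ArrayIndexOutOfBoundsException expected) {
                    // no characters were actually written
                }
            } else {
                super.characters(ch, start, length);
            }
        }
    }

    // Serializes an empty title element through the decorator.
    static String serialize() throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler transformer = factory.newTransformerHandler();
        transformer.getTransformer().setOutputProperty(
                OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.setResult(new StreamResult(out));

        TitleExpandingHandler handler = new TitleExpandingHandler(transformer);
        handler.startDocument();
        handler.startElement("", "title", "title", new AttributesImpl());
        handler.characters(new char[0], 0, 0); // empty title metadata
        handler.endElement("", "title", "title");
        handler.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serialize());
    }
}
```

With the decorator in place the start tag is closed before endElement runs, so the element serializes in expanded form instead of self-closing.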

{{TikaCLI}} was modified to wrap the transformer handler returned by 
{{SAXTransformerFactory}} for the {{html}} output method, so only handling of 
the {{title}} tag for HTML output will be affected by the change.

In the event that this approach has adverse effects for those using XML 
serializers other than those present in the JDK, the change to {{TikaCLI}} can 
be reverted or made an option.

Those calling Tika programmatically will need to wrap their transformer 
handlers in an {{ExpandedTitleContentHandler}} as well, e.g.:

{code}
SAXTransformerFactory factory = (SAXTransformerFactory) 
SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, indent);
handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, encoding);
handler.setResult(new StreamResult(output));
return new ExpandedTitleContentHandler(handler);
{code}

Resolved in r1423538.


[1] http://xml.apache.org/xalan-j/usagepatterns.html

 Empty title element makes Tika-generated HTML documents not open in Chromium
 

 Key: TIKA-725
 URL: https://issues.apache.org/jira/browse/TIKA-725
 Project: Tika
  Issue Type: Bug
  Components: general
Affects Versions: 0.9
 Environment: Chromium 12 on Ubuntu Linux
Reporter: Henri Bergius
Assignee: Ray Gauss II
Priority: Minor
  Labels: html
 Fix For: 1.3, 0.10


 Currently when converting Excel sheets (both XLS and XLSX), Tika generates an 
 empty title element, as <title/>, into the document HEAD section. This causes 
 Chromium not to display the document contents.
 Switching it to <title></title> fixes this.


