Re: renaming master?
Hi all,

Apologies for not being able to be very involved over the past few years, but still trying to follow along and hoping to get time to contribute in the future.

Another option might be ‘stable’?

- Ray

> On Jun 16, 2020, at 1:31 PM, Tim Allison wrote:
>
> All,
>
> As you may have seen, there's a movement to rename the "master" branch to
> "main" or "trunk" (at least in the U.S.)[1][2]. Github is doing this, and
> I personally think this makes sense.
>
> Are there any objections if we change "master"? If we do change it, is
> there a preference for "main", "trunk" or something else?
>
> My personal preference would be for trunk, but I'm open.
>
> Best,
>
> Tim
>
> [1] https://www.zdnet.com/article/github-to-replace-master-with-alternative-term-to-avoid-slavery-references/
> [2] https://www.bbc.com/news/technology-53050955
[jira] [Commented] (TIKA-2056) Installing exiftool causes ForkParserIntegration test errors
[ https://issues.apache.org/jira/browse/TIKA-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436705#comment-15436705 ] Ray Gauss II commented on TIKA-2056: My guess is that when Exiftool is available on the command line the existing [external parser is enabled|https://github.com/apache/tika/blob/master/tika-core/src/main/resources/org/apache/tika/parser/external/tika-external-parsers.xml] as part of the {{CompositeExternalParser}} which would get included in the {{AutoDetectParser}} and something in that chain is failing serialization. Perhaps because [ExternalParser.LineConsumer|https://github.com/apache/tika/blob/master/tika-core/src/main/java/org/apache/tika/parser/external/ExternalParser.java#L59] is not Serializable? > Installing exiftool causes ForkParserIntegration test errors > > > Key: TIKA-2056 > URL: https://issues.apache.org/jira/browse/TIKA-2056 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.14 >Reporter: Chris A. Mattmann > > [~rgauss] maybe you can help me with this. For some reason when I was trying > your PR, I got all sorts of weird errors that I thought had to do with your > PR, but in fact, had to do with Fork Parser Integration test. [~kkrugler] > I've seen you've contributed to the Fork parser tests so tagging you on this > too. Any reason you guys can think of that exiftool causes the Fork parser > integration tests to fail? > Here's the log msg (that I thought was due to the Sentiment parser, but is in > fact not!): > {noformat} > [INFO] Changes detected - recompiling the module! > [INFO] Compiling 124 source files to > /Users/mattmann/tmp/tika1.14/tika-parsers/target/test-classes > [INFO] > /Users/mattmann/tmp/tika1.14/tika-parsers/src/test/java/org/apache/tika/parser/odf/ODFParserTest.java: > Some input files use or override a deprecated API. 
> [INFO] > /Users/mattmann/tmp/tika1.14/tika-parsers/src/test/java/org/apache/tika/parser/odf/ODFParserTest.java: > Recompile with -Xlint:deprecation for details. > [INFO] > [INFO] --- maven-surefire-plugin:2.18.1:test (default-test) @ tika-parsers --- > [INFO] Surefire report directory: > /Users/mattmann/tmp/tika1.14/tika-parsers/target/surefire-reports > --- > T E S T S > --- > Running org.apache.tika.parser.fork.ForkParserIntegrationTest > Tests run: 5, Failures: 1, Errors: 3, Skipped: 0, Time elapsed: 2.46 sec <<< > FAILURE! - in org.apache.tika.parser.fork.ForkParserIntegrationTest > testForkedTextParsing(org.apache.tika.parser.fork.ForkParserIntegrationTest) > Time elapsed: 0.185 sec <<< ERROR! > org.apache.tika.exception.TikaException: Unable to serialize AutoDetectParser > to pass to the Forked Parser > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) > at java.util.ArrayList.writeObject(ArrayList.java:762) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) > at java.util.ArrayList.writeObject(ArrayList.java:762) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method
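[Editor's note] The non-Serializable-field hypothesis above is easy to demonstrate outside Tika. Below is a minimal stand-alone sketch (the class, interface, and method names here are hypothetical, not Tika's actual code): a Serializable object holding a functional-interface field fails in ObjectOutputStream the same way the ForkParser integration test does.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationCheck {
    // Hypothetical stand-in for ExternalParser's LineConsumer interface.
    interface LineConsumer {
        void consume(String line);
    }

    // A Serializable "parser" holding a non-Serializable field,
    // mirroring the hypothesis in the comment above.
    static class ParserLike implements Serializable {
        private final LineConsumer consumer = line -> { };
    }

    // Returns true if the object survives Java serialization.
    public static boolean canSerialize(Object o) {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (IOException e) {
            // NotSerializableException is a subclass of IOException;
            // this is the branch the lambda field lands in.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(canSerialize(new ParserLike()));
    }
}
```

Marking such fields `transient` (or making the interface extend Serializable) is the usual way out of this failure mode.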
[jira] [Commented] (TIKA-774) ExifTool Parser
[ https://issues.apache.org/jira/browse/TIKA-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209162#comment-15209162 ]

Ray Gauss II commented on TIKA-774:
-----------------------------------

bq. we should add a static check for whether exiftool is available and adjust "handled" mimes at that point.

I think we'll find other areas to improve on as well; I just wanted to get the ball rolling again on the contribution and review, as we had to close the source on the stand-alone project mentioned above.

bq. I should have a chance to look more closely early next week, but I doubt there's reason to wait for my feedback.

We'd value your feedback, and it's been over 4 years, we can wait a few more weeks. :)

bq. Is this a replacement for the one I hacked together?

There's the possibility for the two to coexist, perhaps requiring this parser to be explicitly called programmatically. At a high level the biggest differences are:
# As mentioned in TIKA-1639, there's an extensive mapping from ExifTool's namespace to proper Tika properties (currently done programmatically)
# It includes the ability to embed, i.e. writing metadata back into binary files. (TIKA-776)

> ExifTool Parser
> ---------------
>
> Key: TIKA-774
> URL: https://issues.apache.org/jira/browse/TIKA-774
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Affects Versions: 1.0
> Environment: Requires ExifTool to be installed (http://www.sno.phy.queensu.ca/~phil/exiftool/)
> Reporter: Ray Gauss II
> Labels: features, new-parser, newbie, patch
> Fix For: 1.13
>
> Attachments: testJPEG_IPTC_EXT.jpg, tika-core-exiftool-parser-patch.txt, tika-parsers-exiftool-parser-patch.txt
>
> Adds an external parser that calls ExifTool to extract extended metadata fields from images and other content types.
> In the core project:
> An ExifTool interface is added which contains Property objects that define the metadata fields available.
> An additional Property constructor for internalTextBag type.
> In the parsers project: > An ExiftoolMetadataExtractor is added which does the work of calling ExifTool > on the command line and mapping the response to tika metadata fields. This > extractor could be called instead of or in addition to the existing > ImageMetadataExtractor and JempboxExtractor under TiffParser and/or > JpegParser but those have not been changed at this time. > An ExiftoolParser is added which calls only the ExiftoolMetadataExtractor. > An ExiftoolTikaMapper is added which is responsible for mapping the ExifTool > metadata fields to existing tika and Drew Noakes metadata fields if enabled. > An ElementRdfBagMetadataHandler is added for extracting multi-valued RDF Bag > implementations in XML files. > An ExifToolParserTest is added which tests several expected XMP and IPTC > metadata values in testJPEG_IPTC_EXT.jpg. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
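[Editor's note] The core of such an external parser is a command-line invocation plus output mapping. Below is a minimal sketch, assuming exiftool is installed on the PATH and using its `-S` (short tag name) output format; the class and method names are illustrative, not those of the attached patches.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ExiftoolSketch {
    // Turn exiftool -S output ("TagName: value" per line) into a map.
    public static Map<String, String> parseOutput(List<String> lines) {
        Map<String, String> fields = new HashMap<>();
        for (String line : lines) {
            int i = line.indexOf(": ");
            if (i > 0) {
                fields.put(line.substring(0, i), line.substring(i + 2));
            }
        }
        return fields;
    }

    // Invoke exiftool on a file; assumes the binary is on the PATH.
    public static Map<String, String> extract(String path) throws Exception {
        Process p = new ProcessBuilder("exiftool", "-S", path)
                .redirectErrorStream(true)
                .start();
        List<String> lines = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                lines.add(line);
            }
        }
        p.waitFor();
        return parseOutput(lines);
    }
}
```

The bulk of the contribution described above lies in the step this sketch omits: mapping raw tag names like `ImageWidth` onto typed Tika Property keys.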
[jira] [Updated] (TIKA-1906) ExternalParser No Longer Supports Commands in Array Format
[ https://issues.apache.org/jira/browse/TIKA-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Gauss II updated TIKA-1906: --- Fix Version/s: 1.13 2.0 > ExternalParser No Longer Supports Commands in Array Format > -- > > Key: TIKA-1906 > URL: https://issues.apache.org/jira/browse/TIKA-1906 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.9 > Reporter: Ray Gauss II > Assignee: Ray Gauss II > Fix For: 2.0, 1.13 > > > After the changes in TIKA-1638 the ExternalParser now ignores commands > specified as a string array and assumes commands will be in a single string > with a space delimiter. > Both formats should be supported. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
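[Editor's note] The fix the issue asks for amounts to normalizing the two command representations before execution. A sketch of the idea (the class and method names are hypothetical, not Tika's actual ExternalParser API):

```java
public class CommandFormats {
    // Accept a command either as a pre-tokenized array or as a single
    // space-delimited string, and always hand back the array form.
    public static String[] toCommandArray(Object command) {
        if (command instanceof String[]) {
            return (String[]) command;            // already tokenized
        }
        return ((String) command).split(" ");     // single-string form
    }
}
```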
[jira] [Resolved] (TIKA-1906) ExternalParser No Longer Supports Commands in Array Format
[ https://issues.apache.org/jira/browse/TIKA-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Gauss II resolved TIKA-1906. Resolution: Fixed > ExternalParser No Longer Supports Commands in Array Format > -- > > Key: TIKA-1906 > URL: https://issues.apache.org/jira/browse/TIKA-1906 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.9 > Reporter: Ray Gauss II > Assignee: Ray Gauss II > Fix For: 2.0, 1.13 > > > After the changes in TIKA-1638 the ExternalParser now ignores commands > specified as a string array and assumes commands will be in a single string > with a space delimiter. > Both formats should be supported. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1906) ExternalParser No Longer Supports Commands in Array Format
[ https://issues.apache.org/jira/browse/TIKA-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15206138#comment-15206138 ] Ray Gauss II edited comment on TIKA-1906 at 3/22/16 2:37 PM: - bq. agreed, sorry must have missed that as I thought I fixed it for both per TIKA-1638. No worries. I guess I'll leave this open until the tika-2.x build is happy again. was (Author: rgauss): bq. agreed, sorry must have missed that as I thought I fixed it for both per TIKA-1638. No worries. I guess I'll leave this open until the tika-2.x is happy again. > ExternalParser No Longer Supports Commands in Array Format > -- > > Key: TIKA-1906 > URL: https://issues.apache.org/jira/browse/TIKA-1906 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.9 > Reporter: Ray Gauss II > Assignee: Ray Gauss II > > After the changes in TIKA-1638 the ExternalParser now ignores commands > specified as a string array and assumes commands will be in a single string > with a space delimiter. > Both formats should be supported. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1906) ExternalParser No Longer Supports Commands in Array Format
[ https://issues.apache.org/jira/browse/TIKA-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15206138#comment-15206138 ] Ray Gauss II commented on TIKA-1906: bq. agreed, sorry must have missed that as I thought I fixed it for both per TIKA-1638. No worries. I guess I'll leave this open until the tika-2.x is happy again. > ExternalParser No Longer Supports Commands in Array Format > -- > > Key: TIKA-1906 > URL: https://issues.apache.org/jira/browse/TIKA-1906 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.9 > Reporter: Ray Gauss II > Assignee: Ray Gauss II > > After the changes in TIKA-1638 the ExternalParser now ignores commands > specified as a string array and assumes commands will be in a single string > with a space delimiter. > Both formats should be supported. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15196030#comment-15196030 ] Ray Gauss II commented on TIKA-1607: bq. It might be more easily configurable to use the ParsingEmbeddedDocExtractor as is and let users write their own XMP parsers, no? Yes, and we could do that in addition to the above, but if I'm understanding correctly that alone would still force users to write 'Tika-based' XMP parsers rather than allowing them access to the RAW XMP encoded bytes you're referring to in the last sentence, which I do agree might be helpful in some cases. So the idea for the second part would be to get the user those bytes in a way that hopefully doesn't require sweeping changes to the parsers (I'm thinking of this with an eye towards all types of embedded resources, not just XMP). The {{EmbeddedDocumentExtractor}} interface's {{parseEmbedded}} method currently takes a {{Metadata}} object which is only associated with the embedded resource (not the same metadata object associated with the 'container' file) and is populated with the embedded resource's filename, type, size, etc. Option 1. We might be able to do something like: {code} /** * Extension of {@link EmbeddedDocumentExtractor} which stores the embedded * resources during parsing for retrieval. */ public interface StoringEmbeddedDocumentExtractor extends EmbeddedDocumentExtractor { /** * Gets the map of known embedded resources or null if no resources * were stored during parsing * * @return the embedded resources */ Map<Metadata, byte[]> getEmbeddedResources(); } {code} then modify ParsingEmbeddedDocumentExtractor to implement it with an option which 'turns it on'? Option 2. Provide a separate implementation of StoringEmbeddedDocumentExtractor that users could set in the context? Option 3. Just pull {{FileEmbeddedDocumentExtractor}} out of {{TikaCLI}} and make them use temp files? Option 4. 
Maybe the effort is better spent on said sweeping parser changes to include some {{EmbeddedResources}} object to be optionally populated along with the {{Metadata}} in the {{Parser.parse}} method? Other options? Maybe they don't need the RAW XMP? I'm also aware that we've strayed a bit from the original issue here of structured metadata. Should we create a separate issue? > Introduce new arbitrary object key/values data structure for persistence of > Tika Metadata > - > > Key: TIKA-1607 > URL: https://issues.apache.org/jira/browse/TIKA-1607 > Project: Tika > Issue Type: Improvement > Components: core, metadata >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Critical > Fix For: 1.13 > > Attachments: TIKA-1607_bytes_dom_values.patch, > TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, > TIKA-1607v3.patch > > > I am currently working implementing more comprehensive extraction and > enhancement of the Tika support for Phone number extraction and metadata > modeling. > Right now we utilize the String[] multivalued support available within Tika > to persist phone numbers as > {code} > Metadata: String: String[] > Metadata: phonenumbers: number1, number2, number3, ... > {code} > I would like to propose we extend multi-valued support outside of the > String[] paradigm by implementing a more abstract Collection of Objects such > that we could consider and implement the phone number use case as follows > {code} > Metadata: String: Object > {code} > Where Object could be a Collection<HashMap HashMap> e.g. > {code} > Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), > (LibPN-NumberType: International), (etc: etc)...), (+1292611054: > LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) > (etc)] > {code} > There are obvious backwards compatibility issues with this approach... > additionally it is a fundamental change to the code Metadata API. 
I hope that > the <String, Object> Mapping however is flexible enough to allow me to model > Tika Metadata the way I want. > Any comments folks? Thanks > Lewis -- This message was sent by Atlassian JIRA (v6.3.4#6332)
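[Editor's note] Option 1/2 in the comment above can be sketched with plain collections. Here Tika's Metadata type is reduced to a Map<String, String> stand-in so the sketch is self-contained; the real parseEmbedded signature also takes a ContentHandler, a ParseContext, and an outputHtml flag.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class StoringExtractorSketch {
    // Embedded resources buffered in memory, keyed by their metadata.
    private final Map<Map<String, String>, byte[]> resources = new LinkedHashMap<>();

    // Buffer one embedded resource; an advanced user retrieves the
    // raw bytes afterwards instead of having them parsed or written
    // out to temp files.
    public void parseEmbedded(InputStream stream, Map<String, String> metadata) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = stream.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            resources.put(metadata, buf.toByteArray());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Mirrors getEmbeddedResources() from the proposed interface.
    public Map<Map<String, String>, byte[]> getEmbeddedResources() {
        return resources;
    }
}
```

This also makes the memory risk noted in the thread concrete: every embedded resource stays on the heap until the caller drops the extractor.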
[jira] [Comment Edited] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15193845#comment-15193845 ] Ray Gauss II edited comment on TIKA-1607 at 3/15/16 1:57 PM: - Have we already considered treating the XMP packets more like embedded resources and making it easier for the advanced users described above to get at those resources, perhaps providing an {{EmbeddedDocumentExtractor}} implementation they could use without resorting to extracting them to files? was (Author: rgauss): Have we already considered treating the XMP packets more like embedded resources and making it easier for the advanced users described above to get at those resources, perhaps providing an {{EmbeddedResourceHandler}} implementation they could use without resorting to extracting them to files? > Introduce new arbitrary object key/values data structure for persistence of > Tika Metadata > - > > Key: TIKA-1607 > URL: https://issues.apache.org/jira/browse/TIKA-1607 > Project: Tika > Issue Type: Improvement > Components: core, metadata >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Critical > Fix For: 1.13 > > Attachments: TIKA-1607_bytes_dom_values.patch, > TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, > TIKA-1607v3.patch > > > I am currently working implementing more comprehensive extraction and > enhancement of the Tika support for Phone number extraction and metadata > modeling. > Right now we utilize the String[] multivalued support available within Tika > to persist phone numbers as > {code} > Metadata: String: String[] > Metadata: phonenumbers: number1, number2, number3, ... 
> {code} > I would like to propose we extend multi-valued support outside of the > String[] paradigm by implementing a more abstract Collection of Objects such > that we could consider and implement the phone number use case as follows > {code} > Metadata: String: Object > {code} > Where Object could be a Collection<HashMap HashMap> e.g. > {code} > Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), > (LibPN-NumberType: International), (etc: etc)...), (+1292611054: > LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) > (etc)] > {code} > There are obvious backwards compatibility issues with this approach... > additionally it is a fundamental change to the code Metadata API. I hope that > the <String, Object> Mapping however is flexible enough to allow me to model > Tika Metadata the way I want. > Any comments folks? Thanks > Lewis -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15195326#comment-15195326 ] Ray Gauss II commented on TIKA-1607: Sorry, I meant {{EmbeddedDocumentExtractor}} (edited comment). We can currently dump stuff to files in some parsers with the {{--extract}} CLI option which sticks a {{FileEmbeddedDocumentExtractor}} in the context. The current default for PDF is the {{ParsingEmbeddedDocumentExtractor}}. Perhaps we could add an option to ParsingEmbeddedDocumentExtractor which, when enabled, would also save the embedded resources in memory for an advanced user to do whatever they need, knowing the risk and resources required for that option? Or provide some other in-memory implementation that advanced users could explicitly set in the context? > Introduce new arbitrary object key/values data structure for persistence of > Tika Metadata > - > > Key: TIKA-1607 > URL: https://issues.apache.org/jira/browse/TIKA-1607 > Project: Tika > Issue Type: Improvement > Components: core, metadata >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Critical > Fix For: 1.13 > > Attachments: TIKA-1607_bytes_dom_values.patch, > TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, > TIKA-1607v3.patch > > > I am currently working implementing more comprehensive extraction and > enhancement of the Tika support for Phone number extraction and metadata > modeling. > Right now we utilize the String[] multivalued support available within Tika > to persist phone numbers as > {code} > Metadata: String: String[] > Metadata: phonenumbers: number1, number2, number3, ... 
> {code} > I would like to propose we extend multi-valued support outside of the > String[] paradigm by implementing a more abstract Collection of Objects such > that we could consider and implement the phone number use case as follows > {code} > Metadata: String: Object > {code} > Where Object could be a Collection<HashMap HashMap> e.g. > {code} > Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), > (LibPN-NumberType: International), (etc: etc)...), (+1292611054: > LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) > (etc)] > {code} > There are obvious backwards compatibility issues with this approach... > additionally it is a fundamental change to the code Metadata API. I hope that > the <String, Object> Mapping however is flexible enough to allow me to model > Tika Metadata the way I want. > Any comments folks? Thanks > Lewis -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15193845#comment-15193845 ] Ray Gauss II commented on TIKA-1607: Have we already considered treating the XMP packets more like embedded resources and making it easier for the advanced users described above to get at those resources, perhaps providing an {{EmbeddedResourceHandler}} implementation they could use without resorting to extracting them to files? > Introduce new arbitrary object key/values data structure for persistence of > Tika Metadata > - > > Key: TIKA-1607 > URL: https://issues.apache.org/jira/browse/TIKA-1607 > Project: Tika > Issue Type: Improvement > Components: core, metadata >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Critical > Fix For: 1.13 > > Attachments: TIKA-1607_bytes_dom_values.patch, > TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, > TIKA-1607v3.patch > > > I am currently working implementing more comprehensive extraction and > enhancement of the Tika support for Phone number extraction and metadata > modeling. > Right now we utilize the String[] multivalued support available within Tika > to persist phone numbers as > {code} > Metadata: String: String[] > Metadata: phonenumbers: number1, number2, number3, ... > {code} > I would like to propose we extend multi-valued support outside of the > String[] paradigm by implementing a more abstract Collection of Objects such > that we could consider and implement the phone number use case as follows > {code} > Metadata: String: Object > {code} > Where Object could be a Collection<HashMap HashMap> e.g. > {code} > Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), > (LibPN-NumberType: International), (etc: etc)...), (+1292611054: > LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) > (etc)] > {code} > There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the code Metadata API. I hope that > the <String, Object> Mapping however is flexible enough to allow me to model > Tika Metadata the way I want. > Any comments folks? Thanks > Lewis -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15167135#comment-15167135 ] Ray Gauss II commented on TIKA-1607: I know there can be multiple XMP packets in a single file, but do we have many other examples where we'd need multiple DOMs associated with a single file? I'm trying to understand if the metadata is really the right place for this. > Introduce new arbitrary object key/values data structure for persistence of > Tika Metadata > - > > Key: TIKA-1607 > URL: https://issues.apache.org/jira/browse/TIKA-1607 > Project: Tika > Issue Type: Improvement > Components: core, metadata >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Critical > Fix For: 1.13 > > Attachments: TIKA-1607_bytes_dom_values.patch, > TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, > TIKA-1607v3.patch > > > I am currently working implementing more comprehensive extraction and > enhancement of the Tika support for Phone number extraction and metadata > modeling. > Right now we utilize the String[] multivalued support available within Tika > to persist phone numbers as > {code} > Metadata: String: String[] > Metadata: phonenumbers: number1, number2, number3, ... > {code} > I would like to propose we extend multi-valued support outside of the > String[] paradigm by implementing a more abstract Collection of Objects such > that we could consider and implement the phone number use case as follows > {code} > Metadata: String: Object > {code} > Where Object could be a Collection<HashMap HashMap> e.g. > {code} > Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), > (LibPN-NumberType: International), (etc: etc)...), (+1292611054: > LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) > (etc)] > {code} > There are obvious backwards compatibility issues with this approach... > additionally it is a fundamental change to the code Metadata API. 
I hope that > the <String, Object> Mapping however is flexible enough to allow me to model > Tika Metadata the way I want. > Any comments folks? Thanks > Lewis -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15154205#comment-15154205 ] Ray Gauss II commented on TIKA-1607: In my experience people gravitate towards 'other' buckets, i.e.: "I didn't know (bother to read) what the designated ones were so I just used 'other'". {{getBytes}} feels like 'other'. While people could still do really stupid things with {{getDOM}} if they wanted to, {{getBytes}} seems to encourage a developer to go ahead and try to use each frame of a 120fps 8K video as a 'metadata' value. An extreme and unlikely example of course, but you get the gist. > Introduce new arbitrary object key/values data structure for persistence of > Tika Metadata > - > > Key: TIKA-1607 > URL: https://issues.apache.org/jira/browse/TIKA-1607 > Project: Tika > Issue Type: Improvement > Components: core, metadata >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Critical > Fix For: 1.13 > > Attachments: TIKA-1607_bytes_dom_values.patch, > TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, > TIKA-1607v3.patch > > > I am currently working implementing more comprehensive extraction and > enhancement of the Tika support for Phone number extraction and metadata > modeling. > Right now we utilize the String[] multivalued support available within Tika > to persist phone numbers as > {code} > Metadata: String: String[] > Metadata: phonenumbers: number1, number2, number3, ... > {code} > I would like to propose we extend multi-valued support outside of the > String[] paradigm by implementing a more abstract Collection of Objects such > that we could consider and implement the phone number use case as follows > {code} > Metadata: String: Object > {code} > Where Object could be a Collection<HashMap HashMap> e.g. 
> {code} > Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), > (LibPN-NumberType: International), (etc: etc)...), (+1292611054: > LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) > (etc)] > {code} > There are obvious backwards compatibility issues with this approach... > additionally it is a fundamental change to the code Metadata API. I hope that > the <String, Object> Mapping however is flexible enough to allow me to model > Tika Metadata the way I want. > Any comments folks? Thanks > Lewis -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149231#comment-15149231 ] Ray Gauss II commented on TIKA-1607: Are we opening a can of worms by encouraging the use of a byte array directly with no restrictions on length, etc.? > Introduce new arbitrary object key/values data structure for persistence of > Tika Metadata > - > > Key: TIKA-1607 > URL: https://issues.apache.org/jira/browse/TIKA-1607 > Project: Tika > Issue Type: Improvement > Components: core, metadata >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Critical > Fix For: 1.13 > > Attachments: TIKA-1607_bytes_dom_values.patch, > TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, > TIKA-1607v3.patch > > > I am currently working implementing more comprehensive extraction and > enhancement of the Tika support for Phone number extraction and metadata > modeling. > Right now we utilize the String[] multivalued support available within Tika > to persist phone numbers as > {code} > Metadata: String: String[] > Metadata: phonenumbers: number1, number2, number3, ... > {code} > I would like to propose we extend multi-valued support outside of the > String[] paradigm by implementing a more abstract Collection of Objects such > that we could consider and implement the phone number use case as follows > {code} > Metadata: String: Object > {code} > Where Object could be a Collection<HashMap HashMap> e.g. > {code} > Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), > (LibPN-NumberType: International), (etc: etc)...), (+1292611054: > LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...) > (etc)] > {code} > There are obvious backwards compatibility issues with this approach... > additionally it is a fundamental change to the code Metadata API. I hope that > the <String, Object> Mapping however is flexible enough to allow me to model > Tika Metadata the way I want. 
> Any comments folks? Thanks > Lewis -- This message was sent by Atlassian JIRA (v6.3.4#6332)
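[Editor's note] The String -> Object shape proposed in the quoted description maps naturally onto plain Java collections. A sketch of the phone-number example (the LibPN-* key names are taken from the example above; this is an illustration of the proposed shape, not the patch's API):

```java
import java.util.List;
import java.util.Map;

public class StructuredMetadataSketch {
    // Each metadata key maps to an arbitrary Object; here the value is
    // a list of per-number property bags rather than a flat String[].
    public static Map<String, Object> example() {
        return Map.of(
            "phonenumbers", List.of(
                Map.of("number", "+162648743476",
                       "LibPN-CountryCode", "US",
                       "LibPN-NumberType", "International"),
                Map.of("number", "+1292611054",
                       "LibPN-CountryCode", "UK",
                       "LibPN-NumberType", "International")));
    }
}
```

The backwards-compatibility concern in the description is visible even here: callers expecting `String[]` values would need a cast or a new accessor once values become arbitrary Objects.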
[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules
[ https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130386#comment-15130386 ]

Ray Gauss II commented on TIKA-1824:
------------------------------------

bq. Thank you, Bob Paulin! Again, this is fantastic.

Indeed, thanks!

bq. Perhaps add "parser(s?)" to the artifactId, e.g. tika-parser-cad-module

Now that the change is in there it seems a bit redundant to have both "parser" and "module" in every artifact ID. {{tika-parser-*}} follows a least-to-most-specific precedence, so perhaps we could just remove "module"?

I had some concerns over the apparent duplication of dependencies / versions, but it looks like that will be addressed in TIKA-1847.

> Tika 2.0 - Create Initial Parser Modules
> ----------------------------------------
>
> Key: TIKA-1824
> URL: https://issues.apache.org/jira/browse/TIKA-1824
> Project: Tika
> Issue Type: Improvement
> Affects Versions: 2.0
> Reporter: Bob Paulin
> Assignee: Bob Paulin
>
> Create initial break down of parser modules.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14746719#comment-14746719 ] Ray Gauss II commented on TIKA-1607: Hi [~talli...@mitre.org], apologies for the delay in responding here.
1. POJOs bq. We might have better documentation of POJOs and compile-time guarantees about methods and typed values. Agreed, but the DOM persistence doesn't preclude us from also using Java 'helper' classes that know how to more easily get and set values for particular schemas that we'd like to focus on. bq. Schemas/xsds can enforce plenty, I know, but would we want to build an xsd and maintain it? I'd vote for sticking as true to a specification's original schema as possible when there is one, but whether we'd want to build and maintain one for those that don't is a good question.
2. Passthrough bq. why couldn't we literally pass that through via the String version of the xml? I think we could, but we'd first have to 'merge' with the metadata being modeled by the parsers and could then allow access to the full DOM {{Document}} object, which clients could easily serialize to a string if need be.
3. Serialization to JSON There seem to be several libraries available that can help with XML to JSON, though I don't think this would belong in core.
4. Multilingual fields Great question. XMP uses RDF and xml:lang:
{noformat}
<rdf:Alt>
  <rdf:li xml:lang="x-default">quick brown fox</rdf:li>
  <rdf:li xml:lang="it">rapido fox marrone</rdf:li>
</rdf:Alt>
{noformat}
that's one possibility. bq. I'm wondering if we want to add structure only where structured data doesn't exist within the document and let the client parse what they'd like out of structured metadata that is in the document? This also relates to passthrough above, but one thing to keep in mind is that the metadata we're parsing could be coming from several different parts of the binary. For example, EXIF doesn't necessarily also live in XMP (though most apps also write it there these days) and there can be more than one XMP packet present in a file. 
It would be nice to bring these different sources into a unified persistence structure, even if for simpler metadata everything lives at the top level. bq. how do we transfer as much normalized/structured metadata as possible in as simple a way to the end user. This also gets back to passthrough and the possibility of access to the full DOM {{Document}} object. Thanks for keeping the discussion going. We obviously need to take great care in changing such a fundamental area of the code. > Introduce new arbitrary object key/values data structure for persistence of > Tika Metadata > - > > Key: TIKA-1607 > URL: https://issues.apache.org/jira/browse/TIKA-1607 > Project: Tika > Issue Type: Improvement > Components: core, metadata >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Critical > Fix For: 1.11 > > Attachments: TIKA-1607v1_rough_rough.patch, > TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
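On the passthrough point, handing clients a DOM {{Document}} that they could serialize to a string needs nothing beyond the JDK; a minimal sketch (the class and helper names here are invented for illustration):

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class DomToString {
    // Serialize a DOM Document to an XML string with the JDK's Transformer.
    public static String serialize(Document doc) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Convenience: an empty Document to build on.
    public static Document newDoc() {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = newDoc();
        doc.appendChild(doc.createElementNS("http://tika.apache.org/", "tika:metadata"));
        System.out.println(serialize(doc));
    }
}
```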
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706706#comment-14706706 ] Ray Gauss II commented on TIKA-1607: Yes, by shoehorn I meant that the index is embedded in the key (in this case the sub-group name) and that all parsers and consuming client apps must know to utilize that syntax, rather than either a separate, explicit index field or a well-defined structure like that of the DOM approach. Perhaps we should flesh out a solid requirements list (possibly using the [comment above|https://issues.apache.org/jira/browse/TIKA-1607?focusedCommentId=14660441&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14660441] as a starting point). Introduce new arbitrary object key/values data structure for persistence of Tika Metadata - Key: TIKA-1607 URL: https://issues.apache.org/jira/browse/TIKA-1607 Project: Tika Issue Type: Improvement Components: core, metadata Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Critical Fix For: 1.11 Attachments: TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704880#comment-14704880 ] Ray Gauss II commented on TIKA-1607: I did see that, but I was after full URI namespaces, i.e. {{http://purl.org/dc/elements/1.1/}}, not just prefixes. The OODT approach looks like you'd have to shoehorn the index into the group name, much like the tika-ffmpeg workaround, rather than a more strictly defined structure. OODT might support deeper structures in the inner {{Group}} class, but the public methods appear to only support a single level? For example, how could one get to something like the value of the city of the 3rd contact's 2nd address, i.e. p1:contact[2]/p1:address[1]/p1:city? We could mimic XPath syntax, but the DOM approach allows us to use {{javax.xml.xpath.XPath}} processing. From the [test mentioned above|https://github.com/rgauss/tika/blob/trunk/tika-core/src/test/java/org/apache/tika/metadata/TestMetadata.java#L394]:
{code:java}
String expression = "/tika:metadata/vcard:tel[1]/vcard:uri";
assertEquals(telUri, metadata.getValueByXPath(expression));
{code}
The DOM approach would also allow us to leverage things like attributes to further describe a particular metadata value in the future if need be. We might also be able to pass through entire metadata structures that Tika hasn't explicitly modeled. It's certainly a larger change, but I think it gives us a lot more options. 
Introduce new arbitrary object key/values data structure for persistence of Tika Metadata - Key: TIKA-1607 URL: https://issues.apache.org/jira/browse/TIKA-1607 Project: Tika Issue Type: Improvement Components: core, metadata Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Critical Fix For: 1.11 Attachments: TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
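A self-contained sketch of the kind of {{javax.xml.xpath.XPath}} processing mentioned above, with prefixes resolved to full URI namespaces via a {{NamespaceContext}} (the sample document and class name here are invented; the fork's {{getValueByXPath}} presumably wraps similar machinery):

```java
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathLookup {
    // Sample document standing in for a DOM-backed metadata store.
    public static final String SAMPLE =
            "<tika:metadata xmlns:tika=\"http://tika.apache.org/\">"
            + "<vcard:tel xmlns:vcard=\"urn:ietf:params:xml:ns:vcard-4.0\">"
            + "<vcard:uri>tel:+1-800-555-1234</vcard:uri></vcard:tel>"
            + "</tika:metadata>";

    // Evaluate an XPath expression with tika/vcard prefixes bound to full URIs.
    public static String lookup(String xml, String expression) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // required for prefix-aware XPath
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    if ("tika".equals(prefix)) return "http://tika.apache.org/";
                    if ("vcard".equals(prefix)) return "urn:ietf:params:xml:ns:vcard-4.0";
                    return XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String namespaceURI) { return null; }
                public Iterator<String> getPrefixes(String namespaceURI) { return null; }
            });
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(lookup(SAMPLE, "/tika:metadata/vcard:tel[1]/vcard:uri"));
    }
}
```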
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703924#comment-14703924 ] Ray Gauss II commented on TIKA-1607: I've put together the start of the DOM metadata store option on [GitHub as well|https://github.com/apache/tika/compare/trunk...rgauss:trunk]. The crux of the change is using a {{org.w3c.dom.Document}} object instead of a {{Map<String, String[]>}} as the metadata store, and Property objects based on {{QName}}s instead of Strings. A few things to note:
* This does bring in commons-lang for XML escaping; we could change that if need be
* It seems mostly backwards compatible. tika-xmp is failing at the moment, but I think it's just a matter of applying the same techniques there
* String-based accessors weren't deprecated, but could be if targeting Tika 2.0
* There are several TODOs that would still need to be addressed
The [test added|https://github.com/rgauss/tika/blob/trunk/tika-core/src/test/java/org/apache/tika/metadata/TestMetadata.java#L394] demonstrates creating a DOM structure, adding it to the metadata, then pulling it out both programmatically and via XPath expression (sticking to the telephone number example). That programmatic creation of the DOM structure is a bit cumbersome, and we could certainly employ Java classes specific to each standard as a convenience (somewhat similar to [~talli...@mitre.org]'s proposal), but I do like the generic nature of the DOM store. The {{toString}} method of the metadata object after building that example is properly structured and namespaced XML:
{code:xml}
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<tika:metadata xmlns:tika="http://tika.apache.org/">
  <vcard:tel xmlns:vcard="urn:ietf:params:xml:ns:vcard-4.0">
    <vcard:parameters>
      <vcard:type>
        <vcard:text>work</vcard:text>
      </vcard:type>
    </vcard:parameters>
    <vcard:uri>tel:+1-800-555-1234</vcard:uri>
  </vcard:tel>
</tika:metadata>
{code}
There's obviously lots of room for improvement and discussion, but I wanted to put it out there before the momentum on this slows. Introduce new arbitrary object key/values data structure for persistence of Tika Metadata - Key: TIKA-1607 URL: https://issues.apache.org/jira/browse/TIKA-1607 Project: Tika Issue Type: Improvement Components: core, metadata Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Critical Fix For: 1.11 Attachments: TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
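The programmatic DOM creation described above can be illustrated with plain {{org.w3c.dom}} calls building the same vcard telephone structure (JDK only; none of the fork's wrapper methods are reproduced here):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// JDK-only sketch of building the vcard telephone example as a DOM tree.
public class BuildVcardTel {
    static final String TIKA_NS = "http://tika.apache.org/";
    static final String VCARD_NS = "urn:ietf:params:xml:ns:vcard-4.0";

    public static Document build() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElementNS(TIKA_NS, "tika:metadata");
            doc.appendChild(root);

            Element tel = doc.createElementNS(VCARD_NS, "vcard:tel");
            root.appendChild(tel);

            // parameters/type/text carries the "work" classification
            Element parameters = doc.createElementNS(VCARD_NS, "vcard:parameters");
            tel.appendChild(parameters);
            Element type = doc.createElementNS(VCARD_NS, "vcard:type");
            parameters.appendChild(type);
            Element text = doc.createElementNS(VCARD_NS, "vcard:text");
            text.setTextContent("work");
            type.appendChild(text);

            Element uri = doc.createElementNS(VCARD_NS, "vcard:uri");
            uri.setTextContent("tel:+1-800-555-1234");
            tel.appendChild(uri);
            return doc;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = build();
        System.out.println(doc.getElementsByTagNameNS(VCARD_NS, "uri")
                .item(0).getTextContent());
    }
}
```

Every element needs its namespace repeated, which is exactly the verbosity a per-standard helper class would hide.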
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704108#comment-14704108 ] Ray Gauss II commented on TIKA-1607: [~chrismattmann], I did. It seemed more similar to the XPath-like workaround I described, with the notion of groups in the store, rather than the full-fledged DOM store proposed in the GitHub fork, i.e. I didn't see where anything was namespaced. Introduce new arbitrary object key/values data structure for persistence of Tika Metadata - Key: TIKA-1607 URL: https://issues.apache.org/jira/browse/TIKA-1607 Project: Tika Issue Type: Improvement Components: core, metadata Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Critical Fix For: 1.11 Attachments: TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1607) Introduce new arbitrary object key/values data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660441#comment-14660441 ] Ray Gauss II commented on TIKA-1607: To clarify, the work mentioned above that uses an XPath-like syntax is only a workaround for mapping structured metadata into the current 'flat' metadata model in Tika. I fully support moving towards a structured metadata store in a 2.0 timeframe (maybe that's now?). This is simply restating some of what's already been said, but there are many aspects to consider during that refactoring:
* Moving towards properly namespacing metadata (even if, for now, our serialization of it only contains a prefix)
* Backwards compatibility for simple string key/values
* Enabling easy serialization to XML and JSON
* Enabling easy discovery of at least top level elements
* Lightweight dependencies in tika-core
* Possible representation of binary data
* Not re-inventing the wheel
Given the above, perhaps we'd want to consider using Java DOM ({{org.w3c.dom.*}}) classes programmatically as a metadata store, appending and getting child nodes, etc., rather than hard coding POJOs for each metadata standard we want to support. I'll try to find some time to put together an example patch for that approach in the next few days. Introduce new arbitrary object key/values data structure for persistence of Tika Metadata - Key: TIKA-1607 URL: https://issues.apache.org/jira/browse/TIKA-1607 Project: Tika Issue Type: Improvement Components: core, metadata Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Critical Fix For: 1.10 Attachments: TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
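On the easy discovery of top-level elements, a DOM store makes that a plain child-node walk; a minimal JDK-only sketch (class and method names here are invented for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class TopLevelNames {
    // Collect the qualified names of the root element's immediate children.
    public static List<String> topLevelNames(Document doc) {
        List<String> names = new ArrayList<>();
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                names.add(child.getNodeName());
            }
        }
        return names;
    }

    // Small sample store: two top-level, namespaced entries.
    public static Document sampleDoc() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElementNS("http://tika.apache.org/", "tika:metadata");
            doc.appendChild(root);
            root.appendChild(doc.createElementNS(
                    "http://purl.org/dc/elements/1.1/", "dc:title"));
            root.appendChild(doc.createElementNS(
                    "urn:ietf:params:xml:ns:vcard-4.0", "vcard:tel"));
            return doc;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(topLevelNames(sampleDoc()));  // [dc:title, vcard:tel]
    }
}
```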
[jira] [Commented] (TIKA-1607) Introduce new HashMap<String, Object> data structure for persistence of Tika Metadata
[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505054#comment-14505054 ] Ray Gauss II commented on TIKA-1607: We've had a few discussions on structured metadata over the years, some of which was captured in the [MetadataRoadmap Wiki page|http://wiki.apache.org/tika/MetadataRoadmap]. I'd agree that we should strive to maintain backwards compatibility for simple values. I think we should also consider serialization of the metadata store, not just in the {{Serializable}} interface sense, but perhaps being able to easily marshal the entire metadata store into JSON and XML. As [~gagravarr] points out, work has been done to express structured metadata via the existing metadata store. In that email thread you'll find reference to the external [tika-ffmpeg project|https://github.com/AlfrescoLabs/tika-ffmpeg]. Introduce new HashMap<String, Object> data structure for persistence of Tika Metadata - Key: TIKA-1607 URL: https://issues.apache.org/jira/browse/TIKA-1607 Project: Tika Issue Type: Improvement Components: core, metadata Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Critical Fix For: 1.9 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1594) Webp parsing support
[ https://issues.apache.org/jira/browse/TIKA-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484463#comment-14484463 ] Ray Gauss II commented on TIKA-1594: I'd recommend that for now we trim, since {{Metadata.IMAGE_*}} properties are defined as {{Property.internalInteger}}. In the future I think we should consider changing to (or perhaps adding) more generally useful dimension properties, like {{Dimensions}} from the [additional properties of XMP|http://www.adobe.com/content/dam/Adobe/en/devnet/xmp/pdfs/XMPSpecificationPart2.pdf] (section 1.2.2.2), which includes a {{unit}} field. Webp parsing support Key: TIKA-1594 URL: https://issues.apache.org/jira/browse/TIKA-1594 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.7 Reporter: Jan Kronquist webp content type is correctly detected, but parsing is not supported. I noticed that metadata-extractor 2.8.0 supports webp: https://github.com/drewnoakes/metadata-extractor/issues/85 However, Tika currently does not work with this version (I tried manually overriding the dependency). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
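The trim-for-now suggestion amounts to stripping a unit suffix before storing into an integer-typed property; a hedged sketch with an invented helper (not actual parser code):

```java
public class DimensionValues {
    // Hypothetical helper: reduce a dimension string such as "512 pixels"
    // to the leading integer so it fits an integer-typed metadata property.
    public static int toPixels(String raw) {
        String leading = raw.trim().split("\\s+")[0];
        return Integer.parseInt(leading);
    }

    public static void main(String[] args) {
        System.out.println(toPixels("512 pixels"));  // 512
    }
}
```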
[jira] [Commented] (TIKA-634) Command Line Parser for Metadata Extraction
[ https://issues.apache.org/jira/browse/TIKA-634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342547#comment-14342547 ] Ray Gauss II commented on TIKA-634: --- Also see the [tika-ffmpeg project|https://github.com/AlfrescoLabs/tika-ffmpeg]. There we recently had to patch {{ExternalParser}} for some stream parsing concurrency problems, which should be raised in a separate issue here shortly. Command Line Parser for Metadata Extraction --- Key: TIKA-634 URL: https://issues.apache.org/jira/browse/TIKA-634 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 0.9 Reporter: Nick Burch Assignee: Nick Burch Priority: Minor As discussed on the mailing list: http://mail-archives.apache.org/mod_mbox/tika-dev/201104.mbox/%3calpine.deb.2.00.1104052028380.29...@urchin.earth.li%3E This issue is to track improvements in the ExternalParser support to handle metadata extraction, and probably easier configuration of an external parser too. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1510) FFMpeg installed but not parsing video files
[ https://issues.apache.org/jira/browse/TIKA-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273520#comment-14273520 ] Ray Gauss II commented on TIKA-1510: Yes. The only reason I haven't myself is that I've been trying to find some time to refactor the vorbis stuff per the previous [conversation|http://mail-archives.apache.org/mod_mbox/tika-dev/201408.mbox/%3calpine.deb.2.02.1408221155450.8...@urchin.earth.li%3E] with [~gagravarr]. FFMpeg installed but not parsing video files Key: TIKA-1510 URL: https://issues.apache.org/jira/browse/TIKA-1510 Project: Tika Issue Type: Bug Components: parser Environment: FFMPEG, Mac OS X 10.9 with HomeBrew Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.7 I have FFMPEG installed with homebrew:
{noformat}
# brew install ffmpeg
{noformat}
I've got some AVI files and have tried to parse them with Tika:
{noformat}
[chipotle:~/Desktop/drone-vids] mattmann% tika -m SPOT11_01\ 17.AVI
Content-Length: 334917340
Content-Type: video/x-msvideo
X-Parsed-By: org.apache.tika.parser.EmptyParser
resourceName: SPOT11_01 17.AVI
{noformat}
I took a look at the ExternalParser, which is configured for using ffmpeg if it's installed. It seems it only works on:
{code:xml}
<mime-types>
  <mime-type>video/avi</mime-type>
  <mime-type>video/mpeg</mime-type>
</mime-types>
{code}
I'll add video/x-msvideo and see if that fixes it. I also stumbled upon the work by [~rgauss] at Github - Ray I noticed there is no parser in that work: https://github.com/AlfrescoLabs/tika-ffmpeg But there seems to be metadata extraction code, etc. Ray should I do something with this? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1510) FFMpeg installed but not parsing video files
[ https://issues.apache.org/jira/browse/TIKA-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273049#comment-14273049 ] Ray Gauss II commented on TIKA-1510: In that project there is a [{{TikaIntrinsicAVFfmpegParserFactory}}|https://github.com/AlfrescoLabs/tika-ffmpeg/blob/master/src/main/java/org/apache/tika/parser/ffmpeg/TikaIntrinsicAVFfmpegParserFactory.java] which is used to set up an {{ExternalParser}}. See the [{{TikaIntrinsicAVFfmpegParserTest}}|https://github.com/AlfrescoLabs/tika-ffmpeg/blob/master/src/test/java/org/apache/tika/parser/ffmpeg/TikaIntrinsicAVFfmpegParserTest.java] for an example of its use. FFMpeg installed but not parsing video files Key: TIKA-1510 URL: https://issues.apache.org/jira/browse/TIKA-1510 Project: Tika Issue Type: Bug Components: parser Environment: FFMPEG, Mac OS X 10.9 with HomeBrew Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.7 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-93) OCR support
[ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14134822#comment-14134822 ] Ray Gauss II commented on TIKA-93: -- You could use [{{org.junit.Assume}}|http://stackoverflow.com/questions/1689242/conditionally-ignoring-tests-in-junit-4] so the tests will be skipped rather than reported as passing. Perhaps we should consider the Maven Failsafe Plugin as well? OCR support --- Key: TIKA-93 URL: https://issues.apache.org/jira/browse/TIKA-93 Project: Tika Issue Type: New Feature Components: parser Reporter: Jukka Zitting Assignee: Chris A. Mattmann Priority: Minor Fix For: 1.7 Attachments: Petr_tika-config.xml, TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch, TesseractOCR_Tyler.patch, TesseractOCR_Tyler_v2.patch, TesseractOCR_Tyler_v3.patch, testOCR.docx, testOCR.pdf, testOCR.pptx I don't know of any decent open source pure Java OCR libraries, but there are command line OCR tools like Tesseract (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to extract text content (where available) from image files. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
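The availability check such an {{Assume}} guard would key off can be done with plain {{ProcessBuilder}}; a sketch with an invented helper name (a JUnit test would then call {{Assume.assumeTrue(isAvailable("tesseract"))}} so the test is skipped, not passed, when the tool is missing):

```java
import java.io.IOException;

public class CommandCheck {
    // Hypothetical helper: returns true if the command can be started.
    // Intended for use behind a JUnit Assume.assumeTrue(...) guard.
    public static boolean isAvailable(String command) {
        try {
            Process p = new ProcessBuilder(command).start();
            p.destroy();
            return true;
        } catch (IOException e) {
            return false;  // command not found (or not executable)
        }
    }

    public static void main(String[] args) {
        System.out.println(isAvailable("definitely-not-a-real-command-xyz"));  // false
    }
}
```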
[jira] [Commented] (TIKA-93) OCR support
[ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102175#comment-14102175 ] Ray Gauss II commented on TIKA-93: -- Can you create a config object and pass that in the {{ParseContext}}, similar to what [{{PDFParser}}|https://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java] does with a [{{PDFParserConfig}}|https://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java] entry?
{code}
// config from context, or default if not set via context
PDFParserConfig localConfig = context.get(PDFParserConfig.class, defaultConfig);
{code}
OCR support --- Key: TIKA-93 URL: https://issues.apache.org/jira/browse/TIKA-93 Project: Tika Issue Type: New Feature Components: parser Reporter: Jukka Zitting Assignee: Chris A. Mattmann Priority: Minor Fix For: 1.7 -- This message was sent by Atlassian JIRA (v6.2#6252)
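The pattern being suggested, a class-keyed context consulted with a default fallback, can be sketched without any Tika dependency; {{SimpleContext}} is an invented stand-in mirroring how {{ParseContext.get(Class, T)}} is used above:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of a type-safe, class-keyed context in the style of ParseContext.
public class SimpleContext {
    private final Map<String, Object> entries = new HashMap<>();

    public <T> void set(Class<T> key, T value) {
        entries.put(key.getName(), value);
    }

    // Return the stored instance, or the given default when unset.
    public <T> T get(Class<T> key, T defaultValue) {
        Object value = entries.get(key.getName());
        return value != null ? key.cast(value) : defaultValue;
    }

    public static void main(String[] args) {
        SimpleContext context = new SimpleContext();
        Integer fromDefault = context.get(Integer.class, 42);
        context.set(Integer.class, 7);
        Integer fromContext = context.get(Integer.class, 42);
        System.out.println(fromDefault + " " + fromContext);  // 42 7
    }
}
```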
[jira] [Commented] (TIKA-93) OCR support
[ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102193#comment-14102193 ] Ray Gauss II commented on TIKA-93: -- Apologies, I jumped in late and only glanced at the comment thread. OCR support --- Key: TIKA-93 URL: https://issues.apache.org/jira/browse/TIKA-93 Project: Tika Issue Type: New Feature Components: parser Reporter: Jukka Zitting Assignee: Chris A. Mattmann Priority: Minor Fix For: 1.7 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1328) Translate Metadata and Content
[ https://issues.apache.org/jira/browse/TIKA-1328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14026783#comment-14026783 ] Ray Gauss II commented on TIKA-1328: Leaning towards the whitelist approach, perhaps we could add an {{isTranslatable}} field / method and corresponding constructor to the {{Property}} class (with a default of false) and update the properties we want to support translation on? Translate Metadata and Content -- Key: TIKA-1328 URL: https://issues.apache.org/jira/browse/TIKA-1328 Project: Tika Issue Type: New Feature Reporter: Tyler Palsulich Fix For: 1.7 Right now, Translation is only done on Strings. Ideally, users would be able to turn on translation while parsing. I can think of a couple options: - Make a TranslateAutoDetectParser. Automatically detect the file type, parse it, then translate the content. - Make a Context switch. When true, translate the content regardless of the parser used. I'm not sure the best way to go about this method, but I prefer it over another Parser. Regardless, we need a black or white list for translation. I think black list would be the way to go -- which fields should not be translated (dates, versions, ...) Any ideas? Also, somewhat unrelated, does anyone know of any other open source translation libraries? If we were really lucky, it wouldn't depend on an online service. -- This message was sent by Atlassian JIRA (v6.2#6252)
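The whitelist idea in the comment above can be sketched with a cut-down mock of the {{Property}} class: a translatable flag defaults to false, and only whitelisted properties opt in via an extra factory method. This is illustrative only, not Tika's real {{Property}} API.

```java
// Cut-down mock of the proposed isTranslatable flag on Property:
// translation is off by default; whitelisted properties opt in via a
// second factory. Illustrative only, not Tika's real Property class.
public class TranslatableProperty {
    private final String name;
    private final boolean translatable;

    private TranslatableProperty(String name, boolean translatable) {
        this.name = name;
        this.translatable = translatable;
    }

    // Existing-style factory: translation stays off by default.
    public static TranslatableProperty internalText(String name) {
        return new TranslatableProperty(name, false);
    }

    // New factory for properties whitelisted for translation.
    public static TranslatableProperty internalText(String name, boolean translatable) {
        return new TranslatableProperty(name, translatable);
    }

    public String getName() { return name; }
    public boolean isTranslatable() { return translatable; }
}
```

A translating decorator could then check {{isTranslatable()}} per property instead of consulting a separate blacklist of field names.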
[jira] [Commented] (TIKA-1320) extract text from jpeg in solr tika
[ https://issues.apache.org/jira/browse/TIKA-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14017613#comment-14017613 ] Ray Gauss II commented on TIKA-1320: I'm not sure we have enough context in the description of this issue to help much here. As [~thaichat04] points out, OCR is one way of obtaining text from an image, but there are also several forms of embedded metadata that can be extracted. Is there specific text you're looking to extract? extract text from jpeg in solr tika --- Key: TIKA-1320 URL: https://issues.apache.org/jira/browse/TIKA-1320 Project: Tika Issue Type: New Feature Reporter: muruganv Labels: features Original Estimate: 24h Remaining Estimate: 24h How to extract text from jpeg or image format or tiff in solr tika -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1294) Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs
[ https://issues.apache.org/jira/browse/TIKA-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012393#comment-14012393 ] Ray Gauss II commented on TIKA-1294: Hi [~talli...@apache.org], The changes look good, thanks! One minor point on conventions: I think enums are typically uppercase? Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs --- Key: TIKA-1294 URL: https://issues.apache.org/jira/browse/TIKA-1294 Project: Tika Issue Type: Improvement Reporter: Tim Allison Assignee: Tim Allison Priority: Trivial Fix For: 1.6 Attachments: TIKA-1294.patch, TIKA-1294v1.patch TIKA-1268 added the capability to extract embedded images as regular embedded resources...a great feature! However, for some use cases, it might not be desirable to extract those types of embedded resources. I see two ways of allowing the client to choose whether or not to extract those images: 1) set a value in the metadata for the extracted images that identifies them as embedded PDXObjectImages vs regular image attachments. The client can then choose not to process embedded resources with a given metadata value. 2) allow the client to set a parameter in the PDFConfig object. My initial proposal is to go with option 2, and I'll attach a patch shortly. -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: [DISCUSS] Centralizing JSON handling of Metadata
Hi Tim, 1) Sounds good to me. 2) I do think we want core as lean as possible, so my vote would be for a separate project/module, similar to what was done with tika-xmp. Perhaps something like tika-serialization-json to indicate other formats may follow the same precedent? 3) Similar to above, perhaps org.apache.tika.metadata.serialization.json? Just curious, any particular reason for GSON over Jackson? Regards, Ray On May 28, 2014 at 1:32:41 PM, Allison, Timothy B. (talli...@mitre.org) wrote: All, Nick recommended I put the question to the dev list for discussion. It might be useful to centralize our json handling of Metadata. We are currently using different libraries and doing different things in CLI and in tika-server. 1) Do we want to centralize json handling of Metadata? 2) If so, where? Core? I share Nick's hesitance to add a dependency to core. OTOH, GSON is only 186k, but this would add potential for jar conflicts with folks integrating Tika, and it doesn't feel like a core function to me...it is a handy decorator for applications. 3) Wherever it goes, what package do we want to put it in? I like Nick's recommendations, with a slight preference for the second (oat.utils.json). Thank you! 

Best, Tim -Original Message- From: Nick Burch (JIRA) [mailto:j...@apache.org] Sent: Wednesday, May 28, 2014 12:41 PM To: dev@tika.apache.org Subject: [jira] [Commented] (TIKA-1311) Centralize JSON handling of Metadata [ https://issues.apache.org/jira/browse/TIKA-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011287#comment-14011287 ] Nick Burch commented on TIKA-1311: -- If we put it into core, we'd need to add another dependency (to GSON) which isn't ideal, so we might want to run the plan past the dev list first to see what people think (core tends to try to have a very minimal set of deps, unlike the other modules) Package wise, org.apache.tika.metadata.json is what I'd lean towards, otherwise utils.json Centralize JSON handling of Metadata Key: TIKA-1311 URL: https://issues.apache.org/jira/browse/TIKA-1311 Project: Tika Issue Type: Task Reporter: Tim Allison Priority: Minor When json was initially added to TIKA CLI (TIKA-213), there was a recommendation to centralize JSON handling of Metadata, potentially putting it in core. On a recent bug fix (TIKA-1291), the same recommendation was repeated especially noting that we now handle JSON/Metadata differently in CLI and server. Let's centralize JSON handling in core and use GSON. We should add a serializer and a deserializer so that users don't have to reinvent that wheel. -- This message was sent by Atlassian JIRA (v6.2#6252)
RE: [DISCUSS] Centralizing JSON handling of Metadata
I’ve used Jackson a bit but I don’t have a strong preference either. I’m generally a fan of splitting things up into very small projects to keep the dependency hierarchy as clean as possible. In this example, if we decided to do a direct serialization to, say, a Mongo DBObject in the future the json project wouldn’t need to bring in Mongo dependencies. Apache Camel does a good job of segmenting things [1]. However, that sort of modularization is probably a broader discussion than what we need for this particular issue, so between those two I’d vote for tika-serialization. Regards, Ray [1] https://git-wip-us.apache.org/repos/asf?p=camel.git;a=tree;f=components;h=1132bd1bb98a446aec97d5c7bc4d032276a65d83;hb=HEAD On May 28, 2014 at 8:42:03 PM, Allison, Timothy B. (talli...@mitre.org) wrote: Thank you, Ray! In almost reverse order, I've been using Jackson for this already, but I used GSON in TIKA-1291 because that's what CLI was already using. In GSON's favor, the jar is a bit smaller, but I have no real preference or reason to pick one over the other. I'm not a json-blackbelt (or, I guess that would be blckbelt), so I'm happy to go with either. A new compilation unit makes sense. I'm wondering if we want to be that specific? tika-serialization? Or, maybe just tika-utils? Package name looks good to me. Thanks, again! Best, Tim -Original Message- From: Ray Gauss II [mailto:ray.ga...@alfresco.com] Sent: Wednesday, May 28, 2014 3:07 PM To: dev@tika.apache.org; Allison, Timothy B. Subject: Re: [DISCUSS] Centralizing JSON handling of Metadata Hi Tim, 1) Sounds good to me. 2) I do think we want core as lean as possible, so my vote would be for a separate project/module, similar to what was done with tika-xmp. Perhaps something like tika-serialization-json to indicate other formats may follow in the same precedence? 3) Similar to above, perhaps org.apache.tika.metadata.serialization.json? Just curious, any particular reason for GSON over Jackson? 
Regards, Ray On May 28, 2014 at 1:32:41 PM, Allison, Timothy B. (talli...@mitre.org) wrote: All, Nick recommended I put the question to the dev list for discussion. It might be useful to centralize our json handling of Metadata. We are now currently using different libraries and doing different things in CLI and in tika-server. 1) Do we want to centralize json handling of Metadata? 2) If so, where? Core? I share Nick's hesitance to add a dependency to core. OTOH, GSON is only 186k, but this would add potential for jar conflicts with folks integrating Tika, and it doesn't feel like a core function to me...it is a handy decorator for applications. 3) Wherever it goes, what package do we want to put it in? I like Nick's recommendations, with a slight preference for the second (oat.utils.json). Thank you! Best, Tim -Original Message- From: Nick Burch (JIRA) [mailto:j...@apache.org] Sent: Wednesday, May 28, 2014 12:41 PM To: dev@tika.apache.org Subject: [jira] [Commented] (TIKA-1311) Centralize JSON handling of Metadata [ https://issues.apache.org/jira/browse/TIKA-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011287#comment-14011287 ] Nick Burch commented on TIKA-1311: -- If we put it into core, we'd need to add another dependency (to GSON) which isn't ideal, so we might want to run the plan past the dev list first to see what people think (core tends to try to have a very minimal set of deps, unlike the other modules) Package wise, org.apache.tika.metadata.json is what I'd lean towards, otherwise utils.json Centralize JSON handling of Metadata Key: TIKA-1311 URL: https://issues.apache.org/jira/browse/TIKA-1311 Project: Tika Issue Type: Task Reporter: Tim Allison Priority: Minor When json was initially added to TIKA CLI (TIKA-213), there was a recommendation to centralize JSON handling of Metadata, potentially putting it in core. 
On a recent bug fix (TIKA-1291), the same recommendation was repeated especially noting that we now handle JSON/Metadata differently in CLI and server. Let's centralize JSON handling in core and use GSON. We should add a serializer and a deserializer so that users don't have to reinvent that wheel. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1278) Expose PDF Avg Char and Spacing Tolerance Config Params
[ https://issues.apache.org/jira/browse/TIKA-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995298#comment-13995298 ] Ray Gauss II commented on TIKA-1278: Hi [~tallison], I thought about adding to {{PDFParser.properties}} but decided against it since PDFBox could change the default values or change the properties' scale or use, and if we weren't aware of that change we'd be inadvertently overriding those defaults. Similarly with {{PDFParserConfig.configure}}, PDFBox's defaults seem to work well for most people. We can certainly reconsider setting those defaults and/or adding other config if there are particular parameters people would find useful. Expose PDF Avg Char and Spacing Tolerance Config Params --- Key: TIKA-1278 URL: https://issues.apache.org/jira/browse/TIKA-1278 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.5 Reporter: Ray Gauss II Assignee: Ray Gauss II Fix For: 1.6 {{PDFParserConfig}} should allow for override of PDFBox's {{averageCharTolerance}} and {{spacingTolerance}} settings as noted by a TODO comment in {{PDF2XHTML}}. Additionally, {{PDF2XHTML}}'s use of {{PDFParserConfig}} should be changed slightly to allow for extension of that config class and its configuration behavior. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1295) Make some Dublin Core items multi-valued
[ https://issues.apache.org/jira/browse/TIKA-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995945#comment-13995945 ] Ray Gauss II commented on TIKA-1295: +1 for the data model more accurately reflecting the standard and for multilingual fields, but with a simple text bag how would you know which value corresponds to which language? I think this is another example that highlights the need for a more structured underlying metadata store as mentioned in section IV of the [metadata roadmap|http://wiki.apache.org/tika/MetadataRoadmap]. Make some Dublin Core items multi-valued Key: TIKA-1295 URL: https://issues.apache.org/jira/browse/TIKA-1295 Project: Tika Issue Type: Bug Reporter: Tim Allison Assignee: Tim Allison Priority: Minor Fix For: 1.6 According to: http://www.pdfa.org/2011/08/pdfa-metadata-xmp-rdf-dublin-core, dc:title, dc:description and dc:rights should allow multiple values because of language alternatives. Unless anyone objects in the next few days, I'll switch those to Property.toInternalTextBag() from Property.toInternalText(). I'll also modify PDFParser to extract dc:rights. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1294) Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs
[ https://issues.apache.org/jira/browse/TIKA-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995474#comment-13995474 ] Ray Gauss II commented on TIKA-1294: We ran into this exact issue recently and there is another method to achieve the same result without changing Tika code. In {{ParsingEmbeddedDocumentExtractor.shouldParseEmbedded}} the {{ParseContext}} is checked for a {{DocumentSelector}}. Since that extractor seems to be the only place that type is checked for (perhaps {{EmbeddedDocumentSelector}} would be a more appropriate name?) you can create one that suits your needs and set it as the document selector value in the {{ParseContext}}. In our case we created a simple {{MediaTypeDisablingDocumentSelector}} that holds a list of {{disabledMediaTypes}}. See [{{TikaGUI}}|http://svn.apache.org/repos/asf/tika/trunk/tika-app/src/main/java/org/apache/tika/gui/TikaGUI.java] and its {{ImageDocumentSelector}} as a general example of document selector use. Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs --- Key: TIKA-1294 URL: https://issues.apache.org/jira/browse/TIKA-1294 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Trivial Attachments: TIKA-1294.patch TIKA-1268 added the capability to extract embedded images as regular embedded resources...a great feature! However, for some use cases, it might not be desirable to extract those types of embedded resources. I see two ways of allowing the client to choose whether or not to extract those images: 1) set a value in the metadata for the extracted images that identifies them as embedded PDXObjectImages vs regular image attachments. The client can then choose not to process embedded resources with a given metadata value. 2) allow the client to set a parameter in the PDFConfig object. My initial proposal is to go with option 2, and I'll attach a patch shortly. -- This message was sent by Atlassian JIRA (v6.2#6252)
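A {{MediaTypeDisablingDocumentSelector}} along the lines described above might look like the sketch below. Tika's real {{DocumentSelector}} receives a {{Metadata}} object; here a plain {{Map}} stands in for it, and the {{Content-Type}} key name is an assumption for illustration.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of a MediaTypeDisablingDocumentSelector as described above.
// Tika's DocumentSelector takes a Metadata object; a Map stands in here,
// and "Content-Type" is the assumed key holding the media type.
public class MediaTypeDisablingSelector {
    private final Set<String> disabledMediaTypes = new HashSet<String>();

    public void disableMediaType(String mediaType) {
        disabledMediaTypes.add(mediaType);
    }

    // Mirrors DocumentSelector.select: false means "do not parse this one".
    public boolean select(Map<String, String> metadata) {
        String mediaType = metadata.get("Content-Type");
        return mediaType == null || !disabledMediaTypes.contains(mediaType);
    }
}
```

Setting such a selector in the {{ParseContext}} then disables the unwanted embedded types without any change to Tika itself, which is the point of the comment.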
[jira] [Commented] (TIKA-1294) Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs
[ https://issues.apache.org/jira/browse/TIKA-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13997500#comment-13997500 ] Ray Gauss II commented on TIKA-1294: I saw similar problematic resource consumption as well, which was the reason for figuring out how to disable this stuff :) Perhaps a generic indication of why this embedded object is being parsed would be useful to have in the metadata object passed to the {{EmbeddedDocumentExtractor}}, something like an {{EmbeddedObjectContext}} enum with {{INLINE}} and {{ATTACHMENT}} options, which the {{EmbeddedDocumentExtractor}} (and in most cases that means the {{DocumentSelector}}) could use to determine whether to parse on a per-object basis? Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs --- Key: TIKA-1294 URL: https://issues.apache.org/jira/browse/TIKA-1294 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Trivial Attachments: TIKA-1294.patch TIKA-1268 added the capability to extract embedded images as regular embedded resources...a great feature! However, for some use cases, it might not be desirable to extract those types of embedded resources. I see two ways of allowing the client to choose whether or not to extract those images: 1) set a value in the metadata for the extracted images that identifies them as embedded PDXObjectImages vs regular image attachments. The client can then choose not to process embedded resources with a given metadata value. 2) allow the client to set a parameter in the PDFConfig object. My initial proposal is to go with option 2, and I'll attach a patch shortly. -- This message was sent by Atlassian JIRA (v6.2#6252)
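The proposed enum and its use in a per-object decision could be sketched as below. The enum name and values follow the comment's suggestion; everything else, including the {{shouldParse}} helper, is hypothetical.

```java
// Sketch of the proposed EmbeddedObjectContext enum: the extractor would
// record why an embedded object is being parsed, and a selector could then
// skip inline images while keeping true attachments. The shouldParse helper
// is a hypothetical stand-in for a DocumentSelector decision.
public class EmbeddedContextDemo {
    public enum EmbeddedObjectContext { INLINE, ATTACHMENT }

    public static boolean shouldParse(EmbeddedObjectContext context,
                                      boolean skipInlineImages) {
        if (context == EmbeddedObjectContext.INLINE && skipInlineImages) {
            return false;
        }
        return true;
    }
}
```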
[jira] [Commented] (TIKA-1295) Make some Dublin Core items multi-valued
[ https://issues.apache.org/jira/browse/TIKA-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13997478#comment-13997478 ] Ray Gauss II commented on TIKA-1295: bq. I see that there is an ALT PropertyType. Are there any plans to implement that (or did I miss the implementation somewhere) Not sure. On first glance I don't see it anywhere, nor any use of {{ValueType.LOCALE}}. I think we'd need a design discussion on how best to implement multilingual properties, likely through some suffixing of property keys if we don't change the underlying metadata structure, or perhaps that discussion has already taken place? Make some Dublin Core items multi-valued Key: TIKA-1295 URL: https://issues.apache.org/jira/browse/TIKA-1295 Project: Tika Issue Type: Bug Reporter: Tim Allison Assignee: Tim Allison Priority: Minor Fix For: 1.6 According to: http://www.pdfa.org/2011/08/pdfa-metadata-xmp-rdf-dublin-core, dc:title, dc:description and dc:rights should allow multiple values because of language alternatives. Unless anyone objects in the next few days, I'll switch those to Property.toInternalTextBag() from Property.toInternalText(). I'll also modify PDFParser to extract dc:rights. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TIKA-1294) Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs
[ https://issues.apache.org/jira/browse/TIKA-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995960#comment-13995960 ] Ray Gauss II commented on TIKA-1294: bq. Can your MediaTypeDisablingDocumentSelector tell the difference between a jpeg that was attached to a PDF (basic attachment) and one that was derived from a PDXObjectImage? If by basic attachment you mean those defined in {{PDEmbeddedFilesNameTreeNode}}, then not exactly. Both {{PDF2XHTML.extractImages}} and {{PDF2XHTML.extractEmbeddedDocuments}} end up using the same {{getEmbeddedDocumentExtractor}} (a {{ParsingEmbeddedDocumentExtractor}} by default) and use the same {{DocumentSelector}} in the calls to {{extractor.shouldParseEmbedded(metadata)}}, but neither sets any special metadata keys indicating 'attached' vs 'embedded' so document selectors aren't able to explicitly distinguish. However, the {{PDXObjectImage}} resources *only* get the media type set in the metadata object while the {{PDEmbeddedFilesNameTreeNode}} resources get media type, name, and length set, so you could potentially check for their presence to distinguish. Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs --- Key: TIKA-1294 URL: https://issues.apache.org/jira/browse/TIKA-1294 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Trivial Attachments: TIKA-1294.patch TIKA-1268 added the capability to extract embedded images as regular embedded resources...a great feature! However, for some use cases, it might not be desirable to extract those types of embedded resources. I see two ways of allowing the client to choose whether or not to extract those images: 1) set a value in the metadata for the extracted images that identifies them as embedded PDXObjectImages vs regular image attachments. The client can then choose not to process embedded resources with a given metadata value. 2) allow the client to set a parameter in the PDFConfig object. 
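The key-presence heuristic from the comment above could be expressed roughly as below. {{Metadata}} is mocked as a {{Map}}, and the key names are illustrative, not necessarily Tika's exact metadata keys.

```java
import java.util.Map;

// Sketch of the heuristic described above: PDXObjectImage resources only
// get a media type set, while PDEmbeddedFilesNameTreeNode attachments also
// get a name and length, so the presence of those extra keys hints at a
// true attachment. Metadata is mocked as a Map; key names are illustrative.
public class EmbeddedKindGuesser {
    public static boolean looksLikeInlineImage(Map<String, String> metadata) {
        return metadata.containsKey("Content-Type")
                && !metadata.containsKey("resourceName")
                && !metadata.containsKey("Content-Length");
    }
}
```

As the comment notes, this is a workaround by inference; an explicit metadata marker would be more robust.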
My initial proposal is to go with option 2, and I'll attach a patch shortly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (TIKA-1278) Expose PDF Avg Char and Spacing Tolerance Config Params
[ https://issues.apache.org/jira/browse/TIKA-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995298#comment-13995298 ] Ray Gauss II edited comment on TIKA-1278 at 5/12/14 5:39 PM: - Hi [~talli...@apache.org], I thought about adding to {{PDFParser.properties}} but decided against it since PDFBox could change the default values or change the properties' scale or use, and if we weren't aware of that change we'd be inadvertently overriding those defaults. Similarly with {{PDFParserConfig.configure}}, PDFBox's defaults seem to work well for most people. We can certainly reconsider setting those defaults and/or adding other config if there are particular parameters people would find useful. was (Author: rgauss): Hi [~tallison], I thought about adding to {{PDFParser.properties}} but decided against it since PDFBox could change the default values or change the properties' scale or use, and if we weren't aware of that change we'd be inadvertently overriding those defaults. Similarly with {{PDFParserConfig.configure}}, PDFBox's defaults seem to work well for most people. We can certainly reconsider setting those defaults and/or adding other config if there are particular parameters people would find useful. Expose PDF Avg Char and Spacing Tolerance Config Params --- Key: TIKA-1278 URL: https://issues.apache.org/jira/browse/TIKA-1278 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.5 Reporter: Ray Gauss II Assignee: Ray Gauss II Fix For: 1.6 {{PDFParserConfig}} should allow for override of PDFBox's {{averageCharTolerance}} and {{spacingTolerance}} settings as noted by a TODO comment in {{PDF2XHTML}}. Additionally, {{PDF2XHTML}}'s use of {{PDFParserConfig}} should be changed slightly to allow for extension of that config class and its configuration behavior. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (TIKA-1278) Expose PDF Avg Char and Spacing Tolerance Config Params
Ray Gauss II created TIKA-1278: -- Summary: Expose PDF Avg Char and Spacing Tolerance Config Params Key: TIKA-1278 URL: https://issues.apache.org/jira/browse/TIKA-1278 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.5 Reporter: Ray Gauss II Assignee: Ray Gauss II Fix For: 1.6 {{PDFParserConfig}} should allow for override of PDFBox's {{averageCharTolerance}} and {{spacingTolerance}} settings as noted by a TODO comment in {{PDF2XHTML}}. Additionally, {{PDF2XHTML}}'s use of {{PDFParserConfig}} should be changed slightly to allow for extension of that config class and its configuration behavior. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (TIKA-1278) Expose PDF Avg Char and Spacing Tolerance Config Params
[ https://issues.apache.org/jira/browse/TIKA-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Gauss II updated TIKA-1278: --- Description: {{PDFParserConfig}} should allow for override of PDFBox's {{averageCharTolerance}} and {{spacingTolerance}} settings as noted by a TODO comment in {{PDF2XHTML}}. Additionally, {{PDF2XHTML}}'s use of {{PDFParserConfig}} should be changed slightly to allow for extension of that config class and its configuration behavior. was: {{PDFParserConfig}} should allow for override of PDFBox's {{averageCharTolerance}} and {{spacingTolerance}} settings as noted by a TODO comment in {{PDF2XHTML}}. Additionally, {{PDF2XHTML}}'s use of {{PDFParserConfig}} should be changed slightly to allow for extension of that config class and it's configuration behavior. Expose PDF Avg Char and Spacing Tolerance Config Params --- Key: TIKA-1278 URL: https://issues.apache.org/jira/browse/TIKA-1278 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.5 Reporter: Ray Gauss II Assignee: Ray Gauss II Fix For: 1.6 {{PDFParserConfig}} should allow for override of PDFBox's {{averageCharTolerance}} and {{spacingTolerance}} settings as noted by a TODO comment in {{PDF2XHTML}}. Additionally, {{PDF2XHTML}}'s use of {{PDFParserConfig}} should be changed slightly to allow for extension of that config class and its configuration behavior. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (TIKA-1278) Expose PDF Avg Char and Spacing Tolerance Config Params
[ https://issues.apache.org/jira/browse/TIKA-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Gauss II resolved TIKA-1278. Resolution: Fixed Resolved in r1589722. Expose PDF Avg Char and Spacing Tolerance Config Params --- Key: TIKA-1278 URL: https://issues.apache.org/jira/browse/TIKA-1278 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.5 Reporter: Ray Gauss II Assignee: Ray Gauss II Fix For: 1.6 {{PDFParserConfig}} should allow for override of PDFBox's {{averageCharTolerance}} and {{spacingTolerance}} settings as noted by a TODO comment in {{PDF2XHTML}}. Additionally, {{PDF2XHTML}}'s use of {{PDFParserConfig}} should be changed slightly to allow for extension of that config class and its configuration behavior. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (TIKA-1278) Expose PDF Avg Char and Spacing Tolerance Config Params
[ https://issues.apache.org/jira/browse/TIKA-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13979700#comment-13979700 ] Ray Gauss II edited comment on TIKA-1278 at 4/24/14 1:31 PM: - Resolved in r1589722. The setting of {{PDF2XHTML}} params was also moved from {{PDF2XHTML.process}} to a new {{PDFParserConfig.configure}} method which should allow developers to extend {{PDFParserConfig}} for custom behavior. was (Author: rgauss): Resolved in r1589722. Expose PDF Avg Char and Spacing Tolerance Config Params --- Key: TIKA-1278 URL: https://issues.apache.org/jira/browse/TIKA-1278 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.5 Reporter: Ray Gauss II Assignee: Ray Gauss II Fix For: 1.6 {{PDFParserConfig}} should allow for override of PDFBox's {{averageCharTolerance}} and {{spacingTolerance}} settings as noted by a TODO comment in {{PDF2XHTML}}. Additionally, {{PDF2XHTML}}'s use of {{PDFParserConfig}} should be changed slightly to allow for extension of that config class and its configuration behavior. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Reopened] (TIKA-1279) Missing return lines at output of SourceCodeParser
[ https://issues.apache.org/jira/browse/TIKA-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Gauss II reopened TIKA-1279: Assignee: Hong-Thai Nguyen [~thaichat04], I believe we still have to support Java 6 and {{System.lineSeparator()}} appears to have been added in Java 7. I think {{System.getProperty("line.separator")}} would be equivalent. Missing return lines at output of SourceCodeParser -- Key: TIKA-1279 URL: https://issues.apache.org/jira/browse/TIKA-1279 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.5 Reporter: Hong-Thai Nguyen Assignee: Hong-Thai Nguyen Priority: Trivial Fix For: 1.6 xhtml output is on a single line. -- This message was sent by Atlassian JIRA (v6.2#6252)
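For reference, a minimal sketch of the Java 6-compatible replacement suggested above (the wrapper class is illustrative; the property call itself is standard Java):

```java
// Java 6-compatible equivalent of Java 7's System.lineSeparator():
// read the "line.separator" system property directly.
public class LineSep {
    public static String lineSeparator() {
        return System.getProperty("line.separator");
    }
}
```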
[jira] [Resolved] (TIKA-1151) Maven Build Should Automatically Produce test-jar Artifacts
[ https://issues.apache.org/jira/browse/TIKA-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Gauss II resolved TIKA-1151. Resolution: Fixed Resolved in r1580887. Maven Build Should Automatically Produce test-jar Artifacts --- Key: TIKA-1151 URL: https://issues.apache.org/jira/browse/TIKA-1151 Project: Tika Issue Type: Improvement Components: packaging Reporter: Ray Gauss II Assignee: Ray Gauss II The Maven build should be updated to produce test jar artifacts for appropriate sub-projects (see below) such that developers can extend test classes by adding the {{test-jar}} artifact as a dependency, i.e.:
{code}
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.6-SNAPSHOT</version>
  <type>test-jar</type>
  <scope>test</scope>
</dependency>
{code}
The following sub-projects contain tests that developers might want to extend and their corresponding {{pom.xml}} should have the [attached tests|http://maven.apache.org/guides/mini/guide-attached-tests.html] added: - tika-app - tika-core - tika-parsers - tika-server - tika-xmp -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (TIKA-1151) Maven Build Should Automatically Produce test-jar Artifacts
[ https://issues.apache.org/jira/browse/TIKA-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Gauss II updated TIKA-1151: --- Fix Version/s: 1.6 Maven Build Should Automatically Produce test-jar Artifacts --- Key: TIKA-1151 URL: https://issues.apache.org/jira/browse/TIKA-1151 Project: Tika Issue Type: Improvement Components: packaging Reporter: Ray Gauss II Assignee: Ray Gauss II Fix For: 1.6 The Maven build should be updated to produce test jar artifacts for appropriate sub-projects (see below) such that developers can extend test classes by adding the {{test-jar}} artifact as a dependency, i.e.:
{code}
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.6-SNAPSHOT</version>
  <type>test-jar</type>
  <scope>test</scope>
</dependency>
{code}
The following sub-projects contain tests that developers might want to extend and their corresponding {{pom.xml}} should have the [attached tests|http://maven.apache.org/guides/mini/guide-attached-tests.html] added: - tika-app - tika-core - tika-parsers - tika-server - tika-xmp -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (TIKA-1151) Maven Build Should Automatically Produce test-jar Artifacts
[ https://issues.apache.org/jira/browse/TIKA-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II updated TIKA-1151:
-------------------------------

Description:
The Maven build should be updated to produce test jar artifacts for the appropriate sub-projects (see below) so that developers can extend test classes by adding the {{test-jar}} artifact as a dependency, i.e.:
{code}
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.5-SNAPSHOT</version>
  <type>test-jar</type>
  <scope>test</scope>
</dependency>
{code}
The following sub-projects contain tests that developers might want to extend, and their corresponding {{pom.xml}} should have the [attached tests|http://maven.apache.org/guides/mini/guide-attached-tests.html] configuration added:
- tika-app
- tika-core
- tika-parsers
- tika-server
- tika-xmp

was: the same description, except that tika-bundle was also listed among the sub-projects.

Maven Build Should Automatically Produce test-jar Artifacts
-----------------------------------------------------------
            Key: TIKA-1151
            URL: https://issues.apache.org/jira/browse/TIKA-1151
        Project: Tika
     Issue Type: Improvement
     Components: packaging
       Reporter: Ray Gauss II
       Assignee: Ray Gauss II

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
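For reference, the producer side of this change is the configuration described in the linked attached-tests guide: each sub-project binds the {{test-jar}} goal of the maven-jar-plugin. A minimal sketch (plugin version omitted):

```xml
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-jar-plugin</artifactId>
      <executions>
        <execution>
          <goals>
            <!-- packages src/test classes into an additional *-tests.jar artifact -->
            <goal>test-jar</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```

With that in place, downstream projects can depend on the tests jar using the {{&lt;type&gt;test-jar&lt;/type&gt;}} dependency shown above.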
[jira] [Updated] (TIKA-1151) Maven Build Should Automatically Produce test-jar Artifacts
[ https://issues.apache.org/jira/browse/TIKA-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II updated TIKA-1151:
-------------------------------

Description:
The Maven build should be updated to produce test jar artifacts for the appropriate sub-projects (see below) so that developers can extend test classes by adding the {{test-jar}} artifact as a dependency, i.e.:
{code}
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.6-SNAPSHOT</version>
  <type>test-jar</type>
  <scope>test</scope>
</dependency>
{code}
The following sub-projects contain tests that developers might want to extend, and their corresponding {{pom.xml}} should have the [attached tests|http://maven.apache.org/guides/mini/guide-attached-tests.html] configuration added:
- tika-app
- tika-core
- tika-parsers
- tika-server
- tika-xmp

was: the same description, but with version 1.5-SNAPSHOT in the dependency snippet.

Maven Build Should Automatically Produce test-jar Artifacts
-----------------------------------------------------------
            Key: TIKA-1151
            URL: https://issues.apache.org/jira/browse/TIKA-1151
        Project: Tika
     Issue Type: Improvement
     Components: packaging
       Reporter: Ray Gauss II
       Assignee: Ray Gauss II
[jira] [Commented] (TIKA-1151) Maven Build Should Automatically Produce test-jar Artifacts
[ https://issues.apache.org/jira/browse/TIKA-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907100#comment-13907100 ]

Ray Gauss II commented on TIKA-1151:
------------------------------------

This will create a few artifacts on the larger side, notably:

||Artifact||Size||
|tika-parsers-1.6-SNAPSHOT-tests.jar|33MB|
|tika-server-1.6-SNAPSHOT-tests.jar|6.8MB|

Not huge, but I thought I'd double-check that no one has any issues with that before committing.

Maven Build Should Automatically Produce test-jar Artifacts
-----------------------------------------------------------
Key: TIKA-1151
URL: https://issues.apache.org/jira/browse/TIKA-1151
Re: Extract thumbnail from openxml office files
Hi Hong-Thai,

It’s certainly worth investigating. Several other formats can have embedded thumbnails as well, so we could implement a generic thumbnail property. We could probably store it as something like a Base64-encoded string, but we’d likely want to place limits on the size, and we may need a thumbnail internet media type field as well to assist in decoding.

Unless others feel differently, I would say open a JIRA where we can start discussing the design of such a feature.

Thanks!
Ray

On January 8, 2014 at 5:36:32 AM, Hong-Thai Nguyen (hong-thai.ngu...@polyspot.com) wrote:

Hi all,

I want to extract the thumbnail image included in Open XML office files. Apparently, we can do it with openxml4j: http://openxmldeveloper.org/blog/b/openxmldeveloper/archive/2006/11/21/openxmlandjava.aspx

The question is: should we integrate the thumbnail into the default metadata list of the OOXML parsing result?

Thanks
Hong-Thai
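As a rough sketch of the storage idea discussed above — a Base64 string with a size cap. The class name, the helper method, and the 64 KB limit are all illustrative assumptions for this thread, not existing Tika properties or API:

```java
import java.util.Base64;

// Hypothetical sketch: store an extracted thumbnail as a Base64 string,
// enforcing a size limit so large thumbnails don't bloat the metadata.
// A companion field would record the thumbnail's media type for decoding.
public class ThumbnailMetadataSketch {
    static final int MAX_THUMBNAIL_BYTES = 64 * 1024; // illustrative cap

    /** Returns the Base64 encoding of the thumbnail, or null if absent/too large. */
    public static String encodeThumbnail(byte[] thumbnailBytes) {
        if (thumbnailBytes == null || thumbnailBytes.length > MAX_THUMBNAIL_BYTES) {
            return null; // reject rather than store an oversized value
        }
        return Base64.getEncoder().encodeToString(thumbnailBytes);
    }
}
```

The consumer would Base64-decode the string and interpret the bytes according to the stored media type.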
[jira] [Assigned] (TIKA-1177) Add Matroska (mkv, mka) format detection
[ https://issues.apache.org/jira/browse/TIKA-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II reassigned TIKA-1177:
----------------------------------

Assignee: Ray Gauss II

Add Matroska (mkv, mka) format detection
----------------------------------------
             Key: TIKA-1177
             URL: https://issues.apache.org/jira/browse/TIKA-1177
         Project: Tika
      Issue Type: Improvement
      Components: mime
Affects Versions: 1.4
        Reporter: Boris Naguet
        Assignee: Ray Gauss II
        Priority: Minor

There's no mimetype detection for the Matroska format, although it's a popular video format. Here is some code I added to my custom mimetypes to detect them:
{code}
<mime-type type="video/x-matroska">
  <glob pattern="*.mkv"/>
  <magic priority="40">
    <match value="0x1A45DFA3934282886d6174726f736b61" type="string" offset="0"/>
  </magic>
</mime-type>
<mime-type type="audio/x-matroska">
  <glob pattern="*.mka"/>
</mime-type>
{code}
I found the signature for mkv on http://www.garykessler.net/library/file_sigs.html. I was not able to find it clearly for mka, but detection by filename is still useful. The full spec is available here: http://matroska.org/technical/specs/index.html. Maybe it's a bit more complex than this constant magic, but it works on my test files.

--
This message was sent by Atlassian JIRA
(v6.1#6144)
[jira] [Resolved] (TIKA-1177) Add Matroska (mkv, mka) format detection
[ https://issues.apache.org/jira/browse/TIKA-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II resolved TIKA-1177.
--------------------------------

   Resolution: Fixed
Fix Version/s: 1.5

Unfortunately that magic doesn't seem to be required in all MKV files. I tried several utilities to convert various sources to MKV, and none of the results contained that magic. A magic value of {{0x1A45DFA3}} is present, but that's also present in WebM, which is extended from Matroska.

I've added the Matroska mime-types based on just the extension for now and also added the WebM mime-type. We can open other issues, linked to this one, for data detection of MKV and WebM files if need be.

Resolved in r1529260.

Add Matroska (mkv, mka) format detection
----------------------------------------
Key: TIKA-1177
URL: https://issues.apache.org/jira/browse/TIKA-1177
Fix For: 1.5
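As a sketch of the constraint described above: only the four-byte EBML header {{0x1A45DFA3}} is reliably present at the start of Matroska files, and WebM shares it, so a byte-prefix check alone cannot distinguish the two (the DocType string inside the EBML header would be needed for that). Class and method names below are illustrative, not Tika API:

```java
import java.util.Arrays;

// Illustrative check for the 4-byte EBML header shared by MKV and WebM.
// Matching this prefix tells you "some EBML container", not which one —
// which is why extension-based detection was used for MKV vs WebM here.
public class EbmlMagicSketch {
    private static final byte[] EBML_MAGIC =
            {(byte) 0x1A, (byte) 0x45, (byte) 0xDF, (byte) 0xA3};

    /** True if the buffer starts with the EBML header (MKV, MKA, WebM, ...). */
    public static boolean hasEbmlHeader(byte[] prefix) {
        return prefix != null
                && prefix.length >= 4
                && Arrays.equals(Arrays.copyOf(prefix, 4), EBML_MAGIC);
    }
}
```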
[jira] [Resolved] (TIKA-1179) A corrupt mp3 file can cause an infinite loop in Mp3Parser
[ https://issues.apache.org/jira/browse/TIKA-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II resolved TIKA-1179.
--------------------------------

Resolution: Cannot Reproduce
  Assignee: Ray Gauss II

I've just confirmed the described behavior in Tika 1.4; however, it appears the file is parsed just fine in 1.5! You can verify by downloading a 1.5 snapshot of {{tika-app}} ([current link|https://repository.apache.org/content/groups/snapshots/org/apache/tika/tika-app/1.5-SNAPSHOT/tika-app-1.5-20130927.201341-30.jar]), running the app, i.e.:
{code}
java -jar tika-app-1.5-20130927.201341-30.jar
{code}
and dropping {{corrupt.mp3}} onto the app window.

A corrupt mp3 file can cause an infinite loop in Mp3Parser
----------------------------------------------------------
             Key: TIKA-1179
             URL: https://issues.apache.org/jira/browse/TIKA-1179
         Project: Tika
      Issue Type: Bug
      Components: parser
Affects Versions: 1.4
        Reporter: Marius Dumitru Florea
        Assignee: Ray Gauss II
         Fix For: 1.5
     Attachments: corrupt.mp3

I have a thread that indexes (among other things) files using Apache Solr. This thread hangs (still running but making no progress) when trying to extract metadata from the mp3 file attached to this issue.

Here are a couple of thread dumps taken at various moments:
{noformat}
XWiki Solr index thread daemon prio=10 tid=0x03b72800 nid=0x64b5 runnable [0x7f46f4617000]
   java.lang.Thread.State: RUNNABLE
    at org.apache.commons.io.input.AutoCloseInputStream.close(AutoCloseInputStream.java:63)
    at org.apache.commons.io.input.AutoCloseInputStream.afterRead(AutoCloseInputStream.java:77)
    at org.apache.commons.io.input.ProxyInputStream.read(ProxyInputStream.java:99)
    at java.io.BufferedInputStream.fill(Unknown Source)
    at java.io.BufferedInputStream.read1(Unknown Source)
    at java.io.BufferedInputStream.read(Unknown Source)
    - locked <0xcb7094e8> (a java.io.BufferedInputStream)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
    at java.io.FilterInputStream.read(Unknown Source)
    at org.apache.tika.io.TailStream.read(TailStream.java:117)
    at org.apache.tika.io.TailStream.skip(TailStream.java:140)
    at org.apache.tika.parser.mp3.MpegStream.skipStream(MpegStream.java:283)
    at org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:160)
    at org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:193)
    at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.Tika.parseToString(Tika.java:380)
    ...
{noformat}
{noformat}
XWiki Solr index thread daemon prio=10 tid=0x03b72800 nid=0x64b5 runnable [0x7f46f4618000]
   java.lang.Thread.State: RUNNABLE
    at org.apache.tika.io.TailStream.skip(TailStream.java:133)
    at org.apache.tika.parser.mp3.MpegStream.skipStream(MpegStream.java:283)
    at org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:160)
    at org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:193)
    at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.Tika.parseToString(Tika.java:380)
    ...
{noformat}
{noformat}
XWiki Solr index thread daemon prio=10 tid=0x03b72800 nid=0x64b5 runnable [0x7f46f4617000]
   java.lang.Thread.State: RUNNABLE
    at java.io.BufferedInputStream.read1(Unknown Source)
    at java.io.BufferedInputStream.read(Unknown Source)
    - locked <0xcb1be170> (a java.io.BufferedInputStream)
    at org.apache.tika.io.ProxyInputStream.read(ProxyInputStream.java:99)
    at java.io.FilterInputStream.read(Unknown Source)
    at org.apache.tika.io.TailStream.read(TailStream.java:117)
    at org.apache.tika.io.TailStream.skip(TailStream.java:140)
    at org.apache.tika.parser.mp3.MpegStream.skipStream(MpegStream.java:283)
    at org.apache.tika.parser.mp3.MpegStream.skipFrame(MpegStream.java:160)
    at org.apache.tika.parser.mp3.Mp3Parser.getAllTagHandlers(Mp3Parser.java:193)
    at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:71)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242
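The dumps above show {{TailStream.skip}} being called over and over. As a general illustration of this failure mode (not the actual Tika code or the 1.5 fix): {{InputStream.skip()}} is allowed to return 0 without having reached end of stream, so a naive "loop until n bytes skipped" can spin forever on a truncated file unless it probes for EOF:

```java
import java.io.IOException;
import java.io.InputStream;

// Defensive skip pattern: fall back to read() when skip() makes no
// progress, so end-of-stream produces a definitive -1 and breaks the loop.
public class SafeSkip {
    /** Skips up to n bytes; returns the count actually skipped (stops at EOF). */
    public static long skipFully(InputStream in, long n) throws IOException {
        long remaining = n;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped > 0) {
                remaining -= skipped;
            } else {
                // skip() made no progress: probe with read() to detect EOF
                if (in.read() == -1) {
                    break; // end of stream reached; avoid an infinite loop
                }
                remaining--; // the probe consumed one byte
            }
        }
        return n - remaining;
    }
}
```

A loop that instead retried {{skip()}} unconditionally would never terminate once the stream is exhausted, which matches the hang described in this issue.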
[jira] [Assigned] (TIKA-1170) Insufficiently specific magic for binary image/cgm files
[ https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II reassigned TIKA-1170:
----------------------------------

Assignee: Ray Gauss II

Insufficiently specific magic for binary image/cgm files
--------------------------------------------------------
             Key: TIKA-1170
             URL: https://issues.apache.org/jira/browse/TIKA-1170
         Project: Tika
      Issue Type: Bug
      Components: mime
Affects Versions: 1.4
        Reporter: Andrew Jackson
        Assignee: Ray Gauss II
        Priority: Minor
     Attachments: 0001-Added-CGM-test-file-test-and-improved-magic.patch, plotutils-example.cgm

I've been running Tika against a large corpus of web archive files, and I'm seeing a number of false positives for image/cgm. The Tika magic is
{code}
<match value="BEGMF" type="string" offset="0"/>
<match value="0x0020" mask="0xffe0" type="string" offset="0"/>
{code}
The issue seems to be that the second magic matcher is not very specific, e.g. matching files that start with 0x002a. To be fair, this is only c. 700 false matches out of 300 million resources, but it would be nice if this could be tightened up.

Looking at the PRONOM signatures
* http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1048&strPageToDisplay=signatures
* http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1049&strPageToDisplay=signatures
* http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1050&strPageToDisplay=signatures
* http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1051&strPageToDisplay=signatures

it seems we have a variable-position marker that changes slightly for each version. Therefore, a more robust signature would be:
{code}
<match value="BEGMF" type="string" offset="0"/>
<match value="0x0020" mask="0xffe0" type="string" offset="0">
  <match value="0x10220001" type="string" offset="2:64"/>
  <match value="0x10220002" type="string" offset="2:64"/>
  <match value="0x10220003" type="string" offset="2:64"/>
  <match value="0x10220004" type="string" offset="2:64"/>
</match>
{code}
Here I have assumed the filename part of the CGM file will be less than 64 characters long. Could this magic be considered for inclusion?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
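To illustrate why the second matcher is so loose: under mask 0xffe0, any leading big-endian 16-bit word from 0x0020 through 0x003f matches. That range corresponds to a binary CGM BEGIN METAFILE element with any short parameter length, but it also matches unrelated content such as a file starting with 0x002a. A minimal sketch of the masked comparison (hypothetical helper, not Tika code):

```java
// Sketch of the two-byte masked magic check: read the first two bytes as
// a big-endian 16-bit word, keep the top 11 bits (mask 0xFFE0), and
// compare against 0x0020. 32 distinct leading words satisfy this test,
// hence the false positives reported in this issue.
public class CgmMagicSketch {
    public static boolean matchesBeginMetafile(byte b0, byte b1) {
        int word = ((b0 & 0xFF) << 8) | (b1 & 0xFF); // big-endian 16-bit word
        return (word & 0xFFE0) == 0x0020;
    }
}
```

The proposed signature narrows this by additionally requiring one of the version-specific markers (0x1022000n) within the first 64 bytes.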
[jira] [Resolved] (TIKA-1170) Insufficiently specific magic for binary image/cgm files
[ https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II resolved TIKA-1170.
--------------------------------

   Resolution: Fixed
Fix Version/s: 1.5

Added in r1519664. Thanks!

Insufficiently specific magic for binary image/cgm files
--------------------------------------------------------
Key: TIKA-1170
URL: https://issues.apache.org/jira/browse/TIKA-1170
[jira] [Commented] (TIKA-1170) Insufficiently specific magic for binary image/cgm files
[ https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1375#comment-1375 ]

Ray Gauss II commented on TIKA-1170:
------------------------------------

My mistake, that's an artifact of me manually applying the git patch. It does, however, seem to indicate that we should have a unit test for the false positives. Do you have a file which demonstrates that problem?

Insufficiently specific magic for binary image/cgm files
--------------------------------------------------------
Key: TIKA-1170
URL: https://issues.apache.org/jira/browse/TIKA-1170
[jira] [Reopened] (TIKA-1170) Insufficiently specific magic for binary image/cgm files
[ https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II reopened TIKA-1170:
--------------------------------

Insufficiently specific magic for binary image/cgm files
--------------------------------------------------------
Key: TIKA-1170
URL: https://issues.apache.org/jira/browse/TIKA-1170
[jira] [Resolved] (TIKA-1170) Insufficiently specific magic for binary image/cgm files
[ https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II resolved TIKA-1170.
--------------------------------

Resolution: Fixed

Resolved in r1519792. SVN did not like the html extension on the problem file. Thanks again.

Insufficiently specific magic for binary image/cgm files
--------------------------------------------------------
        Key: TIKA-1170
        URL: https://issues.apache.org/jira/browse/TIKA-1170
Attachments: 0001-Added-CGM-test-file-test-and-improved-magic.patch, 0002-Added-example-malformed-HTML-file-that-was-being-mis.patch, plotutils-example.cgm
[jira] [Commented] (TIKA-1170) Insufficiently specific magic for binary image/cgm files
[ https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13757000#comment-13757000 ]

Ray Gauss II commented on TIKA-1170:
------------------------------------

Yes, but in this particular case I thought it might be better to explicitly change the file name so other developers don't fix the media type for that file in the future.

Insufficiently specific magic for binary image/cgm files
--------------------------------------------------------
Key: TIKA-1170
URL: https://issues.apache.org/jira/browse/TIKA-1170
[jira] [Assigned] (TIKA-1166) FLVParser NullPointerException
[ https://issues.apache.org/jira/browse/TIKA-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II reassigned TIKA-1166:
----------------------------------

Assignee: Ray Gauss II

FLVParser NullPointerException
------------------------------
              Key: TIKA-1166
              URL: https://issues.apache.org/jira/browse/TIKA-1166
          Project: Tika
       Issue Type: Bug
       Components: parser
 Affects Versions: 1.1, 1.2, 1.3, 1.4
      Environment: All
         Reporter: david rapin
         Assignee: Ray Gauss II
           Labels: easyfix
      Attachments: data.mp4
Original Estimate: 10m
Remaining Estimate: 10m

On certain video files, the FLV parser throws an NPE on line 242. The piece of code causing this is the following: https://github.com/apache/tika/blob/1.4/tika-parsers/src/main/java/org/apache/tika/parser/video/FLVParser.java#L242
{noformat}
241: for (Entry<String, Object> entry : extractedMetadata.entrySet()) {
242:     metadata.set(entry.getKey(), entry.getValue().toString());
243: }
{noformat}
which should probably be replaced by something like this:
{noformat}
241: for (Entry<String, Object> entry : extractedMetadata.entrySet()) {
242:     if (entry.getValue() == null) continue;
243:     metadata.set(entry.getKey(), entry.getValue().toString());
244: }
{noformat}
Exception trace:
{noformat}
[root@hermes backend]# java -jar bin/tika-app-1.1.jar -j ./data.mp4
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.video.FLVParser@58d9660d
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:130)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:397)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:101)
Caused by: java.lang.NullPointerException
    at org.apache.tika.parser.video.FLVParser.parse(FLVParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 5 more
{noformat}
[jira] [Resolved] (TIKA-1166) FLVParser NullPointerException
[ https://issues.apache.org/jira/browse/TIKA-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Gauss II resolved TIKA-1166. Resolution: Fixed Fix Version/s: 1.5 I briefly tried a few methods of trimming the problem file's size but none reproduced the issue in the resulting file. Committed a check for null in r1518318. FLVParser NullPointerException -- Key: TIKA-1166 URL: https://issues.apache.org/jira/browse/TIKA-1166 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.1, 1.2, 1.3, 1.4 Environment: All Reporter: david rapin Assignee: Ray Gauss II Labels: easyfix Fix For: 1.5 Attachments: data.mp4 Original Estimate: 10m Remaining Estimate: 10m On certain video files, the FLV parser throws an NPE on line 242. The piece of code causing this is the following: https://github.com/apache/tika/blob/1.4/tika-parsers/src/main/java/org/apache/tika/parser/video/FLVParser.java#L242 {noformat}241: for (EntryString, Object entry : extractedMetadata.entrySet()) { 242: metadata.set(entry.getKey(), entry.getValue().toString()); 243: } {noformat} Which should probably be replaced by something like this: {noformat}241: for (EntryString, Object entry : extractedMetadata.entrySet()) { 242: if (entry.getValue() == null) continue; 243: metadata.set(entry.getKey(), entry.getValue().toString()); 244: } {noformat} Exception trace : {noformat}[root@hermes backend]# java -jar bin/tika-app-1.1.jar -j ./data.mp4 Exception in thread main org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.video.FLVParser@58d9660d at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244) at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:130) at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:397) at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:101) 
Caused by: java.lang.NullPointerException
    at org.apache.tika.parser.video.FLVParser.parse(FLVParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 5 more
{noformat}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
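The null guard committed for this issue can be illustrated outside of Tika with plain collections. This is only a sketch of the pattern, not FLVParser's actual code; the class and metadata keys here are made up for illustration:

```java
import java.util.HashMap;
import java.util.Map;

public class NullSafeCopy {
    // Copy only non-null values into the target map, mirroring the guard
    // proposed for FLVParser line 242: calling toString() on a null value
    // extracted from the container would throw an NPE.
    static Map<String, String> copyNonNull(Map<String, Object> extracted) {
        Map<String, String> metadata = new HashMap<>();
        for (Map.Entry<String, Object> entry : extracted.entrySet()) {
            if (entry.getValue() == null) continue; // skip entries that would NPE
            metadata.put(entry.getKey(), entry.getValue().toString());
        }
        return metadata;
    }

    public static void main(String[] args) {
        Map<String, Object> extracted = new HashMap<>();
        extracted.put("duration", 12.5);
        extracted.put("audiocodecid", null); // some files carry null metadata values
        System.out.println(copyNonNull(extracted));
    }
}
```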
[jira] [Commented] (TIKA-1166) FLVParser NullPointerException
[ https://issues.apache.org/jira/browse/TIKA-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747529#comment-13747529 ] Ray Gauss II commented on TIKA-1166: Thanks. Is there any chance you could get that down to under, say, 50k, while still demonstrating the failure so that we can include it in the dist and create a unit test against it? FLVParser NullPointerException -- Key: TIKA-1166 URL: https://issues.apache.org/jira/browse/TIKA-1166 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.1, 1.2, 1.3, 1.4 Environment: All Reporter: david rapin Labels: easyfix Attachments: data.mp4 Original Estimate: 10m Remaining Estimate: 10m On certain video files, the FLV parser throws an NPE on line 242. The piece of code causing this is the following: https://github.com/apache/tika/blob/1.4/tika-parsers/src/main/java/org/apache/tika/parser/video/FLVParser.java#L242
{noformat}
241: for (Entry<String, Object> entry : extractedMetadata.entrySet()) {
242:     metadata.set(entry.getKey(), entry.getValue().toString());
243: }
{noformat}
Which should probably be replaced by something like this:
{noformat}
241: for (Entry<String, Object> entry : extractedMetadata.entrySet()) {
242:     if (entry.getValue() == null) continue;
243:     metadata.set(entry.getKey(), entry.getValue().toString());
244: }
{noformat}
Exception trace:
{noformat}
[root@hermes backend]# java -jar bin/tika-app-1.1.jar -j ./data.mp4
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.video.FLVParser@58d9660d
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:130)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:397)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:101)
Caused by: java.lang.NullPointerException
    at org.apache.tika.parser.video.FLVParser.parse(FLVParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 5 more
{noformat}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (TIKA-1154) Tika hangs on format detection of malformed HTML file.
[ https://issues.apache.org/jira/browse/TIKA-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13720694#comment-13720694 ] Ray Gauss II commented on TIKA-1154: I've been pushing the metadata-extractor Maven release through Sonatype thus far, but Mr. Noakes has been granted access there [1]. If there's no response to your Google code issue I can push a 2.6.2.1 release that upgrades xercesImpl to 2.11.0 which, on first look, compiles and has no test failures. [1] https://issues.sonatype.org/browse/OSSRH-3948 Tika hangs on format detection of malformed HTML file. -- Key: TIKA-1154 URL: https://issues.apache.org/jira/browse/TIKA-1154 Project: Tika Issue Type: Bug Components: mime Affects Versions: 1.4 Reporter: Andrew Jackson Priority: Minor Attachments: tika-breaker.html We are using Tika on large web archives, which also happen to contain some malformed files. In particular, we found a HTML file with binary characters in the DOCTYPE declaration. This hangs Tika, either embedded or from the command line, during format detection. An example file is attached. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: Tika Core and Parsers Test Artifacts
Hi Ken, Yes, by other tika projects I meant tika-app, tika-bundle, tika-xmp, etc., and yes each sub-project would end up with its own test-jar. It probably makes more sense to just add the plugin to each project individually. Since there's been no opposition to the concept in general I'll create a JIRA issue where we can discuss the details. Regards, Ray On Jul 21, 2013, at 3:25 PM, Ken Krugler kkrugler_li...@transpac.com wrote: Hi Ray, On Jul 18, 2013, at 6:37am, Ray Gauss II wrote: Hi Ken, They recommend <type>test-jar</type> instead of classifier now [1], but yes. Thanks for the reference. Perhaps the other tika projects could benefit from this as well and it could just go into tika-parent's build plugins. By other tika projects do you mean things like tika-app? And if it's in the tika-parent's build plugins, does that mean each sub-project would wind up with its own corresponding test-jar? Thanks, -- Ken [1] http://maven.apache.org/guides/mini/guide-attached-tests.html On Jul 18, 2013, at 9:19 AM, Ken Krugler kkrugler_li...@transpac.com wrote: Hi Ray, On Jul 18, 2013, at 5:14am, Ray Gauss II wrote: I don't recall if we've discussed this already (I did do a brief search and didn't see anything). Is there any opposition to adding test-jar Maven artifacts for tika-core and tika-parsers? Seems like it would be good to allow others to extend from tests there if need be. +1 I assume you're talking about adding a tika-(core|parsers)-<version>-tests.jar, so that we'd pull it in via:
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.4</version>
    <classifier>tests</classifier>
    <scope>test</scope>
</dependency>
-- Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions training Hadoop, Cascading, Cassandra Solr -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions training Hadoop, Cascading, Cassandra Solr
[jira] [Created] (TIKA-1151) Maven Build Should Automatically Produce test-jar Artifacts
Ray Gauss II created TIKA-1151: -- Summary: Maven Build Should Automatically Produce test-jar Artifacts Key: TIKA-1151 URL: https://issues.apache.org/jira/browse/TIKA-1151 Project: Tika Issue Type: Improvement Components: packaging Reporter: Ray Gauss II Assignee: Ray Gauss II The Maven build should be updated to produce test jar artifacts for appropriate sub-projects (see below) such that developers can extend test classes by adding the {{test-jar}} artifact as a dependency, i.e.:
{code}
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.5-SNAPSHOT</version>
    <type>test-jar</type>
    <scope>test</scope>
</dependency>
{code}
The following sub-projects contain tests that developers might want to extend and their corresponding {{pom.xml}} should have the [attached tests|http://maven.apache.org/guides/mini/guide-attached-tests.html] added: - tika-app - tika-bundle - tika-core - tika-parsers - tika-server - tika-xmp -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
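For reference, the plugin configuration the linked "attached tests" guide describes is roughly the following sketch; the exact placement in tika-parent versus each sub-project's {{pom.xml}} is what the issue leaves open:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-jar-plugin</artifactId>
  <executions>
    <execution>
      <goals>
        <!-- packages src/test classes into an additional *-tests.jar artifact -->
        <goal>test-jar</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```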
Tika Core and Parsers Test Artifacts
I don't recall if we've discussed this already (I did do a brief search and didn't see anything). Is there any opposition to adding test-jar Maven artifacts for tika-core and tika-parsers? Seems like it would be good to allow others to extend from tests there if need be.
Re: Tika Core and Parsers Test Artifacts
Hi Ken, They recommend <type>test-jar</type> instead of classifier now [1], but yes. Perhaps the other tika projects could benefit from this as well and it could just go into tika-parent's build plugins. Regards, Ray [1] http://maven.apache.org/guides/mini/guide-attached-tests.html On Jul 18, 2013, at 9:19 AM, Ken Krugler kkrugler_li...@transpac.com wrote: Hi Ray, On Jul 18, 2013, at 5:14am, Ray Gauss II wrote: I don't recall if we've discussed this already (I did do a brief search and didn't see anything). Is there any opposition to adding test-jar Maven artifacts for tika-core and tika-parsers? Seems like it would be good to allow others to extend from tests there if need be. +1 I assume you're talking about adding a tika-(core|parsers)-<version>-tests.jar, so that we'd pull it in via:
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.4</version>
    <classifier>tests</classifier>
    <scope>test</scope>
</dependency>
-- Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions training Hadoop, Cascading, Cassandra Solr
[jira] [Created] (TIKA-1147) Passing a File-Based TikaInputStream to ExternalEmbedder Delete
Ray Gauss II created TIKA-1147: -- Summary: Passing a File-Based TikaInputStream to ExternalEmbedder Delete Key: TIKA-1147 URL: https://issues.apache.org/jira/browse/TIKA-1147 Project: Tika Issue Type: Bug Reporter: Ray Gauss II -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (TIKA-1147) File-Based TikaInputStreams are Deleted by ExternalEmbedder.embed
[ https://issues.apache.org/jira/browse/TIKA-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Gauss II updated TIKA-1147: --- Component/s: metadata Description: When an application using Tika passes {{InputStream}} objects to {{ExternalEmbedder.embed}} the stream is usually read into a temporary file which is then deleted after embedding takes place. However, if the application passes in a file-based {{TikaInputStream}} the embedder ends up dealing directly with the original source file, which is then deleted after embedding takes place. Priority: Critical (was: Major) Affects Version/s: 1.4 Assignee: Ray Gauss II Summary: File-Based TikaInputStreams are Deleted by ExternalEmbedder.embed (was: Passing a File-Based TikaInputStream to ExternalEmbedder Delete) File-Based TikaInputStreams are Deleted by ExternalEmbedder.embed - Key: TIKA-1147 URL: https://issues.apache.org/jira/browse/TIKA-1147 Project: Tika Issue Type: Bug Components: metadata Affects Versions: 1.4 Reporter: Ray Gauss II Assignee: Ray Gauss II Priority: Critical When an application using Tika passes {{InputStream}} objects to {{ExternalEmbedder.embed}} the stream is usually read into a temporary file which is then deleted after embedding takes place. However, if the application passes in a file-based {{TikaInputStream}} the embedder ends up dealing directly with the original source file, which is then deleted after embedding takes place. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (TIKA-1147) File-Based TikaInputStreams are Deleted by ExternalEmbedder.embed
[ https://issues.apache.org/jira/browse/TIKA-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Gauss II resolved TIKA-1147. Resolution: Fixed Fix Version/s: 1.5 Resolved in r1504302. File-Based TikaInputStreams are Deleted by ExternalEmbedder.embed - Key: TIKA-1147 URL: https://issues.apache.org/jira/browse/TIKA-1147 Project: Tika Issue Type: Bug Components: metadata Affects Versions: 1.4 Reporter: Ray Gauss II Assignee: Ray Gauss II Priority: Critical Fix For: 1.5 When an application using Tika passes {{InputStream}} objects to {{ExternalEmbedder.embed}} the stream is usually read into a temporary file which is then deleted after embedding takes place. However, if the application passes in a file-based {{TikaInputStream}} the embedder ends up dealing directly with the original source file, which is then deleted after embedding takes place. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
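The bug above boils down to deleting a file the embedder did not create. A minimal sketch of the ownership check such a fix implies, using hypothetical names (SpoolGuard, fileFor, ownsFile) rather than Tika's actual ExternalEmbedder internals:

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

public class SpoolGuard {
    // Spool the stream to a temp file only when no backing file is known,
    // and remember whether we own (and therefore may delete) the file.
    static File sourceFile;   // set when the caller already has a file-backed stream
    static boolean ownsFile;  // true only for temp files we created ourselves

    static File fileFor(InputStream stream) throws IOException {
        if (sourceFile != null) {
            ownsFile = false;          // caller's original file: never delete it
            return sourceFile;
        }
        File tmp = File.createTempFile("embed", ".tmp");
        Files.copy(stream, tmp.toPath(), StandardCopyOption.REPLACE_EXISTING);
        ownsFile = true;               // our temp copy: safe to delete afterwards
        return tmp;
    }

    static void cleanup(File f) {
        if (ownsFile) {
            f.delete();                // only remove what we spooled ourselves
        }
    }

    public static void main(String[] args) throws IOException {
        File f = fileFor(new ByteArrayInputStream("hello".getBytes()));
        System.out.println(ownsFile); // we spooled this one, so cleanup removes it
        cleanup(f);
    }
}
```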
Re: RFC822Parser build error on gump
I know very little about gump, but looking at the log the build seems to have skipped the mime4j artifacts altogether. On Jun 25, 2013, at 6:25 PM, Nick Burch apa...@gagravarr.org wrote: Hi All Anyone have any idea about this compiler error on the tika parsers project as hit by gump? http://vmgump.apache.org/gump/public/tika/tika-parsers/gump_work/build_tika_tika-parsers.html Gump notifications will hopefully start again soon, which'd let us find out about breaking changes from upstream Apache projects in advance, so it'd be good to get the build working ready! Nick
[jira] [Commented] (TIKA-1130) .docx text extract leaves out some portions of text
[ https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13682644#comment-13682644 ] Ray Gauss II commented on TIKA-1130: I've created a unit test that reproduces the issue with a stripped down version of the original file. Shall I comment out the actual test and commit? .docx text extract leaves out some portions of text --- Key: TIKA-1130 URL: https://issues.apache.org/jira/browse/TIKA-1130 Project: Tika Issue Type: Bug Affects Versions: 1.2, 1.3 Environment: OpenJDK x86_64 Reporter: Daniel Gibby Priority: Critical Attachments: Resume 6.4.13.docx When parsing a Microsoft Word .docx (application/vnd.openxmlformats-officedocument.wordprocessingml.document), certain portions of text remain unextracted. I have attached a .docx file that can be tested against. The 'gray' portions of text are what are not extracted, while the darker colored text extracts fine. Looking at the document.xml portion of the .docx zip file shows the text is all there. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (TIKA-1130) .docx text extract leaves out some portions of text
[ https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13682924#comment-13682924 ] Ray Gauss II commented on TIKA-1130: Test file and method committed in r1492909. This was just added onto {{OOXMLParserTest}} and named with a {{disabled}} prefix rather than using {{@Ignore}}. I think we should start moving towards that for new test classes though. .docx text extract leaves out some portions of text --- Key: TIKA-1130 URL: https://issues.apache.org/jira/browse/TIKA-1130 Project: Tika Issue Type: Bug Affects Versions: 1.2, 1.3 Environment: OpenJDK x86_64 Reporter: Daniel Gibby Priority: Critical Attachments: Resume 6.4.13.docx When parsing a Microsoft Word .docx (application/vnd.openxmlformats-officedocument.wordprocessingml.document), certain portions of text remain unextracted. I have attached a .docx file that can be tested against. The 'gray' portions of text are what are not extracted, while the darker colored text extracts fine. Looking at the document.xml portion of the .docx zip file shows the text is all there. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (TIKA-1135) Incorrect Cardinality and Case in IPTC Metadata Definition
[ https://issues.apache.org/jira/browse/TIKA-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Gauss II resolved TIKA-1135. Resolution: Fixed Resolved in r1491935. Incorrect Cardinality and Case in IPTC Metadata Definition -- Key: TIKA-1135 URL: https://issues.apache.org/jira/browse/TIKA-1135 Project: Tika Issue Type: Bug Components: metadata Affects Versions: 1.3 Reporter: Ray Gauss II Assignee: Ray Gauss II Priority: Minor Fix For: 1.4 Some of the fields defined in the {{IPTC}} interface have incorrect cardinality and metadata key names with incorrect case. The change of key names should be done through composite properties which include deprecated versions of the incorrect names as secondary properties for backwards compatibility. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (TIKA-1135) Incorrect Cardinality and Case in IPTC Metadata Definition
Ray Gauss II created TIKA-1135: -- Summary: Incorrect Cardinality and Case in IPTC Metadata Definition Key: TIKA-1135 URL: https://issues.apache.org/jira/browse/TIKA-1135 Project: Tika Issue Type: Bug Components: metadata Affects Versions: 1.3 Reporter: Ray Gauss II Assignee: Ray Gauss II Priority: Minor Fix For: 1.4 Some of the fields defined in the {{IPTC}} interface have incorrect cardinality and metadata key names with incorrect case. The change of key names should be done through composite properties which include deprecated versions of the incorrect names as secondary properties for backwards compatibility. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (TIKA-1133) Ability to Allow Empty and Duplicate Tika Values for XML Elements
Ray Gauss II created TIKA-1133: -- Summary: Ability to Allow Empty and Duplicate Tika Values for XML Elements Key: TIKA-1133 URL: https://issues.apache.org/jira/browse/TIKA-1133 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.3 Reporter: Ray Gauss II Assignee: Ray Gauss II In some cases it is beneficial to allow empty and duplicate Tika metadata values for multi-valued XML elements like RDF bags. Consider an example where the original source metadata is structured something like:
{code}
<Person>
    <FirstName>John</FirstName>
    <LastName>Smith</LastName>
</Person>
<Person>
    <FirstName>Jane</FirstName>
    <LastName>Doe</LastName>
</Person>
<Person>
    <FirstName>Bob</FirstName>
</Person>
<Person>
    <FirstName>Kate</FirstName>
    <LastName>Smith</LastName>
</Person>
{code}
and since Tika stores only flat metadata we transform that before invoking a parser to something like:
{code}
<custom:FirstName>
    <rdf:Bag>
        <rdf:li>John</rdf:li>
        <rdf:li>Jane</rdf:li>
        <rdf:li>Bob</rdf:li>
        <rdf:li>Kate</rdf:li>
    </rdf:Bag>
</custom:FirstName>
<custom:LastName>
    <rdf:Bag>
        <rdf:li>Smith</rdf:li>
        <rdf:li>Doe</rdf:li>
        <rdf:li></rdf:li>
        <rdf:li>Smith</rdf:li>
    </rdf:Bag>
</custom:LastName>
{code}
The current behavior ignores empties and duplicates and we don't know if Bob or Kate ever had last names. Empties or duplicates in other positions result in an incorrect mapping of data. We should allow the option to create an {{ElementMetadataHandler}} which allows empty and/or duplicate values. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (TIKA-1133) Ability to Allow Empty and Duplicate Tika Values for XML Elements
[ https://issues.apache.org/jira/browse/TIKA-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Gauss II resolved TIKA-1133. Resolution: Fixed Fix Version/s: 1.4 Resolved in r1491680. Ability to Allow Empty and Duplicate Tika Values for XML Elements - Key: TIKA-1133 URL: https://issues.apache.org/jira/browse/TIKA-1133 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.3 Reporter: Ray Gauss II Assignee: Ray Gauss II Fix For: 1.4 In some cases it is beneficial to allow empty and duplicate Tika metadata values for multi-valued XML elements like RDF bags. Consider an example where the original source metadata is structured something like:
{code}
<Person>
    <FirstName>John</FirstName>
    <LastName>Smith</LastName>
</Person>
<Person>
    <FirstName>Jane</FirstName>
    <LastName>Doe</LastName>
</Person>
<Person>
    <FirstName>Bob</FirstName>
</Person>
<Person>
    <FirstName>Kate</FirstName>
    <LastName>Smith</LastName>
</Person>
{code}
and since Tika stores only flat metadata we transform that before invoking a parser to something like:
{code}
<custom:FirstName>
    <rdf:Bag>
        <rdf:li>John</rdf:li>
        <rdf:li>Jane</rdf:li>
        <rdf:li>Bob</rdf:li>
        <rdf:li>Kate</rdf:li>
    </rdf:Bag>
</custom:FirstName>
<custom:LastName>
    <rdf:Bag>
        <rdf:li>Smith</rdf:li>
        <rdf:li>Doe</rdf:li>
        <rdf:li></rdf:li>
        <rdf:li>Smith</rdf:li>
    </rdf:Bag>
</custom:LastName>
{code}
The current behavior ignores empties and duplicates and we don't know if Bob or Kate ever had last names. Empties or duplicates in other positions result in an incorrect mapping of data. We should allow the option to create an {{ElementMetadataHandler}} which allows empty and/or duplicate values. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
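The alignment problem this issue describes is easy to demonstrate with plain lists: when empty values are dropped from one bag, positional correlation with the other bag is lost. A small illustration using the names from the example above (only the empty-value case is simulated here; the class and method names are made up):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BagAlignment {
    // Simulate a handler that either drops or keeps empty values.
    static List<String> collect(List<String> values, boolean allowEmpty) {
        List<String> out = new ArrayList<>();
        for (String v : values) {
            if (!allowEmpty && v.isEmpty()) continue; // old behavior: skip empties
            out.add(v);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList("John", "Jane", "Bob", "Kate");
        List<String> last = Arrays.asList("Smith", "Doe", "", "Smith");
        // Dropping Bob's empty last name shifts Kate's last name onto Bob:
        System.out.println(collect(last, false)); // positions no longer line up with first
        // Keeping empties preserves the index-wise pairing with first:
        System.out.println(collect(last, true));
    }
}
```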
Re: MP4Parser triggers .... something between an exception and endDocument() from the ContentHandler's point of view?
I think the Parser interface Javadoc would make sense as a place to document, but I don't know if there is an existing policy. We'll certainly need to consider things like DelegatingParsers which may be using other parsers to do portions of the work. Not the principle comment you were looking for, but my 2 cents. Ray On Jun 7, 2013, at 7:30 AM, Christian Reuschling reuschl...@dfki.uni-kl.de wrote: it would be very interesting if somebody has a principle comment on this thread... On 29.05.2013 14:42, Nick Burch wrote: On Wed, 29 May 2013, Christian Reuschling wrote: Nevertheless, in this case an Exception (like in all other parsers) or a tika body with length zero, which is indicated at least by handler.endDocument() would be the appropriate way, isn't it? - From the ContentHandlers point of view, there is nothing in between. I'm not sure if we do have a properly documented policy on what a parser should do if it receives a file it can't handle. For ones that are invalid (eg corrupt), I believe an exception is the expected result. The case when the file seems valid, but can't be handled by the parser, not sure Does anyone know if we have a policy on this, and/or where we should document it? Nick - -- __ Christian Reuschling, Dipl.-Ing.(BA) Software Engineer Knowledge Management Department German Research Center for Artificial Intelligence DFKI GmbH Trippstadter Straße 122, D-67663 Kaiserslautern, Germany Phone: +49.631.20575-1250 mailto:reuschl...@dfki.de http://www.dfki.uni-kl.de/~reuschling/ - Legal Company Information Required by German Law-- Geschäftsführung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender) Dr. Walter Olthoff Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A.
Aukes Amtsgericht Kaiserslautern, HRB 2313
[jira] [Assigned] (TIKA-1115) ExifHandler throws NullPointerException
[ https://issues.apache.org/jira/browse/TIKA-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Gauss II reassigned TIKA-1115: -- Assignee: Ray Gauss II ExifHandler throws NullPointerException --- Key: TIKA-1115 URL: https://issues.apache.org/jira/browse/TIKA-1115 Project: Tika Issue Type: Bug Components: metadata Affects Versions: 1.3 Environment: verified on Mac OSX and Ubuntu 12.04 Reporter: Lee Graber Assignee: Ray Gauss II Labels: ImageMetadataExtractor Attachments: 654000main_transit-hubble-orig_full.jpg Original Estimate: 2h Remaining Estimate: 2h Notice that in the second if block, there is no check for null on the retrieved datetime. I have hit this with a file which apparently has null for this value. Seems like the fix is trivial:
public void handleDateTags(Directory directory, Metadata metadata) throws MetadataException {
    // Date/Time Original overrides value from ExifDirectory.TAG_DATETIME
    Date original = null;
    if (directory.containsTag(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL)) {
        original = directory.getDate(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL);
        // Unless we have GPS time we don't know the time zone so date must be set
        // as ISO 8601 datetime without timezone suffix (no Z or +/-)
        if (original != null) {
            String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.format(original); // Same time zone as Metadata Extractor uses
            metadata.set(TikaCoreProperties.CREATED, datetimeNoTimeZone);
            metadata.set(Metadata.ORIGINAL_DATE, datetimeNoTimeZone);
        }
    }
    if (directory.containsTag(ExifIFD0Directory.TAG_DATETIME)) {
        Date datetime = directory.getDate(ExifIFD0Directory.TAG_DATETIME);
        String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.format(datetime);
        metadata.set(TikaCoreProperties.MODIFIED, datetimeNoTimeZone);
        // If Date/Time Original does not exist this might be creation date
        if (metadata.get(TikaCoreProperties.CREATED) == null) {
            metadata.set(TikaCoreProperties.CREATED, datetimeNoTimeZone);
        }
    }
}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
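The missing guard mirrors the one already present for the original date in the code above. A standalone sketch of the pattern (SimpleDateFormat stands in for Tika's DATE_UNSPECIFIED_TZ constant, which is an assumption about its type; the class and method names are made up):

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class DateGuard {
    static final SimpleDateFormat FMT = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");

    // Return the formatted date, or null when the directory carried a null date.
    // Without the guard, format(null) throws the NullPointerException reported here.
    static String formatOrNull(Date datetime) {
        if (datetime == null) {
            return null;
        }
        return FMT.format(datetime);
    }

    public static void main(String[] args) {
        System.out.println(formatOrNull(null));
        System.out.println(formatOrNull(new Date(0L)));
    }
}
```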
[jira] [Commented] (TIKA-1115) ExifHandler throws NullPointerException
[ https://issues.apache.org/jira/browse/TIKA-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13646709#comment-13646709 ] Ray Gauss II commented on TIKA-1115: Hi Lee, Do we have permission to include the problem file at a greatly reduced size, say 64px wide, as a test file? ExifHandler throws NullPointerException --- Key: TIKA-1115 URL: https://issues.apache.org/jira/browse/TIKA-1115 Project: Tika Issue Type: Bug Components: metadata Affects Versions: 1.3 Environment: verified on Mac OSX and Ubuntu 12.04 Reporter: Lee Graber Assignee: Ray Gauss II Labels: ImageMetadataExtractor Attachments: 654000main_transit-hubble-orig_full.jpg Original Estimate: 2h Remaining Estimate: 2h Notice that in the second if block, there is no check for null on the retrieved datetime. I have hit this with a file which apparently has null for this value. Seems like the fix is trivial:
public void handleDateTags(Directory directory, Metadata metadata) throws MetadataException {
    // Date/Time Original overrides value from ExifDirectory.TAG_DATETIME
    Date original = null;
    if (directory.containsTag(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL)) {
        original = directory.getDate(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL);
        // Unless we have GPS time we don't know the time zone so date must be set
        // as ISO 8601 datetime without timezone suffix (no Z or +/-)
        if (original != null) {
            String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.format(original); // Same time zone as Metadata Extractor uses
            metadata.set(TikaCoreProperties.CREATED, datetimeNoTimeZone);
            metadata.set(Metadata.ORIGINAL_DATE, datetimeNoTimeZone);
        }
    }
    if (directory.containsTag(ExifIFD0Directory.TAG_DATETIME)) {
        Date datetime = directory.getDate(ExifIFD0Directory.TAG_DATETIME);
        String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.format(datetime);
        metadata.set(TikaCoreProperties.MODIFIED, datetimeNoTimeZone);
        // If Date/Time Original does not exist this might be creation date
        if (metadata.get(TikaCoreProperties.CREATED) == null) {
            metadata.set(TikaCoreProperties.CREATED, datetimeNoTimeZone);
        }
    }
}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (TIKA-1115) ExifHandler throws NullPointerException
[ https://issues.apache.org/jira/browse/TIKA-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ray Gauss II resolved TIKA-1115. Resolution: Fixed Fix Version/s: 1.4 Resolved in r1478111 ExifHandler throws NullPointerException --- Key: TIKA-1115 URL: https://issues.apache.org/jira/browse/TIKA-1115 Project: Tika Issue Type: Bug Components: metadata Affects Versions: 1.3 Environment: verified on Mac OSX and Ubuntu 12.04 Reporter: Lee Graber Assignee: Ray Gauss II Labels: ImageMetadataExtractor Fix For: 1.4 Attachments: 654000main_transit-hubble-orig_full.jpg Original Estimate: 2h Remaining Estimate: 2h Notice that in the second if block, there is no check for null on the retrieved datetime. I have hit this with a file which apparently has null for this value. Seems like the fix is trivial:
public void handleDateTags(Directory directory, Metadata metadata) throws MetadataException {
    // Date/Time Original overrides value from ExifDirectory.TAG_DATETIME
    Date original = null;
    if (directory.containsTag(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL)) {
        original = directory.getDate(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL);
        // Unless we have GPS time we don't know the time zone so date must be set
        // as ISO 8601 datetime without timezone suffix (no Z or +/-)
        if (original != null) {
            String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.format(original); // Same time zone as Metadata Extractor uses
            metadata.set(TikaCoreProperties.CREATED, datetimeNoTimeZone);
            metadata.set(Metadata.ORIGINAL_DATE, datetimeNoTimeZone);
        }
    }
    if (directory.containsTag(ExifIFD0Directory.TAG_DATETIME)) {
        Date datetime = directory.getDate(ExifIFD0Directory.TAG_DATETIME);
        String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.format(datetime);
        metadata.set(TikaCoreProperties.MODIFIED, datetimeNoTimeZone);
        // If Date/Time Original does not exist this might be creation date
        if (metadata.get(TikaCoreProperties.CREATED) == null) {
            metadata.set(TikaCoreProperties.CREATED, datetimeNoTimeZone);
        }
    }
}
--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: Build failed in Jenkins: Tika-trunk #994
Looks like a possible build server problem. Does anyone have access to manually trigger another build? Regards, Ray On May 1, 2013, at 5:01 PM, Apache Jenkins Server jenk...@builds.apache.org wrote: See https://builds.apache.org/job/Tika-trunk/994/changes
Re: Build failed in Jenkins: Tika-trunk #994
Subject: Jenkins build is back to normal : Tika-trunk #995 Yay, thanks! On May 1, 2013, at 5:24 PM, Michael McCandless luc...@mikemccandless.com wrote: I just kicked off another build ... (it's queued). Mike McCandless http://blog.mikemccandless.com On Wed, May 1, 2013 at 5:12 PM, Ray Gauss II ray.ga...@alfresco.com wrote: Looks like a possible build server problem. Does anyone have access to manually trigger another build? Regards, Ray On May 1, 2013, at 5:01 PM, Apache Jenkins Server jenk...@builds.apache.org wrote: See https://builds.apache.org/job/Tika-trunk/994/changes
[jira] [Commented] (TIKA-1074) Extraction should continue if an exception is hit visiting an embedded document
[ https://issues.apache.org/jira/browse/TIKA-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13584194#comment-13584194 ]

Ray Gauss II commented on TIKA-1074:

bq. But it's a little weird throw TikaExc in response to an interrupt (ie, code above will be trying to catch an IE) ... I think it's cleaner to set the interrupt bit and let the next place that waits see the interrupt bit and throw IE?

That's what I found in my investigation for TIKA-775 / TIKA-1059 as well.

Extraction should continue if an exception is hit visiting an embedded document
-------------------------------------------------------------------------------
Key: TIKA-1074
URL: https://issues.apache.org/jira/browse/TIKA-1074
Project: Tika
Issue Type: Improvement
Components: parser
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 1.4
Attachments: TIKA-1074.patch, TIKA-1074.patch

Spinoff from TIKA-1072. In that issue, a problematic document (still not sure if the document is corrupt, or whether it's possibly a POI bug) caused an exception when visiting the embedded documents. If I change Tika to suppress that exception, the rest of the document extracts fine. So somehow I think we should be more robust here, and maybe log the exception, or save/record the exception(s) somewhere so after parsing the app could decide what to do about them ...
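The "set the interrupt bit" approach endorsed in the comment above is standard Java practice: rather than swallowing {{InterruptedException}} or immediately translating it, re-assert the thread's interrupt status so the next blocking call (or any caller that checks) still observes the interrupt. A standalone sketch of the pattern, not the actual Tika code:

```java
class InterruptRestore {
    // If interrupted during a blocking call, restore the interrupt bit
    // instead of swallowing the exception, so the next wait point (or any
    // caller checking the flag) still sees the interrupt and can throw
    // InterruptedException itself.
    static void doWork() {
        try {
            Thread.sleep(10);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // re-assert interrupt status
        }
    }

    public static void main(String[] args) {
        Thread.currentThread().interrupt();  // simulate an interrupt arriving
        doWork();  // sleep() throws immediately; catch block restores the flag
        System.out.println(Thread.currentThread().isInterrupted());  // prints "true"
    }
}
```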
[jira] [Commented] (TIKA-1068) Metadata-extractor throws NoSuchMethodError for jpg image with xmp header data
[ https://issues.apache.org/jira/browse/TIKA-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13566693#comment-13566693 ]

Ray Gauss II commented on TIKA-1068:

I can't reproduce this using tika-app from either the download distribution or compiled from source. We're using the 2.6.2 metadata-extractor jar from the Maven central repository [1]. I'm not sure how your build is structured, but perhaps you're including a 2.6.2 metadata-extractor jar you've downloaded from elsewhere? If so, can you try replacing that with the one on Maven central?

[1] http://search.maven.org/#artifactdetails%7Ccom.drewnoakes%7Cmetadata-extractor%7C2.6.2%7Cjar

Metadata-extractor throws NoSuchMethodError for jpg image with xmp header data
------------------------------------------------------------------------------
Key: TIKA-1068
URL: https://issues.apache.org/jira/browse/TIKA-1068
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.3
Reporter: Magnus Lövgren
Priority: Critical
Attachments: vinter080501-66.jpg

Using Tika 1.3, parsing of jpg files throws NoSuchMethodError when the jpg contains xmp data. No Error was thrown in Tika 1.2. The metadata-extractor was updated in Tika 1.3 (to com.drewnoakes:metadata-extractor:2.6.2), see TIKA-811 (duplicated by TIKA-996). That jar is badly compiled (as mentioned by Emmanuel Hugonnet in a comment on TIKA-915) and causes the NoSuchMethodError, so the metadata-extractor 2.6.2 jar needs to be replaced! The problem seems fixed in metadata-extractor 2.7.0, but that isn't released yet.
Discussions available at:
http://code.google.com/p/metadata-extractor/issues/detail?id=39
http://code.google.com/p/metadata-extractor/issues/detail?id=55

Code to reproduce problem:

{code}
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-core</artifactId>
  <version>1.3</version>
</dependency>
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-xmp</artifactId>
  <version>1.3</version>
</dependency>
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.3</version>
</dependency>
{code}

{code}
InputStream inputStream = ... // vinter080501-66.jpg file (attached)
ContentHandler contentHandler = new BodyContentHandler(200);
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
Parser parser = new AutoDetectParser();
// Throws NoSuchMethodError
parser.parse(inputStream, contentHandler, metadata, context);
{code}

{noformat}
java.lang.NoSuchMethodError: com.adobe.xmp.properties.XMPPropertyInfo.getValue()Ljava/lang/Object;
    at com.drew.metadata.xmp.XmpReader.extract(Unknown Source)
    at com.drew.imaging.jpeg.JpegMetadataReader.extractMetadataFromJpegSegmentReader(Unknown Source)
    at com.drew.imaging.jpeg.JpegMetadataReader.readMetadata(Unknown Source)
    at org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:91)
    at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
{noformat}
Re: [VOTE] Apache Tika 1.3 Release Candidate #1
Built on OS X, updated tika-exiftool to depend on 1.3, which compiled and passed tests.

+1 for release!

Cheers,
Ray

On Jan 18, 2013, at 11:30 PM, Dave Meikle loo...@gmail.com wrote:

Hi Guys,

A candidate for the Tika 1.3 release is available at:
http://people.apache.org/~dmeikle/apache-tika-1.3-rc1/

The release candidate is a zip archive of the sources in:
http://svn.apache.org/repos/asf/tika/tags/tika-1.3/

The SHA1 checksum of the archive is a80e45d1976e655381d6e93b50b9c7b118e9d6fc.

A staged M2 repository can also be found on repository.apache.org here:
https://repository.apache.org/content/repositories/orgapachetika-147/

Please vote on releasing this package as Apache Tika 1.3. The vote is open for the next 72 hours and passes if a majority of at least three +1 Tika PMC votes are cast.

[ ] +1 Release this package as Apache Tika 1.3
[ ] -1 Do not release this package because...

Here is my +1 for the release.

Cheers,
Dave
[jira] [Created] (TIKA-1059) Better Handling of InterruptedException in ExternalParser and ExternalEmbedder
Ray Gauss II created TIKA-1059:

Summary: Better Handling of InterruptedException in ExternalParser and ExternalEmbedder
Key: TIKA-1059
URL: https://issues.apache.org/jira/browse/TIKA-1059
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.3
Reporter: Ray Gauss II
Fix For: 1.4

The {{ExternalParser}} and {{ExternalEmbedder}} classes currently catch {{InterruptedException}} and ignore it. The methods should either call {{interrupt()}} on the current thread or re-throw the exception, possibly wrapped in a {{TikaException}}. See TIKA-775 for a previous discussion.
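The issue's two suggested remedies can also be combined: restore the interrupt bit and re-throw the exception wrapped in a checked exception. The sketch below shows that shape; it is illustrative only, not Tika's actual {{ExternalParser}} code. {{ToolInterruptedException}} is a stand-in for {{TikaException}}, and the {{Callable}} stands in for the blocking {{Process.waitFor()}} call:

```java
import java.util.concurrent.Callable;

class ExternalToolRunner {
    // Stand-in for org.apache.tika.exception.TikaException in this sketch
    static class ToolInterruptedException extends Exception {
        ToolInterruptedException(String msg, Throwable cause) { super(msg, cause); }
    }

    // Instead of catching and ignoring InterruptedException (the behavior
    // TIKA-1059 flags), restore the interrupt bit and re-throw it wrapped.
    static int await(Callable<Integer> blockingWait) throws ToolInterruptedException {
        try {
            return blockingWait.call();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // preserve interrupt status for callers
            throw new ToolInterruptedException("Interrupted waiting for external tool", e);
        } catch (Exception e) {
            throw new ToolInterruptedException("External tool failed", e);
        }
    }

    public static void main(String[] args) {
        try {
            await(() -> { throw new InterruptedException(); });
        } catch (ToolInterruptedException e) {
            // The cause is preserved and the interrupt bit is still set
            System.out.println(e.getCause() instanceof InterruptedException);  // prints "true"
            System.out.println(Thread.currentThread().isInterrupted());       // prints "true"
        }
    }
}
```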
[jira] [Resolved] (TIKA-775) Embed Capabilities
[ https://issues.apache.org/jira/browse/TIKA-775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II resolved TIKA-775.
Resolution: Fixed
Fix Version/s: (was: 1.4) 1.3
Assignee: Ray Gauss II

Embed Capabilities
------------------
Key: TIKA-775
URL: https://issues.apache.org/jira/browse/TIKA-775
Project: Tika
Issue Type: Improvement
Components: general, metadata
Affects Versions: 1.0
Environment: The default ExternalEmbedder requires that sed be installed.
Reporter: Ray Gauss II
Assignee: Ray Gauss II
Labels: embed, patch
Fix For: 1.3
Attachments: embed_20121029.diff, embed.diff, tika-core-embed-patch.txt, tika-parsers-embed-patch.txt

This patch defines and implements the concept of embedding Tika metadata into a file stream, the reverse of extraction. In the tika-core project, an interface defining an Embedder and a generic sed-based ExternalEmbedder implementation, meant to be extended or configured, are added. These classes are essentially a reverse flow of the existing Parser and ExternalParser classes. In the tika-parsers project, an ExternalEmbedderTest unit test is added which uses the default ExternalEmbedder (calls sed) to embed a value placed in Metadata.DESCRIPTION and then verifies the operation by parsing the resulting stream.
[jira] [Updated] (TIKA-1059) Better Handling of InterruptedException in ExternalParser and ExternalEmbedder
[ https://issues.apache.org/jira/browse/TIKA-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II updated TIKA-1059:
Issue Type: Improvement (was: Bug)

Better Handling of InterruptedException in ExternalParser and ExternalEmbedder
------------------------------------------------------------------------------
Key: TIKA-1059
URL: https://issues.apache.org/jira/browse/TIKA-1059
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.3
Reporter: Ray Gauss II
Fix For: 1.4

The {{ExternalParser}} and {{ExternalEmbedder}} classes currently catch {{InterruptedException}} and ignore it. The methods should either call {{interrupt()}} on the current thread or re-throw the exception, possibly wrapped in a {{TikaException}}. See TIKA-775 for a previous discussion.
[jira] [Assigned] (TIKA-1056) unify ImageMetadataExtractor interface
[ https://issues.apache.org/jira/browse/TIKA-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II reassigned TIKA-1056:
Assignee: Ray Gauss II

unify ImageMetadataExtractor interface
--------------------------------------
Key: TIKA-1056
URL: https://issues.apache.org/jira/browse/TIKA-1056
Project: Tika
Issue Type: Wish
Reporter: Maciej Lizewski
Assignee: Ray Gauss II
Priority: Trivial

There are several methods in this class that are targeted at different image types but have different visibility:

{code}
public void parseJpeg(File file);
protected void parseTiff(InputStream stream);
{code}

Both simply extract all possible metadata from an image file or stream. It would be nice if parseTiff could also be public, so it would be easier to create custom parsers located in external jars that use this functionality.
[jira] [Resolved] (TIKA-1056) unify ImageMetadataExtractor interface
[ https://issues.apache.org/jira/browse/TIKA-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II resolved TIKA-1056.
Resolution: Fixed
Fix Version/s: 1.3

Resolved in r1434117.

unify ImageMetadataExtractor interface
--------------------------------------
Key: TIKA-1056
URL: https://issues.apache.org/jira/browse/TIKA-1056
Project: Tika
Issue Type: Wish
Reporter: Maciej Lizewski
Assignee: Ray Gauss II
Priority: Trivial
Fix For: 1.3

There are several methods in this class that are targeted at different image types but have different visibility:

{code}
public void parseJpeg(File file);
protected void parseTiff(InputStream stream);
{code}

Both simply extract all possible metadata from an image file or stream. It would be nice if parseTiff could also be public, so it would be easier to create custom parsers located in external jars that use this functionality.
[jira] [Resolved] (TIKA-962) Backwards Compatibility for Metadata.LAST_AUTHOR is Broken
[ https://issues.apache.org/jira/browse/TIKA-962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II resolved TIKA-962.
Resolution: Fixed

This has been fixed, but I didn't resolve for 1.3 as I thought it might be worthy of a fix release.

Backwards Compatibility for Metadata.LAST_AUTHOR is Broken
----------------------------------------------------------
Key: TIKA-962
URL: https://issues.apache.org/jira/browse/TIKA-962
Project: Tika
Issue Type: Bug
Components: metadata
Affects Versions: 1.2
Reporter: Ray Gauss II
Assignee: Ray Gauss II
Priority: Critical
Fix For: 1.3

As a result of changes in TIKA-930, support for the deprecated Metadata.LAST_AUTHOR property has been dropped. The new TikaCoreProperties.MODIFIED should be a composite property containing Metadata.LAST_AUTHOR. Should we consider a fix release for this?
[jira] [Resolved] (TIKA-963) Backwards Compatibility for Metadata.DATE is Incorrect
[ https://issues.apache.org/jira/browse/TIKA-963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II resolved TIKA-963.
Resolution: Fixed

This has been fixed, but I didn't resolve for 1.3 as I thought it might be worthy of a fix release.

Backwards Compatibility for Metadata.DATE is Incorrect
------------------------------------------------------
Key: TIKA-963
URL: https://issues.apache.org/jira/browse/TIKA-963
Project: Tika
Issue Type: Bug
Components: metadata
Affects Versions: 1.2
Reporter: Ray Gauss II
Assignee: Ray Gauss II
Priority: Critical
Fix For: 1.3

Metadata.DATE was always somewhat ambiguous, but during the consolidation in TIKA-930 it was incorrectly assumed that most parsers used it as a creation date. Metadata.DATE needs to instead be part of the TikaCoreProperties.MODIFIED composite property.
Re: [DISCUSS] Release Candidate for 1.3?
The code for TIKA-775 [1] is on trunk, but the issue was re-opened with some concerns. Some of those have been addressed and some are still open discussions, though I think they are minor enough that we could create separate issues if need be and resolve TIKA-775 as fixed.

[1] https://issues.apache.org/jira/browse/TIKA-775

On Jan 8, 2013, at 4:56 PM, Dave Meikle loo...@gmail.com wrote:

Hi All,

We have got some new features and bugs fixed, with a couple of outstanding binary compatibility ones (TIKA-962, TIKA-963) fixed on trunk, so I was wondering if it was time for a 1.3 release? Also, happy to do the Release Management for it.

Cheers,
Dave
[jira] [Resolved] (TIKA-895) Empty title element makes Tika-generated HTML documents not open
[ https://issues.apache.org/jira/browse/TIKA-895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II resolved TIKA-895.
Resolution: Duplicate
Assignee: Ray Gauss II

Empty title element makes Tika-generated HTML documents not open
----------------------------------------------------------------
Key: TIKA-895
URL: https://issues.apache.org/jira/browse/TIKA-895
Project: Tika
Issue Type: Bug
Components: metadata
Affects Versions: 1.1
Environment: Windows 7
Reporter: Benoit MAGGI
Assignee: Ray Gauss II
Priority: Trivial
Labels: newbie

I try to transform an empty docx to an html file, e.g.:

{noformat}
java -jar tika-app-1.1.jar -x example.docx t.html
{noformat}

The html file can't be opened with Firefox, Internet Explorer, or Chrome. The main point is that {{<title/>}} seems to be forbidden by the HTML specification (can't get the point on HTML5):

bq. http://www.w3.org/TR/html401/struct/global.html#h-7.4.2
bq. 7.4.2 The TITLE element
bq. {{<!-- The TITLE element is not considered part of the flow of text. It should be displayed, for example as the page header or window title. Exactly one title is required per document. -->}}
bq. {{<!ELEMENT TITLE - - (#PCDATA) -(%head.misc;) -- document title -->}} (see http://www.w3.org/TR/html401/sgml/dtd.html#head.misc)
bq. {{<!ATTLIST TITLE %i18n>}} (see http://www.w3.org/TR/html401/sgml/dtd.html#i18n)
bq. *Start tag: required, End tag: required*

For information, there was the same bug with xls: https://issues.apache.org/jira/browse/TIKA-725

The simple solution should be to provide an empty title by default.
[jira] [Reopened] (TIKA-725) Empty title element makes Tika-generated HTML documents not open in Chromium
[ https://issues.apache.org/jira/browse/TIKA-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II reopened TIKA-725:
Assignee: Ray Gauss II (was: Jukka Zitting)

Confirmed that the problem remains when a {{TransformerHandler}} is used, such as those obtained from {{SAXTransformerFactory}} in {{TikaCLI}} and {{TikaGUI}}. I've investigated and developed a workaround.

Empty title element makes Tika-generated HTML documents not open in Chromium
----------------------------------------------------------------------------
Key: TIKA-725
URL: https://issues.apache.org/jira/browse/TIKA-725
Project: Tika
Issue Type: Bug
Components: general
Affects Versions: 0.9
Environment: Chromium 12 on Ubuntu Linux
Reporter: Henri Bergius
Assignee: Ray Gauss II
Priority: Minor
Labels: html
Fix For: 0.10

Currently when converting Excel sheets (both XLS and XLSX), Tika generates an empty title element as {{<title/>}} in the document HEAD section. This causes Chromium not to display the document contents. Switching it to {{<title></title>}} fixes this.
[jira] [Resolved] (TIKA-914) Invalid self-closing title tag when parsing an RTF file
[ https://issues.apache.org/jira/browse/TIKA-914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II resolved TIKA-914.
Resolution: Duplicate
Assignee: Ray Gauss II

Invalid self-closing title tag when parsing an RTF file
-------------------------------------------------------
Key: TIKA-914
URL: https://issues.apache.org/jira/browse/TIKA-914
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.1
Environment: Reproduced on Linux and Windows
Reporter: Nicolas Guillaumin
Assignee: Ray Gauss II
Priority: Minor
Labels: rtf
Attachments: test.rtf

When parsing an RTF file with an empty TITLE metadata, the resulting HTML contains a self-closing title tag:

{code}
$ java -jar tika-app-1.1.jar -h test.rtf
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Content-Length" content="830468"/>
<meta name="Content-Type" content="application/rtf"/>
<meta name="resourceName" content="test.rtf"/>
<title/>
</head>
[...]
{code}

I believe self-closing tags are not valid in XHTML, according to http://www.w3.org/TR/xhtml1/#C_3 (however, there's no XHTML doctype generated here, just a namespace...). Anyway, this causes some browsers like Chrome to fail parsing the HTML, resulting in a blank page being displayed. The expected output would be a non-self-closing empty tag: {{<title></title>}}
[jira] [Resolved] (TIKA-725) Empty title element makes Tika-generated HTML documents not open in Chromium
[ https://issues.apache.org/jira/browse/TIKA-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Gauss II resolved TIKA-725.
Resolution: Fixed
Fix Version/s: 1.3

When a {{TransformerHandler}} is used, the actual writing of the final elements is delegated to an XML serializer such as {{ToHTMLStream}}, which extends {{ToStream}}. When {{ToStream.characters}} is called with zero length it returns immediately and does not close the start tag of the current element, and {{ToStream.endElement}} checks whether the start tag is open to determine whether to close the element as {{<title/>}} or {{<title></title>}}.

It seems the code brought over from the Xalan project to the JDK was locked down quite a bit during the transition. When using Xalan directly, an alternate XML serializer can be specified via XSLT or other means [1], but in the JDK that functionality seems to have been removed, as {{TransletOutputHandlerFactory.getSerializationHandler}} has ToHTMLStream hard-coded. Additionally, ToHTMLStream is declared final, and the majority of the classes one would normally extend to use a different {{TransletOutputHandlerFactory}} are internal, so a proper solution would likely involve depending on Xalan directly or duplicating a whole lot of code, neither of which is ideal.

As a workaround, an {{ExpandedTitleContentHandler}} content handler decorator was added which checks for the previous fix for this issue, a call to {{characters(new char[0], 0, 0)}} for the title element, and if present changes the length to 1 and then catches the expected {{ArrayIndexOutOfBoundsException}} thrown by {{ToStream.characters}}. The result is that the title start tag is closed, since the check for zero length passes, and no character writing is attempted.

{{TikaCLI}} was modified to wrap the transformer handler returned by {{SAXTransformerFactory}} for the {{html}} output method, so only handling of the {{title}} tag for HTML output will be affected by the change.
In the event that this approach has adverse effects for those using XML serializers other than those present in the JDK, the change to {{TikaCLI}} can be reverted or made an option. Those calling Tika programmatically will need to wrap their transformer handlers in an {{ExpandedTitleContentHandler}} as well, i.e.:

{code}
SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, indent);
handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, encoding);
handler.setResult(new StreamResult(output));
return new ExpandedTitleContentHandler(handler);
{code}

Resolved in r1423538.

[1] http://xml.apache.org/xalan-j/usagepatterns.html

Empty title element makes Tika-generated HTML documents not open in Chromium
----------------------------------------------------------------------------
Key: TIKA-725
URL: https://issues.apache.org/jira/browse/TIKA-725
Project: Tika
Issue Type: Bug
Components: general
Affects Versions: 0.9
Environment: Chromium 12 on Ubuntu Linux
Reporter: Henri Bergius
Assignee: Ray Gauss II
Priority: Minor
Labels: html
Fix For: 1.3, 0.10

Currently when converting Excel sheets (both XLS and XLSX), Tika generates an empty title element as {{<title/>}} in the document HEAD section. This causes Chromium not to display the document contents. Switching it to {{<title></title>}} fixes this.