[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14747338#comment-14747338 ]

Tim Allison edited comment on TIKA-1607 at 9/16/15 11:31 AM:
-------------------------------------------------------------

Thank you, [~rgauss], for your thoughtful responses and example code!


Yes, you're absolutely right about POJOs and helper classes for the common 
elements.  Thank you.

I agree that most of my comments really had to do with pass-through.

bq. we'd first have to 'merge' with the metadata being modeled by the parsers 
and could then allow access to the full DOM Document object which clients could 
easily serialize to a string if need be.

Agreed. My thought is a crude/knuckle-dragging/transitional approach to this 
type of merge using our current simple structure.  If there is an XMP packet 
(or multiple packets, as you point out) or any other xmlified standard in a 
document, we'd keep our current simple structure and store the XMP as a String 
(or encoded byte[]?) and let clients parse the String/byte[] to a DOM 
themselves.  Alternatively, we could add a new property type ("DOM") with a 
helper method that returns a DOM object, similar to what we're doing now with 
{{getDate()}} and {{getInt()}}.  This would be in addition to pulling out the 
most commonly used Dublin Core elements into our current structure, as we're 
doing now (or maybe not, if there is a conflict with the native metadata?).
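
To make that concrete, here's a rough sketch of what I have in mind.  Note 
that {{XMP_RAW}} and {{getDOM()}} are hypothetical names for this discussion; 
neither exists in Tika today:

{code}
// Hypothetical sketch only -- neither XMP_RAW nor getDOM() exists in Tika today.
// A parser stores the raw XMP packet verbatim as a String property, and a
// helper (analogous to getDate()/getInt()) parses it to a DOM Document on demand.
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.Property;

public class XmpAsStringSketch {

    // hypothetical property holding the raw XMP packet verbatim
    public static final Property XMP_RAW = Property.internalText("xmp:raw");

    // hypothetical helper, analogous to Metadata.getDate()/getInt()
    public static Document getDOM(Metadata metadata, Property property) throws Exception {
        String xml = metadata.get(property);
        if (xml == null) {
            return null;
        }
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // a real implementation would also harden the factory against XXE
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }
}
{code}

A client that only wants the raw packet would still call 
{{metadata.get(XMP_RAW)}}; a client that wants structure would call the helper 
and get a DOM back.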

If the xmlified standard doesn't exist in the document, but there is a 
known+obvious standard (e.g. PBCore), the parser could generate that XML String 
from the file's metadata and store it as an element in our current structure: 
Property pbcore...or similar.
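
Something along these lines, purely for illustration ({{pbcore:xml}} is a 
made-up property name, and the PBCore generation itself is hand-waved here):

{code}
// Hypothetical sketch -- "pbcore:xml" is not a real Tika property name.
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.Property;

public class PbcorePassThroughSketch {

    // one key that carries the whole generated PBCore document verbatim
    public static final Property PBCORE_XML = Property.internalText("pbcore:xml");

    // a parser that understands the file's native metadata would do something like:
    public static void store(Metadata metadata, String generatedPbcoreXml) {
        metadata.set(PBCORE_XML, generatedPbcoreXml);
    }
}
{code}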

Another benefit of this transitional approach is that we could store both the 
original XMP packet(s) _and_ the native metadata, and we wouldn't have to 
worry about deconflicting...the user can recover that the XMP said the author 
was "Joe Smith" but the native metadata said the author was "Bob Doe".  
Currently, at least in PDFParser, we're overwriting native metadata items 
with XMP metadata.  Perhaps, though, this is an edge case, and most users just 
want "one answer"...

Another benefit is that if there is something non-standard/unparseable in the 
stored XML string, the client could still recover the String (or byte[]?) that 
was stored in the original document via the current {{get(Property property)}}.

bq. bring these different sources into a unified persistence structure

The above "thought" (not even a proposal!) tentatively approaches that in an 
inelegant way, and it has a strong odor of hack that I don't like.  I very much 
appreciate the goal of a unified structure!



> Introduce new arbitrary object key/values data structure for persistence of 
> Tika Metadata
> -----------------------------------------------------------------------------------------
>
>                 Key: TIKA-1607
>                 URL: https://issues.apache.org/jira/browse/TIKA-1607
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, metadata
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Critical
>             Fix For: 1.11
>
>         Attachments: TIKA-1607v1_rough_rough.patch, 
> TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch
>
>
> I am currently working on implementing more comprehensive extraction and 
> enhancement of the Tika support for phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection<HashMap<String/Property, 
> HashMap<String/Property, String/Int/Long>>> e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode: US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> (LibPN-CountryCode: UK), (LibPN-NumberType: International), (etc: etc)...), 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the core Metadata API. I hope, 
> however, that the <String, Object> mapping is flexible enough to allow me to 
> model Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis


