[
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167208#comment-15167208
]
Tim Allison commented on TIKA-1607:
-----------------------------------
Aside from XMP, I can't think of an example where we'd have multiple DOMs of
the same type (property name). For some (rare) PDF files, I could see having a
DOM for XFA and one or more DOMs for XMP, but they'd be under different
keys...in my current plan.
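To make the "different keys" idea concrete, here is a minimal sketch in plain Java of a store that keeps zero or more DOMs per key. It is an illustration under assumptions, not an existing Tika class: `DomStore`, its methods, and key names such as "xmp" and "xfa" are all hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical holder, NOT an existing Tika class: DOMs grouped by key.
class DomStore {
    private final Map<String, List<Document>> doms = new LinkedHashMap<>();

    // Parses the XML string and appends the resulting DOM under the given key.
    // Parse failures are rethrown unchecked to keep the sketch short.
    void add(String key, String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            doms.computeIfAbsent(key, k -> new ArrayList<>()).add(doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML for key " + key, e);
        }
    }

    // Returns the DOMs stored under the key, or an empty list if there are none.
    List<Document> get(String key) {
        return doms.getOrDefault(key, new ArrayList<>());
    }
}
```

With this shape, a PDF carrying one XFA form and two XMP packets would end up with one entry under the XFA key and two under the XMP key.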
I could also see someone modifying an existing parser to generate a DOM for
this type of field, say, by translating the metadata we extract from a
multimedia file into PBCore.
On the one hand, this is a hack on the way to your unified DOM proposal...basic
users can get what they want from key/value, and advanced users who actually
know a given standard can find what they need.
On the other, this would allow advanced users to extract potentially
conflicting metadata (one XMP packet has dc:creator X, but the updated XMP
packet has dc:creator Y...and we even have this in one of our test files :)).
By following the XMP standard (iirc), the more recent packet information would
overwrite the earlier. Some users will want the "standard" (dc:creator=Y);
some advanced users might want "all" (dc:creator=X;Y).
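The two views can be sketched roughly as follows. This is plain Java, not Tika's API: `XmpPacketViews` and its method names are made up for illustration, and each packet is assumed to be already flattened to property/value pairs, with the list in document order (oldest first).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, NOT part of Tika. Each Map is one XMP packet's
// property -> value pairs; the list is in document order (oldest first).
class XmpPacketViews {

    // "Standard" view: a later packet's value replaces an earlier one.
    static Map<String, String> latestWins(List<Map<String, String>> packets) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> packet : packets) {
            merged.putAll(packet);
        }
        return merged;
    }

    // "All" view: keep every value seen for a property, in packet order.
    static Map<String, List<String>> allValues(List<Map<String, String>> packets) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map<String, String> packet : packets) {
            for (Map.Entry<String, String> e : packet.entrySet()) {
                merged.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
            }
        }
        return merged;
    }
}
```

For the X/Y case above, `latestWins` yields dc:creator=Y and `allValues` yields dc:creator=[X, Y].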
The initial motivation for giving access to the raw bytes was this: if we
allow access to the raw bytes for a DOM, super advanced users could run their
own content stripping that might not care about slightly dodgy/invalid XML,
and we already have an example of invalid XMP in one of our multimedia
files.
However, I'm persuaded that making "bytes" available could lead to disaster.
> Introduce new arbitrary object key/values data structure for persistence of
> Tika Metadata
> -----------------------------------------------------------------------------------------
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
> Issue Type: Improvement
> Components: core, metadata
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Critical
> Fix For: 1.13
>
> Attachments: TIKA-1607_bytes_dom_values.patch,
> TIKA-1607v1_rough_rough.patch, TIKA-1607v2_rough_rough.patch,
> TIKA-1607v3.patch
>
>
> I am currently working on implementing more comprehensive extraction and
> enhancement of the Tika support for Phone number extraction and metadata
> modeling.
> Right now we utilize the String[] multivalued support available within Tika
> to persist phone numbers as
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the
> String[] paradigm by implementing a more abstract Collection of Objects such
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String: Object
> {code}
> Where Object could be a Collection<HashMap<String/Property,
> HashMap<String/Property, String/Int/Long>>>, e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US),
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054:
> (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...),
> (etc)]
> {code}
> There are obvious backwards compatibility issues with this approach...
> additionally it is a fundamental change to the core Metadata API. I hope that
> the <String, Object> Mapping however is flexible enough to allow me to model
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis
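The String-to-Object shape proposed in the description could be modeled roughly as below. This is a sketch in plain Java, not the Tika Metadata API: `ObjectMetadata`, `phoneEntry`, and `example` are hypothetical names, and attribute values are kept as Strings for brevity where the proposal also allows Int/Long.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical String -> Object store, NOT Tika's Metadata class.
class ObjectMetadata {
    private final Map<String, Object> values = new HashMap<>();

    void set(String key, Object value) {
        values.put(key, value);
    }

    // Typed read: returns null if the key is absent, and throws
    // ClassCastException if the stored value is not of the requested type.
    <T> T get(String key, Class<T> type) {
        return type.cast(values.get(key));
    }

    // Builds one entry of the form: phone number -> (attribute -> value).
    static Map<String, Map<String, String>> phoneEntry(
            String number, String countryCode, String numberType) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("LibPN-CountryCode", countryCode);
        attrs.put("LibPN-NumberType", numberType);
        Map<String, Map<String, String>> entry = new LinkedHashMap<>();
        entry.put(number, attrs);
        return entry;
    }

    // Stores the first sample number from the description under "phonenumbers".
    static ObjectMetadata example() {
        Collection<Map<String, Map<String, String>>> numbers = new ArrayList<>();
        numbers.add(phoneEntry("+162648743476", "US", "International"));
        ObjectMetadata metadata = new ObjectMetadata();
        metadata.set("phonenumbers", numbers);
        return metadata;
    }
}
```

The sketch also shows the cost of the approach: every reader has to know, or check at run time, what type sits behind a key, which is one face of the backwards-compatibility concern raised in the description.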