[
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503982#comment-14503982
]
Nick Burch edited comment on TIKA-1607 at 4/20/15 11:53 PM:
------------------------------------------------------------
Historically, we've always required that everything on Metadata be a String,
both key and value. Properties add support for converting those Strings to and
from more helpful types, while still allowing simple, backwards-compatible
fetching for people who don't want that.
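As a rough illustration of that idea (not Tika's actual classes), here is a
minimal Java sketch of a String-only store with a typed accessor layered on
top. The class name StringOnlyMetadata and its methods are hypothetical and
exist only for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a String-only metadata store; not Tika's real API.
final class StringOnlyMetadata {
    private final Map<String, String> values = new HashMap<>();

    // Plain access: key and value are both Strings.
    void set(String key, String value) {
        values.put(key, value);
    }

    String get(String key) {
        return values.get(key);
    }

    // Typed convenience: the value is still stored as a String underneath.
    void setInt(String key, int value) {
        values.put(key, Integer.toString(value));
    }

    // Returns null when the key is absent or the stored String is not an integer.
    Integer getInt(String key) {
        String raw = values.get(key);
        if (raw == null) {
            return null;
        }
        try {
            return Integer.valueOf(raw.trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }
}
```

Callers that only want Strings keep using get(), while callers that want a
number can opt in to getInt(); nothing about the stored form changes.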
Based on the phone number example, this looks somewhat like the streams-style
indexed metadata that we've been discussing for video and audio, eg "video
stream 1 has width 640 + height 480, video stream 2 has width 320 + height 240,
audio stream 1 is stereo + 44.1kHz + english" etc.
Maybe we should work to finish that indexed support off? We'd then keep strings
everywhere in the metadata, we'd keep backwards compatibility, and we'd keep
things consistent between different styles of metadata (video, audio, phone
etc!)
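To make the indexed idea concrete, here is a small Java sketch that keeps
every key and value a String and encodes the index in the key. The
"group:index:field" key scheme and the IndexedMetadata class are assumptions
made up for this example; the naming discussed in the thread may well differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical key scheme "<group>:<index>:<field>"; everything stays a String.
final class IndexedMetadata {
    private final Map<String, String> values = new LinkedHashMap<>();

    static String key(String group, int index, String field) {
        return group + ":" + index + ":" + field;
    }

    void set(String group, int index, String field, String value) {
        values.put(key(group, index, field), value);
    }

    String get(String group, int index, String field) {
        return values.get(key(group, index, field));
    }

    // Collect all fields of one indexed item, sorted by field name.
    Map<String, String> fieldsOf(String group, int index) {
        String prefix = group + ":" + index + ":";
        Map<String, String> out = new TreeMap<>();
        for (Map.Entry<String, String> e : values.entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                out.put(e.getKey().substring(prefix.length()), e.getValue());
            }
        }
        return out;
    }
}
```

With this scheme, set("video", 2, "width", "320") and
set("video", 2, "height", "240") are two ordinary String entries, and
fieldsOf("video", 2) gathers them back into one view. The same pattern would
apply to a "phone" group with fields such as a country code.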
The thread "How should video files with audio be handled by parsers?" from last
summer outlines a plan; [~rgauss] was going to try to prototype it first
before committing. (That thread already has an example of how contact cards
with phone number based details might work, which ought to cover your
additional phone number details too!)
> Introduce new HashMap<String, Object> data structure for persistence of Tika
> Metadata
> -------------------------------------------------------------------------------------
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
> Issue Type: Improvement
> Components: core, metadata
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Critical
> Fix For: 1.9
>
>
> I am currently working on more comprehensive extraction and on enhancing
> Tika's support for phone number extraction and metadata modeling.
> Right now we utilize the String[] multivalued support available within Tika
> to persist phone numbers as
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the
> String[] paradigm by implementing a more abstract Collection of Objects such
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String: Object
> {code}
> Where Object could be a Collection<HashMap<String/Property,
> HashMap<String/Property, String/Int/Long>>> e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US),
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054:
> (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...),
> (etc)]
> {code}
> There are obvious backwards compatibility issues with this approach...
> additionally it is a fundamental change to the core Metadata API. I hope that
> the <String, Object> Mapping however is flexible enough to allow me to model
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis
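For reference, the structure proposed in the issue could be sketched in Java
roughly as follows. This is only an illustration under assumptions: the class
ObjectValuedMetadataSketch and its methods are hypothetical, the attribute
values are Strings only, and the numbers and attribute names are copied from
the example in the issue.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a <String, Object> metadata map; not an existing Tika class.
final class ObjectValuedMetadataSketch {

    // One entry: phone number -> (attribute name -> attribute value).
    static Map<String, Map<String, String>> entry(String number, String countryCode, String numberType) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("LibPN-CountryCode", countryCode);
        attrs.put("LibPN-NumberType", numberType);
        Map<String, Map<String, String>> e = new LinkedHashMap<>();
        e.put(number, attrs);
        return e;
    }

    static Map<String, Object> build() {
        List<Map<String, Map<String, String>>> numbers = new ArrayList<>();
        numbers.add(entry("+162648743476", "US", "International"));
        numbers.add(entry("+1292611054", "UK", "International"));
        Map<String, Object> metadata = new LinkedHashMap<>();
        metadata.put("phonenumbers", numbers);
        return metadata;
    }

    // Readers must know the runtime type and cast; String-only callers cannot do this.
    @SuppressWarnings("unchecked")
    static String countryCodeOf(Map<String, Object> metadata, String number) {
        Object raw = metadata.get("phonenumbers");
        if (!(raw instanceof List)) {
            return null;
        }
        for (Map<String, Map<String, String>> e : (List<Map<String, Map<String, String>>>) raw) {
            Map<String, String> attrs = e.get(number);
            if (attrs != null) {
                return attrs.get("LibPN-CountryCode");
            }
        }
        return null;
    }
}
```

The unchecked cast in countryCodeOf is where the backwards compatibility
concern shows up: any existing caller that expects a String or String[] value
would need to learn the new shape.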