[
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14655279#comment-14655279
]
Nick Burch commented on TIKA-1607:
----------------------------------
I don't see any problems with deprecating the setters that take String values;
we do want to force parsers to do the extra work to give typed values if
possible. Not sure we need to deprecate the String getters though. The nice
thing about the new system is that downstream users can stick with simple
strings if they want, or opt into richer typed values otherwise.
Approach wise though, it does seem to be going a different way to what Ray
proposed with his plan for "How should video files with audio be handled by
parsers?". That put the nesting work onto the property, not the value. I might
be wrong, but it would seem that for the phone number case Lewis mentioned in
the bug description, this would go for
{code}
// contact:phone_number
metadata.set(PhoneNumber, blobOfStuff);
...
String number = metadata.getMetadataValue(PhoneNumber).getNumber();
String country = metadata.getMetadataValue(PhoneNumber).getCountryCode();
{code}
While Ray's approach would be more work on the parser side, giving something
more like
{code}
// contact:phone_number
metadata.set(PhoneNumber, phoneValueStr);
// contact:phone_number/countryCode
metadata.set(PhoneNumber.countryCode(), getCountryCode(phoneValueStr));
String number = metadata.get(PhoneNumber);
// or
number = metadata.getMetadataValue(PhoneNumber).getString();
String country = metadata.get(PhoneNumber.countryCode());
{code}
In Ray's model, to check if the country code decoration or the 2nd audio track
channel count was available, you'd check for the presence/absence of an
extension property. In this model, you'd get a MetadataValue, then see if it
was a base one or an extended one, then cast it, then fetch child values, then
check them, cast them, fetch next child etc.
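To make the read-side difference concrete, here's a rough sketch of both checks for the phone number case. Everything in it is hypothetical (the MetadataValue and PhoneNumberValue classes, the key strings, the helper names) and a plain Map stands in for Metadata; none of it is existing Tika API, it's just to show the shape of the two lookups.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical names throughout; a plain Map stands in for Tika's Metadata.
final class PhoneModels {

    // Model A: the nesting lives on the value.
    static class MetadataValue {
        private final String text;
        MetadataValue(String text) { this.text = text; }
        String getString() { return text; }
    }

    // Extended value that carries the country code as a child field.
    static final class PhoneNumberValue extends MetadataValue {
        private final String countryCode;
        PhoneNumberValue(String number, String countryCode) {
            super(number);
            this.countryCode = countryCode;
        }
        String getCountryCode() { return countryCode; }
    }

    // The reader has to test the runtime type and cast before it can
    // reach the child value.
    static Optional<String> countryFromValue(Map<String, MetadataValue> metadata, String key) {
        MetadataValue value = metadata.get(key);
        if (value instanceof PhoneNumberValue) {
            return Optional.ofNullable(((PhoneNumberValue) value).getCountryCode());
        }
        return Optional.empty();
    }

    // Model B: the nesting lives on the property name.
    static final String PHONE = "contact:phone_number";
    static final String PHONE_COUNTRY = PHONE + "/countryCode";

    // The reader only checks whether the extension property is present.
    static Optional<String> countryFromProperty(Map<String, String> metadata) {
        return Optional.ofNullable(metadata.get(PHONE_COUNTRY));
    }
}
```

In the second form a consumer that only wants the plain string never sees the extension keys unless it asks for them, while in the first form every deeper level means another type check and cast.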
Maybe the next step is for both Tim and myself (since Ray seems busy) to write
how, in our respective models, we'd encode + read back Lewis's phone numbers
with country codes, and the channel count + type + sample rate of the two audio
tracks for a video. Let's see some code! :)
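As a starting point for the audio track half of that, here's a minimal sketch of how the extension-property style might look. The key layout ("video:audio_track/<index>/<field>"), the field names and the helper methods are all my assumptions for illustration, not something Ray's proposal or Tika defines, and a plain Map again stands in for Metadata.

```java
import java.util.Map;
import java.util.OptionalInt;

// Hypothetical key scheme: base property, then track index, then field.
final class AudioTrackKeys {
    static final String BASE = "video:audio_track";

    // Builds e.g. "video:audio_track/1/channelCount".
    static String key(int track, String field) {
        return BASE + "/" + track + "/" + field;
    }

    // Parser side: one plain string entry per decoration of the track.
    static void putTrack(Map<String, String> metadata, int track,
                         int channelCount, String channelType, int sampleRate) {
        metadata.put(key(track, "channelCount"), Integer.toString(channelCount));
        metadata.put(key(track, "channelType"), channelType);
        metadata.put(key(track, "sampleRate"), Integer.toString(sampleRate));
    }

    // Reader side: absence of the extension property means "not available".
    static OptionalInt channelCount(Map<String, String> metadata, int track) {
        String raw = metadata.get(key(track, "channelCount"));
        return raw == null ? OptionalInt.empty() : OptionalInt.of(Integer.parseInt(raw));
    }
}
```

With two tracks stored this way, asking for the channel count of a third track just comes back empty, with no casting involved.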
> Introduce new arbitrary object key/values data structure for persistence of
> Tika Metadata
> -----------------------------------------------------------------------------------------
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
> Issue Type: Improvement
> Components: core, metadata
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Critical
> Fix For: 1.10
>
> Attachments: TIKA-1607v1_rough_rough.patch,
> TIKA-1607v2_rough_rough.patch
>
>
> I am currently working on implementing more comprehensive extraction and
> enhancement of the Tika support for Phone number extraction and metadata
> modeling.
> Right now we utilize the String[] multivalued support available within Tika
> to persist phone numbers as
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the
> String[] paradigm by implementing a more abstract Collection of Objects such
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String: Object
> {code}
> Where Object could be a Collection<HashMap<String/Property,
> HashMap<String/Property, String/Int/Long>>>, e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US),
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054:
> (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...),
> (etc)]
> {code}
> There are obvious backwards compatibility issues with this approach...
> additionally it is a fundamental change to the core Metadata API. I hope that
> the <String, Object> Mapping however is flexible enough to allow me to model
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis