[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14655279#comment-14655279
 ] 

Nick Burch commented on TIKA-1607:
----------------------------------

I don't see any problems with deprecating the setters that take String values; 
we do want to force parsers to do the extra work to give typed values if 
possible. Not sure we need to deprecate the String getters though. The nice 
thing about the new system is that downstream users can stick with simple 
strings if they want, or opt into richer typed values if they don't.

Approach wise though, it does seem to be going a different way to what Ray 
proposed with his plan for "How should video files with audio be handled by 
parsers?". That put the nesting work onto the property, not the value. I might 
be wrong, but it would seem that for the phone number case Lewis mentioned in 
the bug description, the current patch would give something like
{code}
// contact:phone_number
metadata.set(PhoneNumber, blobOfStuff);
...
String number = metadata.getMetadataValue(PhoneNumber).getNumber();
String country = metadata.getMetadataValue(PhoneNumber).getCountryCode();
{code}
Ray's approach, by contrast, would be more work on the parser side, giving 
something more like
{code}
// contact:phone_number
metadata.set(PhoneNumber, phoneValueStr);
// contact:phone_number/countryCode
metadata.set(PhoneNumber.countryCode(), getCountryCode(phoneValueStr));

String number = metadata.get(PhoneNumber);
// or
number = metadata.getMetadataValue(PhoneNumber).getString();
String country = metadata.get(PhoneNumber.countryCode());
{code}

In Ray's model, to check if the country code decoration or the 2nd audio track 
channel count was available, you'd check for the presence/absence of an 
extension property. In this model, you'd get a MetadataValue, then see if it 
was a base one or an extended one, then cast it, then fetch child values, then 
check them, cast them, fetch the next child, etc.
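
To make that read-side difference concrete, here's a rough, self-contained sketch 
of the country code lookup in both styles. Everything in it is a stand-in: the 
plain Maps, the MetadataValue / PhoneNumberValue classes and the two helper 
methods are hypothetical and are not the API from either patch or from Tika 
itself. Only the two key names come from the snippets above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-ins for illustration only; not Tika's actual classes.
final class PhoneModels {

    /** Value-nesting model: a base value that only knows its string form. */
    static class MetadataValue {
        private final String text;
        MetadataValue(String text) { this.text = text; }
        String getString() { return text; }
    }

    /** Value-nesting model: an extended value carrying a child field. */
    static final class PhoneNumberValue extends MetadataValue {
        private final String countryCode; // null when the parser could not tell
        PhoneNumberValue(String number, String countryCode) {
            super(number);
            this.countryCode = countryCode;
        }
        String getCountryCode() { return countryCode; }
    }

    /** Property-extension model: just check whether the extension key is present. */
    static Optional<String> countryFromProperties(Map<String, String> metadata) {
        return Optional.ofNullable(metadata.get("contact:phone_number/countryCode"));
    }

    /** Value-nesting model: fetch the value, test its type, cast, then read the child. */
    static Optional<String> countryFromValue(Map<String, MetadataValue> metadata) {
        MetadataValue value = metadata.get("contact:phone_number");
        if (value instanceof PhoneNumberValue) {
            return Optional.ofNullable(((PhoneNumberValue) value).getCountryCode());
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, String> flat = new HashMap<>();
        flat.put("contact:phone_number", "+162648743476");
        flat.put("contact:phone_number/countryCode", "US");

        Map<String, MetadataValue> nested = new HashMap<>();
        nested.put("contact:phone_number", new PhoneNumberValue("+162648743476", "US"));

        System.out.println(countryFromProperties(flat).orElse("unknown"));
        System.out.println(countryFromValue(nested).orElse("unknown"));
    }
}
```

Both lookups give the same answer here; the difference is only in how many 
type checks and casts the caller has to do, and that gets deeper with each 
level of nesting (e.g. per-track audio details).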

Maybe the next step is for both Tim and myself (since Ray seems busy) to write 
how, in our respective models, we'd encode + read back Lewis's phone numbers 
with country codes, and the channel count + type + sample rate of the two audio 
tracks for a video. Let's see some code! :)

> Introduce new arbitrary object key/values data structure for persistence of 
> Tika Metadata
> -----------------------------------------------------------------------------------------
>
>                 Key: TIKA-1607
>                 URL: https://issues.apache.org/jira/browse/TIKA-1607
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, metadata
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Critical
>             Fix For: 1.10
>
>         Attachments: TIKA-1607v1_rough_rough.patch, 
> TIKA-1607v2_rough_rough.patch
>
>
> I am currently working on implementing more comprehensive extraction and 
> enhancement of the Tika support for Phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection<HashMap<String/Property, 
> HashMap<String/Property, String/Int/Long>>> e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...), 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the core Metadata API. I hope that 
> the <String, Object> Mapping however is flexible enough to allow me to model 
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
