[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15148608#comment-15148608
 ] 

Tim Allison edited comment on TIKA-1607 at 2/16/16 1:46 PM:
------------------------------------------------------------

I'd like to turn something like the above "thought" into a proposal...

Over on TIKA-1857, [~pascal.essiembre] has opened a pull request to strip XFA 
contents into the ContentHandler for PDFDocuments.

It might be more elegant to store the XFA in the metadata object and let 
consumers process that stream.

Would anyone object to adding two new {{ValueType}}s of Property: BYTES and 
DOM?

Both would be stored as String values (base-64 encoded {{byte[]}}) in the 
regular {{Metadata}} object.

Similar to what we're doing with {{getDate()}} in the {{Metadata}} object, 
we'd add a {{getBytes(Property binaryProperty)}} that would return a decoded 
{{byte[]}}, and we could also add a {{getDOM(Property domProperty)}} that would 
return a {{org.w3c.dom.Document}}.
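
To make the idea concrete, here is a rough sketch of how the base-64 round trip could look. It deliberately does not touch the real {{Metadata}} or {{Property}} classes: {{MetadataSketch}}, its String keys, and the {{pdf:xfa}} key name are all hypothetical stand-ins, and only the {{getBytes}}/{{getDOM}} method names come from the proposal above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical stand-in for Tika's Metadata; NOT the real class or its API.
final class MetadataSketch {
    // Everything is still stored as a plain String, as in the proposal.
    private final Map<String, String> values = new HashMap<>();

    // Store raw bytes as a base-64 encoded String value.
    void setBytes(String name, byte[] bytes) {
        values.put(name, Base64.getEncoder().encodeToString(bytes));
    }

    // The String view that existing consumers would continue to see.
    String get(String name) {
        return values.get(name);
    }

    // Decode the stored String back into bytes; null if the key is absent.
    byte[] getBytes(String name) {
        String encoded = values.get(name);
        return encoded == null ? null : Base64.getDecoder().decode(encoded);
    }

    // Decode the stored bytes and parse them as XML; null if the key is absent.
    Document getDOM(String name) {
        byte[] bytes = getBytes(name);
        if (bytes == null) {
            return null;
        }
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new ByteArrayInputStream(bytes));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Value of '" + name + "' is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        MetadataSketch metadata = new MetadataSketch();
        // "pdf:xfa" is a made-up key name, used only for this illustration.
        byte[] xfa = "<xfa><field name=\"a\">1</field></xfa>".getBytes(StandardCharsets.UTF_8);
        metadata.setBytes("pdf:xfa", xfa);
        Document dom = metadata.getDOM("pdf:xfa");
        System.out.println(dom.getDocumentElement().getTagName());
    }
}
```

A real implementation would presumably key off {{Property}} rather than String and would want to harden the XML parser against external entities before parsing untrusted content.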

We could also store raw XMP by this mechanism.

Is this a reasonable first (half) step towards this issue?  Any objections?


> Introduce new arbitrary object key/values data structure for persistence of 
> Tika Metadata
> -----------------------------------------------------------------------------------------
>
>                 Key: TIKA-1607
>                 URL: https://issues.apache.org/jira/browse/TIKA-1607
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, metadata
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Critical
>             Fix For: 1.13
>
>         Attachments: TIKA-1607v1_rough_rough.patch, 
> TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch
>
>
> I am currently working on implementing more comprehensive extraction and 
> enhancement of the Tika support for Phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection<HashMap<String/Property, 
> HashMap<String/Property, String/Int/Long>>> e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...), 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the core Metadata API. I hope that 
> the <String, Object> Mapping however is flexible enough to allow me to model 
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis
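
For reference, the nested phone number structure in the quoted description could be modeled roughly as below. This is only a sketch under assumptions: {{ObjectMetadataSketch}} and its {{set}}/{{get}} methods are hypothetical and are not the existing Tika {{Metadata}} API; the key and attribute names are taken from the example in the description.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical <String, Object> metadata holder; NOT the real Tika Metadata class.
final class ObjectMetadataSketch {
    private final Map<String, Object> values = new LinkedHashMap<>();

    void set(String name, Object value) {
        values.put(name, value);
    }

    Object get(String name) {
        return values.get(name);
    }

    // Build one entry of the form: number -> (attribute name -> attribute value).
    static Map<String, Map<String, Object>> phoneNumber(
            String number, String countryCode, String numberType) {
        Map<String, Object> attributes = new LinkedHashMap<>();
        attributes.put("LibPN-CountryCode", countryCode);
        attributes.put("LibPN-NumberType", numberType);
        Map<String, Map<String, Object>> entry = new LinkedHashMap<>();
        entry.put(number, attributes);
        return entry;
    }

    public static void main(String[] args) {
        // The collection of per-number maps from the description.
        List<Map<String, Map<String, Object>>> numbers = new ArrayList<>();
        numbers.add(phoneNumber("+162648743476", "US", "International"));
        numbers.add(phoneNumber("+1292611054", "UK", "International"));

        ObjectMetadataSketch metadata = new ObjectMetadataSketch();
        metadata.set("phonenumbers", numbers);
        System.out.println(metadata.get("phonenumbers"));
    }
}
```

The cost of the {{Object}} value type is visible here: callers have to know, and cast to, the concrete shape stored under each key, which is part of the backwards compatibility concern raised in the description.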



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
