[
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703924#comment-14703924
]
Ray Gauss II commented on TIKA-1607:
------------------------------------
I've put together the start of the DOM metadata store option on [GitHub as
well|https://github.com/apache/tika/compare/trunk...rgauss:trunk].
The crux of the change is using a {{org.w3c.dom.Document}} object instead of a
{{Map<String, String[]>}} as the metadata store and Property objects based on
{{QName}}s instead of Strings.
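To make the idea concrete, here is a minimal sketch of a DOM-backed store keyed
by {{QName}}, written against plain JAXP only. The class {{DomStoreSketch}} and
its {{add}}/{{getValues}} methods are hypothetical names made up for this
illustration; they are not the Metadata API in the branch, and the Dublin Core
namespace in the usage note is just a convenient example.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical illustration only; not the Metadata class from the branch.
class DomStoreSketch {
    private final Document doc;
    private final Element root;

    DomStoreSketch() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            doc = factory.newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        // A single root element holds every metadata value as a child element.
        root = doc.createElementNS("http://tika.apache.org/", "tika:metadata");
        doc.appendChild(root);
    }

    // Adds one value; calling this repeatedly with the same QName yields a multi-valued property.
    void add(QName property, String value) {
        String prefix = property.getPrefix();
        String qualifiedName = prefix.isEmpty()
                ? property.getLocalPart()
                : prefix + ":" + property.getLocalPart();
        Element element = doc.createElementNS(property.getNamespaceURI(), qualifiedName);
        element.setTextContent(value);
        root.appendChild(element);
    }

    // Returns the text of every element matching the property's namespace and local name.
    List<String> getValues(QName property) {
        NodeList nodes = root.getElementsByTagNameNS(
                property.getNamespaceURI(), property.getLocalPart());
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent());
        }
        return values;
    }
}
```

Usage would look like {{store.add(new QName("http://purl.org/dc/elements/1.1/",
"title", "dc"), "Some title")}}. Keying on namespace plus local name rather than
a bare String is what lets two standards reuse a name like "title" without
colliding.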
A few things to note:
* This does bring in commons-lang for XML escaping; we could change that if need be
* It seems mostly backwards compatible. tika-xmp is failing at the moment, but
I think it's just a matter of applying the same techniques there
* String-based accessors weren't deprecated, but could be if targeting Tika 2.0
* There are several TODOs that would still need to be addressed
The [test
added|https://github.com/rgauss/tika/blob/trunk/tika-core/src/test/java/org/apache/tika/metadata/TestMetadata.java#L394]
demonstrates creating a DOM structure, adding it to the metadata, then pulling
it out both programmatically and via XPath expression (sticking to the
telephone number example).
That programmatic creation of the DOM structure is a bit cumbersome and we
could certainly employ Java classes specific to each standard as a convenience
(somewhat similar to [[email protected]]'s proposal), but I do like the
generic nature of the DOM store.
The {{toString}} method of the metadata object after building that example is
properly structured and namespaced XML:
{code:xml}
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<tika:metadata xmlns:tika="http://tika.apache.org/">
  <vcard:tel xmlns:vcard="urn:ietf:params:xml:ns:vcard-4.0">
    <vcard:parameters>
      <vcard:type>
        <vcard:text>work</vcard:text>
      </vcard:type>
    </vcard:parameters>
    <vcard:uri>tel:+1-800-555-1234</vcard:uri>
  </vcard:tel>
</tika:metadata>
{code}
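For readers who don't want to open the test, the following is a rough,
standalone sketch of the same round trip using only JAXP: build the telephone
structure above with {{createElementNS}}, then read it back with a
namespace-aware XPath expression. The class {{VCardTelSketch}} and its helper
methods are hypothetical and do not appear in the branch; only the two
namespace URIs and element names are taken from the XML above.

```java
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical illustration only; not the test or Metadata code from the branch.
class VCardTelSketch {
    static final String TIKA_NS = "http://tika.apache.org/";
    static final String VCARD_NS = "urn:ietf:params:xml:ns:vcard-4.0";

    // Builds tika:metadata/vcard:tel with a type parameter and a uri value.
    static Document buildTel(String type, String uri) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().newDocument();
            Element root = doc.createElementNS(TIKA_NS, "tika:metadata");
            doc.appendChild(root);
            Element tel = child(doc, root, "tel");
            Element parameters = child(doc, tel, "parameters");
            Element typeElement = child(doc, parameters, "type");
            child(doc, typeElement, "text").setTextContent(type);
            child(doc, tel, "uri").setTextContent(uri);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Creates a vcard-namespaced element and appends it to the parent.
    private static Element child(Document doc, Element parent, String localName) {
        Element element = doc.createElementNS(VCARD_NS, "vcard:" + localName);
        parent.appendChild(element);
        return element;
    }

    // Evaluates an XPath expression to a string, resolving the tika and vcard prefixes.
    static String evaluate(Document doc, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            @Override
            public String getNamespaceURI(String prefix) {
                if ("tika".equals(prefix)) return TIKA_NS;
                if ("vcard".equals(prefix)) return VCARD_NS;
                return XMLConstants.NULL_NS_URI;
            }

            @Override
            public String getPrefix(String namespaceURI) {
                return null; // not needed for XPath evaluation
            }

            @Override
            public Iterator<String> getPrefixes(String namespaceURI) {
                return Collections.emptyIterator(); // not needed for XPath evaluation
            }
        });
        try {
            return xpath.evaluate(expression, doc);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With a document from {{buildTel("work", "tel:+1-800-555-1234")}}, an expression
such as
{{/tika:metadata/vcard:tel[vcard:parameters/vcard:type/vcard:text='work']/vcard:uri}}
selects the work number. The five {{child}} calls for a single phone number
illustrate why per-standard convenience classes might still be worth having on
top of the generic store.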
There's obviously lots of room for improvement and discussion but I wanted to
put it out there before the momentum on this slows.
> Introduce new arbitrary object key/values data structure for persistence of
> Tika Metadata
> -----------------------------------------------------------------------------------------
>
> Key: TIKA-1607
> URL: https://issues.apache.org/jira/browse/TIKA-1607
> Project: Tika
> Issue Type: Improvement
> Components: core, metadata
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Critical
> Fix For: 1.11
>
> Attachments: TIKA-1607v1_rough_rough.patch,
> TIKA-1607v2_rough_rough.patch, TIKA-1607v3.patch
>
>
> I am currently working on implementing more comprehensive extraction and
> enhancement of the Tika support for Phone number extraction and metadata
> modeling.
> Right now we utilize the String[] multivalued support available within Tika
> to persist phone numbers as
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the
> String[] paradigm by implementing a more abstract Collection of Objects such
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String: Object
> {code}
> Where Object could be a Collection<HashMap<String/Property,
> HashMap<String/Property, String/Int/Long>>> e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US),
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054:
> (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...),
> (etc)]
> {code}
> There are obvious backwards compatibility issues with this approach...
> additionally it is a fundamental change to the core Metadata API. I hope that
> the <String, Object> Mapping however is flexible enough to allow me to model
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)