[ https://issues.apache.org/jira/browse/TIKA-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Giuseppe Totaro updated TIKA-1691:
----------------------------------
    Attachment: mapping_example.pdf

> Apache Tika for enabling metadata interoperability
> --------------------------------------------------
>
>                 Key: TIKA-1691
>                 URL: https://issues.apache.org/jira/browse/TIKA-1691
>             Project: Tika
>          Issue Type: New Feature
>            Reporter: Giuseppe Totaro
>            Assignee: Giuseppe Totaro
>              Labels: mapping, metadata
>         Attachments: mapping_example.pdf
>
>
> If I am not wrong, consistent metadata across file formats is already
> (partially) supported in Tika through {{TikaCoreProperties}} and,
> within the context of Solr, {{ExtractingRequestHandler}} (by defining how to
> map metadata fields in {{solrconfig.xml}}). However, I am working on a new
> component for both schema mapping (to operate on the name of metadata
> properties) and instance transformation (to operate on the value of metadata)
> that consists, essentially, of the following changes:
> * A wrapper around the {{Metadata}} object ({{MappedMetadata.java}}) that
> decorates the {{set}} method (currently, line number 367 of
> {{Metadata.java}}) by applying the given mapping functions (via
> configuration) before setting metadata properties.
> * Basic mapping functions ({{BasicMappingUtils.java}}), that is, utility
> methods to map a set of metadata to the target schema.
> * A new {{MetadataConfig}} object that, like {{TikaConfig}}, may be
> configured via an XML file (organized as shown in the following snippet) and
> allows fine-grained metadata mapping by using Java reflection.
> {code:xml|title=tika-metadata.xml|borderStyle=solid}
> <?xml version="1.0" encoding="UTF-8" standalone="no"?>
> <properties>
>   <mappings>
>     <mapping type="type/sub-type">
>       <relation name="SOURCE_FIELD">
>         <target>TARGET_FIELD</target>
>         <expression>exclude|include|equivalent|overlap</expression>
>         <function name="FUNCTION_NAME">
>           <argument>ARGUMENT_VALUE</argument>
>         </function>
>         <cardinality>
>           <source>SOURCE_CARDINALITY</source>
>           <target>TARGET_CARDINALITY</target>
>           <order>ORDER_NUMBER</order>
>           <dependencies>
>             <field>FIELD_NAME</field>
>           </dependencies>
>         </cardinality>
>       </relation>
>     </mapping>
>     ...
>     <mapping> <!-- This contains the fallback strategy for unknown metadata -->
>       <relation>
>         ...
>       </relation>
>     </mapping>
>   </mappings>
> </properties>
> {code}
> The theoretical definition of metadata mapping is available in
> "[A survey of techniques for achieving metadata interoperability|http://www.researchgate.net/profile/Bernhard_Haslhofer/publication/220566013_A_survey_of_techniques_for_achieving_metadata_interoperability/links/02e7e533e76187c0b8000000.pdf]".
> The paper also shows some basic examples of metadata mappings.
> I am still working on some core functionality, but I have already
> performed some experiments using a small prototype.
> By the way, I think that we should modify the {{add}} method to use
> {{set}} instead of {{metadata.put}} (currently, line number 316 of
> {{Metadata.java}}). This is a trivial change (I could create a new Jira issue
> for it), but it would make this method consistent with the other
> implementation of {{add}} and, moreover, would make the methods of
> {{Metadata}} easier to extend.
> I would really appreciate your feedback on this proposal. If you believe
> that it is a good idea, I could provide the code in a few days.
> Thanks a lot,
> Giuseppe
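
To make the wrapper idea in the description concrete, here is a minimal sketch of a decorated {{set}} that renames a field and transforms its value before storing it. It is an illustration under stated assumptions, not the proposed patch: {{SimpleMetadata}} is a reduced stand-in for Tika's {{Metadata}} class, and {{FieldMapping}}, {{MappedMetadataSketch}} and {{addMapping}} are hypothetical names. Mappings are registered in code here rather than loaded from {{tika-metadata.xml}}.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Stand-in for Tika's Metadata, reduced to single-valued set/get.
// This is NOT the real org.apache.tika.metadata.Metadata class.
class SimpleMetadata {
    private final Map<String, String> values = new LinkedHashMap<>();

    public void set(String name, String value) {
        values.put(name, value);
    }

    public String get(String name) {
        return values.get(name);
    }
}

// Hypothetical mapping rule: a target field name (schema mapping)
// plus a function applied to the value (instance transformation).
class FieldMapping {
    final String target;
    final UnaryOperator<String> function;

    FieldMapping(String target, UnaryOperator<String> function) {
        this.target = target;
        this.function = function;
    }
}

// Hypothetical wrapper: applies the registered mapping before delegating to set.
class MappedMetadataSketch extends SimpleMetadata {
    private final Map<String, FieldMapping> mappings = new HashMap<>();

    void addMapping(String source, String target, UnaryOperator<String> function) {
        mappings.put(source, new FieldMapping(target, function));
    }

    @Override
    public void set(String name, String value) {
        FieldMapping mapping = mappings.get(name);
        if (mapping == null) {
            // Assumed fallback for unknown fields: store name and value unchanged.
            super.set(name, value);
        } else {
            super.set(mapping.target, mapping.function.apply(value));
        }
    }
}
```

For example, after {{addMapping("Author", "dc:creator", String::trim)}}, a call to {{set("Author", "  Jane Doe ")}} stores the trimmed value under {{dc:creator}}. A decorator like this only sees writes that pass through {{set}}, which is why the description suggests that {{add}} should call {{set}} rather than {{metadata.put}}.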