[ https://issues.apache.org/jira/browse/TIKA-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Giuseppe Totaro updated TIKA-1691:
----------------------------------
    Attachment: mapping_example.pdf

> Apache Tika for enabling metadata interoperability
> --------------------------------------------------
>
>                 Key: TIKA-1691
>                 URL: https://issues.apache.org/jira/browse/TIKA-1691
>             Project: Tika
>          Issue Type: New Feature
>            Reporter: Giuseppe Totaro
>            Assignee: Giuseppe Totaro
>              Labels: mapping, metadata
>         Attachments: mapping_example.pdf
>
>
> If I am not wrong, support for consistent metadata across file formats is
> already (partially) provided in Tika by relying on {{TikaCoreProperties}} and,
> within the context of Solr, {{ExtractingRequestHandler}} (by defining how to 
> map metadata fields in {{solrconfig.xml}}). However, I am working on a new 
> component for both schema mapping (to operate on the name of metadata 
> properties) and instance transformation (to operate on the value of metadata) 
> that consists, essentially, of the following changes:
> * A wrapper around the {{Metadata}} object ({{MappedMetadata.java}}) that decorates
> the {{set}} method (currently, line number 367 of {{Metadata.java}}) by 
> applying the given mapping functions (via configuration) before setting 
> metadata properties.
> * Basic mapping functions ({{BasicMappingUtils.java}}) that are utility 
> methods to map a set of metadata to the target schema.
> * A new {{MetadataConfig}} object that, like {{TikaConfig}}, may be
> configured via an XML file (organized as shown in the following snippet) and
> allows fine-grained metadata mapping by using Java reflection.
> {code:xml|title=tika-metadata.xml|borderStyle=solid}
> <?xml version="1.0" encoding="UTF-8" standalone="no"?>
> <properties>
>   <mappings>
>     <mapping type="type/sub-type">
>       <relation name="SOURCE_FIELD">
>         <target>TARGET_FIELD</target>
>         <expression>exclude|include|equivalent|overlap</expression>
>         <function name="FUNCTION_NAME">
>           <argument>ARGUMENT_VALUE</argument>
>         </function>
>         <cardinality>
>           <source>SOURCE_CARDINALITY</source>
>           <target>TARGET_CARDINALITY</target>
>           <order>ORDER_NUMBER</order>
>           <dependencies>
>             <field>FIELD_NAME</field>
>           </dependencies>
>         </cardinality>
>       </relation>
>     </mapping>
>     ...
>     <mapping> <!-- This contains the fallback strategy for unknown metadata -->
>       <relation>
>         ...
>       </relation>
>     </mapping>
>   </mappings>
> </properties>
> {code}
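The wrapper idea described above can be sketched as follows. This is only an illustration under stated assumptions, not the proposed {{MappedMetadata}} code: {{SimpleMetadata}} is a hypothetical single-valued stand-in for Tika's {{Metadata}}, {{Relation}} and {{addRelation}} are invented names, the field names are examples, and the pass-through fallback is one possible strategy for unknown fields.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for Tika's Metadata, reduced to single-valued set/get.
class SimpleMetadata {
    private final Map<String, String> values = new HashMap<>();

    public void set(String name, String value) {
        values.put(name, value);
    }

    public String get(String name) {
        return values.get(name);
    }
}

// One relation: the target field name plus a function applied to the value.
final class Relation {
    final String target;
    final UnaryOperator<String> function;

    Relation(String target, UnaryOperator<String> function) {
        this.target = target;
        this.function = function;
    }
}

// Decorates set(): renames the field (schema mapping) and rewrites the
// value (instance transformation) before storing it.
class MappedMetadataSketch extends SimpleMetadata {
    private final Map<String, Relation> relations = new HashMap<>();

    public void addRelation(String source, String target, UnaryOperator<String> function) {
        relations.put(source, new Relation(target, function));
    }

    @Override
    public void set(String name, String value) {
        Relation relation = relations.get(name);
        if (relation == null) {
            // Assumed fallback strategy: unknown fields pass through unchanged.
            super.set(name, value);
            return;
        }
        String mapped = (value == null) ? null : relation.function.apply(value);
        super.set(relation.target, mapped);
    }
}

class MappedMetadataDemo {
    public static void main(String[] args) {
        MappedMetadataSketch metadata = new MappedMetadataSketch();
        metadata.addRelation("Author", "dc:creator", String::trim);
        metadata.set("Author", "  Jane Doe ");  // stored under dc:creator, trimmed
        metadata.set("pages", "12");            // no relation: stored as-is
        System.out.println(metadata.get("dc:creator"));
        System.out.println(metadata.get("pages"));
    }
}
```

In this sketch the relations are registered in code; in the proposal they would presumably come from the XML configuration, with the function looked up by name.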
> The theoretical definition of metadata mapping is available in "[A survey of 
> techniques for achieving metadata 
> interoperability|http://www.researchgate.net/profile/Bernhard_Haslhofer/publication/220566013_A_survey_of_techniques_for_achieving_metadata_interoperability/links/02e7e533e76187c0b8000000.pdf]".
> This paper also shows some basic examples of metadata mappings.
> Currently, I am still working on some core functionality, but I have 
> already run some experiments with a small prototype.
> By the way, I think that we should modify the {{add}} method so that it uses 
> {{set}} instead of {{metadata.put}} (currently, line number 316 of 
> {{Metadata.java}}). This is a trivial change (I could create a new Jira issue 
> for it), but it would make the method consistent with the other implementation 
> of {{add}} and, moreover, make the methods of {{Metadata}} easier to 
> extend.
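To illustrate why routing {{add}} through {{set}} helps extensibility, here is a minimal sketch. It does not reproduce Tika's {{Metadata}} code: {{MultiValueStore}} and {{UpperCaseStore}} are hypothetical classes, and the upper-casing is just a visible placeholder for any mapping a subclass might apply.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical multi-valued store; not Tika's Metadata implementation.
class MultiValueStore {
    private final Map<String, String[]> values = new HashMap<>();

    // Single write point: every mutation ends up here.
    public void set(String name, String[] newValues) {
        values.put(name, newValues);
    }

    // Appends by delegating to set() rather than writing to the map directly.
    public void add(String name, String value) {
        String[] current = values.get(name);
        if (current == null) {
            set(name, new String[] {value});
        } else {
            String[] extended = Arrays.copyOf(current, current.length + 1);
            extended[current.length] = value;
            set(name, extended);
        }
    }

    public String[] getValues(String name) {
        String[] current = values.get(name);
        return current == null ? new String[0] : current.clone();
    }
}

// Overriding set() alone is enough to affect add() as well.
class UpperCaseStore extends MultiValueStore {
    @Override
    public void set(String name, String[] newValues) {
        String[] upper = new String[newValues.length];
        for (int i = 0; i < newValues.length; i++) {
            upper[i] = newValues[i].toUpperCase(Locale.ROOT);
        }
        super.set(name, upper);
    }
}
```

If {{add}} wrote to the map directly in its append branch, a subclass like {{UpperCaseStore}} would also have to override {{add}} to see every write.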
> I would really appreciate your feedback about this proposal. If you believe 
> that it is a good idea, I could provide the code in a few days.
> Thanks a lot,
> Giuseppe



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
