[ 
https://issues.apache.org/jira/browse/TIKA-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653428#comment-14653428
 ] 

Nick Burch commented on TIKA-1691:
----------------------------------

We generally apply a higher bar to things going into Tika Core than 
sub-modules, in part because they have an immediately higher impact, and the 
rules on changing/deprecating/compatibility are stronger. That often means we 
need to ask more questions first!

One of the contracts of the Tika metadata system is that it should provide a 
format-agnostic view of the metadata as best as it can. You, as the end user of 
Tika, shouldn't need to know if one format calls it Author, one Creator, one 
Created By; Tika's parsers handle that mapping for you internally. If there are 
cases where Tika isn't doing that properly, we want to know! We should be 
adding more property definitions, and setting those mappings in all the 
parsers. That normalisation between formats is something Tika should do for 
everyone, not something that individual users should need to worry themselves 
with. If gaps exist, please raise tickets.
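
As a rough illustration of that normalisation, here is a minimal sketch in plain Java. It is not Tika's implementation: the class name, the alias table and the "dc:creator" key string are assumptions made for the example, standing in for a shared property of the kind defined in TikaCoreProperties.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only: this is not part of Tika's API.
final class CommonKeyNormalizer {
    // Assumed common key, standing in for a shared Tika property.
    static final String COMMON_CREATOR = "dc:creator";

    // Format-specific names (lower-cased) that should all end up under one key.
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        ALIASES.put("author", COMMON_CREATOR);
        ALIASES.put("creator", COMMON_CREATOR);
        ALIASES.put("created by", COMMON_CREATOR);
    }

    /** Returns the common key for a format-specific name, or the name unchanged if no alias is known. */
    static String normalize(String formatSpecificName) {
        String common = ALIASES.get(formatSpecificName.trim().toLowerCase(Locale.ROOT));
        return common != null ? common : formatSpecificName;
    }
}
```

The point is only that the alias table lives in one place, inside the parsers, so callers never see the per-format names.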

There are a number of downstream users of Tika Metadata who transform/translate 
the output. The Tika XMP module is one such, Alfresco's metadata extractor 
mapping is another, Jackrabbit has one too, Solr has one, etc. We have JSON 
serialisation as well. At least some of us would find it a bit odd to see 
Alfresco metadata properties, or Solr field definitions, inside what's held in 
the Tika Metadata object! All those projects seem to find it fine to read out 
metadata keys+values, and map them into their own model on their side. If there's 
another common downstream format, we should look to add a module / set of 
serialisation classes for that too. 
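
To make the "map it on their side" pattern concrete, here is a small sketch of a downstream consumer. Everything in it is hypothetical: the class, the method and the target field names are invented for the example, and the input map merely stands in for the keys and values read out of a Tika Metadata object.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical downstream consumer; none of these names come from Tika, Alfresco or Solr.
final class DownstreamMapper {
    /**
     * Copies extracted metadata into the consumer's own field names.
     * 'extracted' stands in for the keys and values read out of a Metadata object.
     * Keys with no entry in 'fieldNames' are skipped.
     */
    static Map<String, List<String>> toOwnModel(Map<String, List<String>> extracted,
                                                Map<String, String> fieldNames) {
        Map<String, List<String>> own = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : extracted.entrySet()) {
            String target = fieldNames.get(e.getKey());
            if (target == null) {
                continue; // not part of the consumer's model
            }
            own.computeIfAbsent(target, k -> new ArrayList<>()).addAll(e.getValue());
        }
        return own;
    }
}
```

The Metadata contents are never modified here; the consumer's vocabulary stays entirely in the consumer's code.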

Is there a use-case for runtime-specific downstream mappings, such that when 
you run it on one machine and/or dataset you want dc:subject to map to 
custom:Long_Title, but on another it's custom:Short_Title? If that's it, I 
could probably see the case for a runtime-configurable wrapper/serializer, but 
some more details on the use-case would be helpful, so we can make it easy to 
use / extend / integrate with / etc.
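
For discussion, this is roughly what I'd picture for the runtime-configurable case. It's a sketch under assumptions, not a proposed API: the class, the profile names and the way a profile is selected are all made up, and only the dc:subject / custom:Long_Title / custom:Short_Title keys come from the question above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the class, the profile names and the lookup are assumptions.
final class RuntimeKeyMapper {
    private final Map<String, String> renames;

    RuntimeKeyMapper(Map<String, String> renames) {
        this.renames = new HashMap<>(renames); // defensive copy
    }

    /** Builds a mapper for a named profile, e.g. one chosen per machine or per dataset. */
    static RuntimeKeyMapper forProfile(String profile) {
        Map<String, String> renames = new HashMap<>();
        if ("long".equals(profile)) {
            renames.put("dc:subject", "custom:Long_Title");
        } else if ("short".equals(profile)) {
            renames.put("dc:subject", "custom:Short_Title");
        }
        return new RuntimeKeyMapper(renames);
    }

    /** Returns the configured target key, or the key unchanged if no rename is configured. */
    String map(String key) {
        return renames.getOrDefault(key, key);
    }
}
```

A wrapper/serializer like this would sit outside the Metadata object and be applied when the output is written, so the core keys stay the same whichever profile is active.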

If there's something else you're trying to do, please could you explain the 
use-case some more? Possibly on a wiki page, if it gets too hard to do here, 
with some examples. We're not all doing exactly the same things, so what's 
obvious for one person might not be for another! We're not all rocket 
scientists here... ;-) If we can get it explained, then we can all help refine 
the design as needed, ensure it's as supported and widely usable as possible, 
and documented in a way that new community members can understand too!

(The mapping example given in the PDF looks to be something that Tika ought to 
be doing already, so if there are cases when it isn't then those are bugs!)

> Apache Tika for enabling metadata interoperability
> --------------------------------------------------
>
>                 Key: TIKA-1691
>                 URL: https://issues.apache.org/jira/browse/TIKA-1691
>             Project: Tika
>          Issue Type: New Feature
>            Reporter: Giuseppe Totaro
>            Assignee: Giuseppe Totaro
>              Labels: mapping, metadata
>         Attachments: mapping_example.pdf
>
>
> If I am not wrong, enabling consistent metadata across file formats is already 
> (partially) provided in Tika by relying on {{TikaCoreProperties}} and, 
> within the context of Solr, {{ExtractingRequestHandler}} (by defining how to 
> map metadata fields in {{solrconfig.xml}}). However, I am working on a new 
> component for both schema mapping (to operate on the name of metadata 
> properties) and instance transformation (to operate on the value of metadata) 
> that consists, essentially, of the following changes:
> * A wrapper of {{Metadata}} object ({{MappedMetadata.java}}) that decorates 
> the {{set}} method (currently, line number 367 of {{Metadata.java}}) by 
> applying the given mapping functions (via configuration) before setting 
> metadata properties.
> * Basic mapping functions ({{BasicMappingUtils.java}}) that are utility 
> methods to map a set of metadata to the target schema.
> * A new {{MetadataConfig}} object that, like {{TikaConfig}}, may be 
> configured via an XML file (organized as shown in the following snippet) and 
> allows fine-grained metadata mapping by using Java reflection.
> {code:xml|title=tika-metadata.xml|borderStyle=solid}
> <?xml version="1.0" encoding="UTF-8" standalone="no"?>
> <properties>
>   <mappings>
>     <mapping type="type/sub-type">
>       <relation name="SOURCE_FIELD">
>         <target>TARGET_FIELD</target>
>         <expression>exclude|include|equivalent|overlap</expression>
>         <function name="FUNCTION_NAME">
>           <argument>ARGUMENT_VALUE</argument>
>         </function>
>         <cardinality>
>           <source>SOURCE_CARDINALITY</source>
>           <target>TARGET_CARDINALITY</target>
>           <order>ORDER_NUMBER</order>
>           <dependencies>
>             <field>FIELD_NAME</field>
>           </dependencies>
>         </cardinality>
>       </relation>
>     </mapping>
>     ...
>     <mapping> <!-- This contains the fallback strategy for unknown metadata -->
>       <relation>
>         ...
>       </relation>
>     </mapping>
>   </mappings>
> </properties>
> {code}
> The theoretical definition of metadata mapping is available in "[A survey of 
> techniques for achieving metadata 
> interoperability|http://www.researchgate.net/profile/Bernhard_Haslhofer/publication/220566013_A_survey_of_techniques_for_achieving_metadata_interoperability/links/02e7e533e76187c0b8000000.pdf]".
> This paper also shows some basic examples of metadata mappings.
> Currently, I am still working on some core functionalities, but I have 
> already performed some experiments by using a small prototype.
> By the way, I think that we should modify the method {{add}} in order to use 
> {{set}} instead of {{metadata.put}} (currently, line number 316 of 
> {{Metadata.java}}). This is a trivial change (I could create a new Jira issue 
> about that), but it would allow to be coherent with the other implementation 
> of {{add}} method and, moreover, the methods of {{Metadata}} could be 
> extended more easily.
> I would really appreciate your feedback about this proposal. If you believe 
> that it is a good idea, I could provide the code in few days.
> Thanks a lot,
> Giuseppe


