Ray Gauss II created TIKA-1133:
----------------------------------
Summary: Ability to Allow Empty and Duplicate Tika Values for XML Elements
Key: TIKA-1133
URL: https://issues.apache.org/jira/browse/TIKA-1133
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.3
Reporter: Ray Gauss II
Assignee: Ray Gauss II
In some cases it is beneficial to allow empty and duplicate Tika metadata
values for multi-valued XML elements like RDF bags.
Consider an example where the original source metadata is structured something
like:
{code}
<Person>
<FirstName>John</FirstName>
<LastName>Smith</LastName>
</Person>
<Person>
<FirstName>Jane</FirstName>
<LastName>Doe</LastName>
</Person>
<Person>
<FirstName>Bob</FirstName>
</Person>
<Person>
<FirstName>Kate</FirstName>
<LastName>Smith</LastName>
</Person>
{code}
Since Tika stores only flat metadata, we transform that into something like the
following before invoking a parser:
{code}
<custom:FirstName>
<rdf:Bag>
<rdf:li>John</rdf:li>
<rdf:li>Jane</rdf:li>
<rdf:li>Bob</rdf:li>
<rdf:li>Kate</rdf:li>
</rdf:Bag>
</custom:FirstName>
<custom:LastName>
<rdf:Bag>
<rdf:li>Smith</rdf:li>
<rdf:li>Doe</rdf:li>
<rdf:li></rdf:li>
<rdf:li>Smith</rdf:li>
</rdf:Bag>
</custom:LastName>
{code}
The current behavior ignores empty and duplicate values, so the LastName bag
above yields only Smith and Doe, and we cannot tell whether Bob or Kate ever had
last names. Empty or duplicate values in other positions result in an incorrect
mapping of data.
We should add an option to create an {{ElementMetadataHandler}} that allows
empty and/or duplicate values.
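To illustrate why position matters, here is a minimal sketch in plain Java. It does not use Tika at all: the class {{BagValuesSketch}}, the method {{collect}}, and the two boolean flags are hypothetical names made up for this example, and they only model the two policies (drop vs. keep empties and duplicates) on the LastName bag above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model only: this is NOT Tika's ElementMetadataHandler.
public class BagValuesSketch {

    // Collects rdf:li values into a flat list, optionally keeping
    // empty and/or duplicate entries.
    static List<String> collect(List<String> items,
                                boolean allowEmpty,
                                boolean allowDuplicates) {
        List<String> out = new ArrayList<>();
        for (String item : items) {
            if (!allowEmpty && item.trim().isEmpty()) {
                continue; // empty value dropped
            }
            if (!allowDuplicates && out.contains(item)) {
                continue; // repeated value dropped
            }
            out.add(item);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList("John", "Jane", "Bob", "Kate");
        List<String> last = Arrays.asList("Smith", "Doe", "", "Smith");

        // Dropping empties and duplicates leaves two values for four people.
        System.out.println(collect(last, false, false)); // [Smith, Doe]

        // Keeping them preserves one slot per person, so index i still
        // pairs first.get(i) with the matching last name.
        List<String> kept = collect(last, true, true);
        for (int i = 0; i < first.size(); i++) {
            System.out.println(first.get(i) + " -> '" + kept.get(i) + "'");
        }
    }
}
```

With both flags enabled the list keeps four entries, so Bob maps to an empty last name and Kate maps to Smith. With both disabled the list shrinks to two entries and the pairing by index is lost.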