Tim Allison created TIKA-1814:
---------------------------------
Summary: Add a standalone XMPScannerParser
Key: TIKA-1814
URL: https://issues.apache.org/jira/browse/TIKA-1814
Project: Tika
Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor
Several parsers read XMP data and normalize it, via Dublin Core (dc) or other
standards, into our Metadata object. Currently, we either rely on dependencies
to make sense of multiple XMP packets within a file (PDFBox for PDFParser), or
we just grab the first packet (TiffParser via JempboxExtractor and
XMPPacketScanner). Which other parsers are processing XMP?
It might be useful to extract all XMP packets from a file and store their raw
bytes as Base64-encoded Strings in the Metadata object. Advanced users would
then have access to the raw XMP streams.
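To make the proposal concrete, here is a minimal sketch of the scanning step in plain Java. It is not the proposed XMPScannerParser and uses no Tika APIs: the class name XmpPacketSketch and its methods are hypothetical. It assumes the packets are in an 8-bit encoding such as UTF-8 (UTF-16 or UTF-32 packets would not match) and that each packet runs from "<?xpacket begin=" through the "?>" that closes the following "<?xpacket end=" instruction.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;

// Hypothetical illustration only; not part of Tika.
final class XmpPacketSketch {
    private static final byte[] BEGIN = "<?xpacket begin=".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] END = "<?xpacket end=".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] CLOSE = "?>".getBytes(StandardCharsets.US_ASCII);

    /** Returns every complete XMP packet found in data, each as a Base64 String. */
    static List<String> extractPacketsAsBase64(byte[] data) {
        List<String> packets = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = indexOf(data, BEGIN, from);
            if (start < 0) break;
            int trailer = indexOf(data, END, start + BEGIN.length);
            if (trailer < 0) break; // header without trailer: ignore the truncated packet
            int close = indexOf(data, CLOSE, trailer + END.length);
            if (close < 0) break;
            int endExclusive = close + CLOSE.length;
            // Keep the raw bytes untouched so users can decode them exactly as stored.
            byte[] packet = Arrays.copyOfRange(data, start, endExclusive);
            packets.add(Base64.getEncoder().encodeToString(packet));
            from = endExclusive; // continue scanning after this packet
        }
        return packets;
    }

    /** Naive byte-pattern search; returns -1 when needle does not occur at or after from. */
    static int indexOf(byte[] haystack, byte[] needle, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= haystack.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) continue outer;
            }
            return i;
        }
        return -1;
    }
}
```

Each returned String could then be added as one value of a multi-valued metadata key; the choice of key name is left open here.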
For Tika 1.x, nothing would call this parser unless users explicitly configured
it. For Tika 2.x, once we get the configurable combination parsers set up, a
user could configure a combo/additive parser, e.g., a PDFParser that combines
our current PDFParser with this new XMPScannerParser.
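The additive idea can be sketched in plain Java as follows. This is only an illustration of the composition pattern, not the Tika 2.x design: the Extractor interface, combine, and add are hypothetical names, and metadata is modeled as a simple multi-valued map instead of Tika's Metadata class. Real parsers read from a stream, which would have to be buffered or otherwise re-readable before more than one parser could consume it; the sketch sidesteps that by passing a byte array.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names are Tika APIs.
final class AdditiveParserSketch {
    /** Stand-in for a parser: reads the bytes and contributes metadata values. */
    interface Extractor {
        void extract(byte[] data, Map<String, List<String>> metadata);
    }

    /** Builds an extractor that runs each part in order over the same bytes. */
    static Extractor combine(List<Extractor> parts) {
        return (data, metadata) -> {
            for (Extractor part : parts) {
                part.extract(data, metadata);
            }
        };
    }

    /** Appends a value under key, so later extractors add to earlier results. */
    static void add(Map<String, List<String>> metadata, String key, String value) {
        metadata.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }
}
```

Because values are appended and never overwritten, the output of the existing parser is preserved and the scanner's raw packets are simply added alongside it.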