[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215303#comment-14215303
 ] 

Chris A. Mattmann commented on TIKA-1445:
-----------------------------------------

Hey [[email protected]]:

Here are my replies (also I moved this convo to the dev list since I think it's 
super important!):

{noformat}
#1 We will use a default policy of “append”, which lets the Metadata
object append values to an existing key rather than replace them.
We could also couple this with X-Parsed-By, an ordered list of the
Parsers that ran, so that we can reconstruct which Parser contributed
which field. If a key is multi-valued, we can also add fields for
Offsets, etc. An alternative would be to prefix metadata keys in this
CompositeParser with the X-Parsed-By parser name, to avoid conflicts.
Users would be able to switch the policy from “append” to
“overwrite”, in which case conflicts aren’t a problem: we simply let
the last Parser to write a conflicting key take precedence. One
option with overwrite would be to let the policy specify a precedence
order of Parsers (e.g., the current service list could be the
precedence order).

That said, how sure are we that this is a *real* problem? Don’t some
parsers parse the same MediaType but contribute vastly different,
non-overlapping keys to the metadata object?

#2 I like your suggestion. The alternative I suggested would be to
reset the stream to the beginning after each parser, or to keep a
copy of the original stream and clone it for each Parser attempt.

#3 I like your idea about wrapping content provided by handlers
with the parser attribute. Very neat, let’s try that!

{noformat}
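
To make the “append” vs. “overwrite” idea in #1 concrete, here is a rough sketch in plain Java. It does not use Tika's Metadata class; MergePolicy, PolicyMetadata and put(parserName, key, value) are hypothetical names made up for illustration, and the parsedBy list only mimics the X-Parsed-By ordering described above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout; this is NOT Tika's Metadata API.
enum MergePolicy { APPEND, OVERWRITE }

final class PolicyMetadata {
    private final MergePolicy policy;
    // Insertion-ordered so keys come back in the order parsers wrote them.
    private final Map<String, List<String>> values = new LinkedHashMap<>();
    // Ordered list of parser names, in the spirit of X-Parsed-By.
    private final List<String> parsedBy = new ArrayList<>();

    PolicyMetadata(MergePolicy policy) {
        this.policy = policy;
    }

    // Record a value on behalf of a named parser.
    void put(String parserName, String key, String value) {
        if (!parsedBy.contains(parserName)) {
            parsedBy.add(parserName);
        }
        List<String> existing = values.computeIfAbsent(key, k -> new ArrayList<>());
        if (policy == MergePolicy.OVERWRITE) {
            // Last writer wins: drop whatever earlier parsers stored.
            existing.clear();
        }
        existing.add(value);
    }

    // Returns a copy of the values for a key (empty list if the key is absent).
    List<String> get(String key) {
        return new ArrayList<>(values.getOrDefault(key, Collections.emptyList()));
    }

    // Returns a copy of the parser names in first-seen order.
    List<String> parsedBy() {
        return new ArrayList<>(parsedBy);
    }
}
```

With APPEND, two parsers writing the same key leave both values in call order; with OVERWRITE, only the last value survives. A precedence order of Parsers could be layered on top by deciding the call order up front.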

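For #2, one way to give every parser the stream from the beginning is to buffer the bytes once and hand out a fresh stream per attempt. The sketch below assumes the input fits in memory; StreamConsumer and ReplayableStream are hypothetical stand-ins, not Tika types, and a real Parser takes more arguments than this.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a parser; Tika's Parser interface looks different.
interface StreamConsumer {
    String consume(InputStream in) throws IOException;
}

final class ReplayableStream {
    private final byte[] bytes;

    // Drain the source once and keep the bytes (assumes they fit in memory).
    ReplayableStream(InputStream source) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = source.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        this.bytes = buffer.toByteArray();
    }

    // Each call returns an independent stream positioned at the first byte.
    InputStream fresh() {
        return new ByteArrayInputStream(bytes);
    }

    // Run every consumer against its own copy, regardless of how much
    // the previous consumer read.
    List<String> runAll(List<StreamConsumer> consumers) throws IOException {
        List<String> results = new ArrayList<>();
        for (StreamConsumer consumer : consumers) {
            try (InputStream in = fresh()) {
                results.add(consumer.consume(in));
            }
        }
        return results;
    }
}
```

The other variant mentioned in #2, resetting a single stream, would rely on InputStream.mark and reset, which only work when markSupported() returns true. For large files, spooling to a temporary file is probably a better fit than an in-memory copy.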

> Figure out how to add Image metadata extraction to Tesseract parser
> -------------------------------------------------------------------
>
>                 Key: TIKA-1445
>                 URL: https://issues.apache.org/jira/browse/TIKA-1445
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>             Fix For: 1.8
>
>         Attachments: TIKA-1445.Mattmann.101214.patch.txt, 
> TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, 
> TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types, 
> consider how to add back in the metadata extraction capabilities by the other 
> Image parsers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
