[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14216444#comment-14216444 ]
Nick Burch commented on TIKA-1445:
----------------------------------

I think it's fairly common for people to have 4-5 parser services files, and whatever we do needs to accept that as a "normal" use case. Pretty much anyone depending on tika-parsers is going to have at least 2.

Think of the case of

{code:title="tika-parsers.jar:META-INF/services/org.apache.tika.parser.Parser"}
org.apache.tika.parser.gdal.GDALParser
org.apache.tika.parser.html.HtmlParser
org.apache.tika.parser.image.ImageParser
{code}

{code:title="my-tika-extension.jar:META-INF/services/org.apache.tika.parser.Parser"}
com.example.tika.ocr.customocrparser
org.apache.tika.parser.image.ImageParser
{code}

Under your plan, given that the JVM could return the two service files to you in any order, how do you decide which of GDALParser or ImageParser goes second after the OCR one? In one parser file, Image comes first, in the other it's second. Which wins? How do we make it deterministic, and not just based on which jar the JVM spots first?

> Figure out how to add Image metadata extraction to Tesseract parser
> -------------------------------------------------------------------
>
>                 Key: TIKA-1445
>                 URL: https://issues.apache.org/jira/browse/TIKA-1445
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>             Fix For: 1.8
>
>         Attachments: TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
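
The ordering problem raised in the comment can be sketched in plain Java. This is only an illustration, not Tika code: the class `ServiceFileOrder` and its methods `merge` and `mergeSorted` are hypothetical names, the two lists simply copy the service file entries quoted above, and sorting by class name is one assumed tie-break chosen for the example, not something the comment proposes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; none of these names exist in Tika.
public class ServiceFileOrder {
    // Entries copied from the two service files quoted in the comment.
    static final List<String> TIKA_PARSERS = Arrays.asList(
            "org.apache.tika.parser.gdal.GDALParser",
            "org.apache.tika.parser.html.HtmlParser",
            "org.apache.tika.parser.image.ImageParser");
    static final List<String> MY_EXTENSION = Arrays.asList(
            "com.example.tika.ocr.customocrparser",
            "org.apache.tika.parser.image.ImageParser");

    // Naive merge: concatenate files in the order they were discovered,
    // keeping the first occurrence of each class name.
    static List<String> merge(List<List<String>> filesInDiscoveryOrder) {
        Set<String> seen = new LinkedHashSet<>();
        for (List<String> file : filesInDiscoveryOrder) {
            seen.addAll(file);
        }
        return new ArrayList<>(seen);
    }

    // Assumed tie-break for this sketch: sort the merged names so the
    // result no longer depends on which file was discovered first.
    static List<String> mergeSorted(List<List<String>> filesInDiscoveryOrder) {
        List<String> merged = merge(filesInDiscoveryOrder);
        Collections.sort(merged);
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> tikaFirst = Arrays.asList(TIKA_PARSERS, MY_EXTENSION);
        List<List<String>> extensionFirst = Arrays.asList(MY_EXTENSION, TIKA_PARSERS);

        // The naive merge changes with discovery order.
        System.out.println(merge(tikaFirst));
        System.out.println(merge(extensionFirst));

        // The sorted merge is the same either way.
        System.out.println(mergeSorted(tikaFirst));
        System.out.println(mergeSorted(extensionFirst));
    }
}
```

With the naive merge, the position of ImageParser relative to GDALParser flips depending on which file is seen first, which is the non-determinism the comment asks about. A name-based sort is deterministic, but it discards whatever priority the authors of each file intended, so it answers only the "not based on which jar the JVM spots first" half of the question.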