[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217685#comment-14217685 ]
Dave Meikle commented on TIKA-1445:
-----------------------------------

bq. Hey Guys, to be honest, the way I see that we solve the ServiceLoading problem is somehow to move away from it. Relying on the JVM to implicitly decide which parser to load based on ClassLoading is not scalable IMO. At worst, even capturing an ordered preference file that isn't ServiceLoading is 1000x better IMO than relying on the JVM and the classpath. We need somehow to bring this logic into Tika (still thinking about how and will try to prototype something).

+1 - I think this is an example of something we will probably hit more and more as we further extend Tika, i.e. wanting multiple parsers to have an interest in, and then parse, content of the same mime type. Moving away from the re-ordering approach seems like the only way to go here.

_ServiceLoading_ per se is not a problem; indeed, it is a nice way to make it simple for external providers to be added. But I think we need to think about Parsers in a pipeline, and allow users to customise which parsers participate in the pipeline through positive exclusions via config.

The above is a big change, and I think if we went with something like this it would need to be a 2.X release of Tika.

I suspect the problem with clashing Metadata entries is not really there, as most parsers look for different keys, and in cases where they process common ones (e.g. title, size, description, etc.) they should hopefully be getting the same value anyway. I think we could send the same Metadata object through the 'pipeline', adding any unique new value in for a key.

Will join the party and try to flesh out thoughts on a branch.

bq. 3) It is a good idea to identify which parser produced each content with a <div> tag.

+1 - this will be really helpful.
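To make the pipeline idea above a bit more concrete, here is a rough sketch in plain Java. It is only an illustration of the proposal, not existing Tika code: `SharedMetadata`, `PipelineStage` and `ParserPipeline` are hypothetical names, the input is a plain String rather than a stream, and the exclusion set stands in for whatever the config would eventually provide.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a multi-valued metadata object (not Tika's Metadata class).
class SharedMetadata {
    private final Map<String, Set<String>> values = new LinkedHashMap<>();

    // Records the value under the key unless that exact value is already present.
    void addUnique(String key, String value) {
        values.computeIfAbsent(key, k -> new LinkedHashSet<>()).add(value);
    }

    // Returns the values for a key in insertion order, or an empty list if the key is unknown.
    List<String> get(String key) {
        return new ArrayList<>(values.getOrDefault(key, Collections.<String>emptySet()));
    }
}

// Hypothetical pipeline participant; a real parser would read a stream, not a String.
interface PipelineStage {
    String name();
    void extract(String input, SharedMetadata metadata);
}

// Runs every stage in its configured order, skipping stages the user has excluded by name.
class ParserPipeline {
    private final List<PipelineStage> stages;
    private final Set<String> excluded;

    ParserPipeline(List<PipelineStage> stages, Set<String> excluded) {
        this.stages = new ArrayList<>(stages);
        this.excluded = new LinkedHashSet<>(excluded);
    }

    // The same SharedMetadata instance is handed to each stage in turn.
    SharedMetadata run(String input) {
        SharedMetadata metadata = new SharedMetadata();
        for (PipelineStage stage : stages) {
            if (excluded.contains(stage.name())) {
                continue;
            }
            stage.extract(input, metadata);
        }
        return metadata;
    }
}
```

With an OCR stage and an image-metadata stage both registered, two stages reporting the same title would leave a single title value, while a key only one of them knows about would simply be added. Putting a stage's name in the exclusion set removes its contribution without touching the order of the others.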
> Figure out how to add Image metadata extraction to Tesseract parser
> -------------------------------------------------------------------
>
>                 Key: TIKA-1445
>                 URL: https://issues.apache.org/jira/browse/TIKA-1445
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>             Fix For: 1.8
>
>         Attachments: TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)