[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212246#comment-14212246 ]
Tim Allison edited comment on TIKA-1445 at 11/14/14 1:52 PM:
-------------------------------------------------------------

The AutoDetectParser was doing its regular lookup for which parser supported x file type. No luck in that.

Now, there is unfortunately something approaching luck in how we're handling the case where multiple parsers support a given file type. Our current algorithm, if I understand it correctly, is to sort parsers in reverse alphabetical order by their package+class name (with a special case of "prefer" non-o.a.t parsers) and then pick the first parser that claims that it will parse the given file type.

From the DefaultParser:
{noformat}
List<Parser> parsers = loader.loadStaticServiceProviders(Parser.class);
Collections.sort(parsers, new Comparator<Parser>() {
    public int compare(Parser p1, Parser p2) {
        String n1 = p1.getClass().getName();
        String n2 = p2.getClass().getName();
        boolean t1 = n1.startsWith("org.apache.tika.");
        boolean t2 = n2.startsWith("org.apache.tika.");
        if (t1 == t2) {
            return n1.compareTo(n2);
        } else if (t1) {
            return -1;
        } else {
            return 1;
        }
    }
});
{noformat}

and from CompositeParser:
{noformat}
public Map<MediaType, Parser> getParsers(ParseContext context) {
    Map<MediaType, Parser> map = new HashMap<MediaType, Parser>();
    for (Parser parser : parsers) {
        for (MediaType type : parser.getSupportedTypes(context)) {
            map.put(registry.normalize(type), parser);
        }
    }
    return map;
}
{noformat}

The "luck" so far is that, for example, the org.apache.tika.parser.gdal.GDALParser parser (which supports jpeg and gif) happens to sort after the org.apache.tika.parser.jpeg.JpegParser, the org.apache.tika.parser.image.ImageParser and the other o.a.t.p.image.* parsers. If you run the GDALParser on "/test-documents/testJPEG_EXIF.jpg", you get no metadata. :(

Depending on what the community thinks, we may want to open a separate issue and change DefaultParser's method of selecting a parser so that it:
1) selects non-o.a.t.
parsers first
2) respects the order of parsers in the services files

This wouldn't change the behavior, but it would allow users to select parser preference by a means other than relying on reverse alphabetical order.

was (Author: talli...@mitre.org):

The AutoDetectParser was doing its regular lookup for which parser supported x file type. No luck in that.

Now, there is unfortunately something approaching luck in how we're handling the case where multiple parsers support a given file type. Our current algorithm, if I understand it correctly, is to sort parsers in reverse alphabetical order by their package+class name (with a special case of "prefer" non-o.a.t parsers) and then pick the first parser that claims that it will parse the given file type.

From the DefaultParser:
{noformat}
List<Parser> parsers = loader.loadStaticServiceProviders(Parser.class);
Collections.sort(parsers, new Comparator<Parser>() {
    public int compare(Parser p1, Parser p2) {
        String n1 = p1.getClass().getName();
        String n2 = p2.getClass().getName();
        boolean t1 = n1.startsWith("org.apache.tika.");
        boolean t2 = n2.startsWith("org.apache.tika.");
        if (t1 == t2) {
            return n1.compareTo(n2);
        } else if (t1) {
            return -1;
        } else {
            return 1;
        }
    }
});
{noformat}

and
{noformat}
if (loader != null) {
    // Add dynamic parser service (they always override static ones)
    MediaTypeRegistry registry = getMediaTypeRegistry();
    List<Parser> parsers = loader.loadDynamicServiceProviders(Parser.class);
    Collections.reverse(parsers); // best parser last
    for (Parser parser : parsers) {
        for (MediaType type : parser.getSupportedTypes(context)) {
            map.put(registry.normalize(type), parser);
        }
    }
}
{noformat}

The "luck" so far is that, for example, the org.apache.tika.parser.gdal.GDALParser parser (which supports jpeg and gif) happens to sort after the org.apache.tika.parser.jpeg.JpegParser, the org.apache.tika.parser.image.ImageParser and the other o.a.t.p.image.* parsers.
If you run the GDALParser on "/test-documents/testJPEG_EXIF.jpg", you get no metadata. :(

Depending on what the community thinks, we may want to open a separate issue and change DefaultParser's method of selecting a parser so that it:
1) selects non-o.a.t. parsers first
2) respects the order of parsers in the services files

This wouldn't change the behavior, but it would allow users to select parser preference by a means other than relying on reverse alphabetical order.

> Figure out how to add Image metadata extraction to Tesseract parser
> -------------------------------------------------------------------
>
>                 Key: TIKA-1445
>                 URL: https://issues.apache.org/jira/browse/TIKA-1445
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>             Fix For: 1.8
>
>         Attachments: TIKA-1445.Mattmann.101214.patch.txt,
> TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt,
> TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types,
> consider how to add back in the metadata extraction capabilities by the other
> Image parsers.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
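
For readers who want to see the selection behaviour from the comment above in isolation, here is a minimal sketch that does not depend on Tika. It copies the comparator quoted from DefaultParser and the put() loop quoted from CompositeParser. As far as those two excerpts show, the list is sorted in ascending name order with org.apache.tika.* classes first, and a later put() overwrites an earlier one, which gives the same winner as "reverse alphabetical, first match, prefer non-o.a.t". FakeParser, sample(), com.example.CustomJpegParser and the media type lists are hypothetical stand-ins for illustration, not Tika classes or their real supported types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ParserOrderSketch {

    /** Hypothetical stand-in for a parser: a class name plus the types it claims. */
    static final class FakeParser {
        final String className;
        final List<String> types;

        FakeParser(String className, String... types) {
            this.className = className;
            this.types = Arrays.asList(types);
        }
    }

    /** Same ordering rule as the comparator quoted from DefaultParser. */
    static final Comparator<FakeParser> QUOTED_ORDER = new Comparator<FakeParser>() {
        public int compare(FakeParser p1, FakeParser p2) {
            String n1 = p1.className;
            String n2 = p2.className;
            boolean t1 = n1.startsWith("org.apache.tika.");
            boolean t2 = n2.startsWith("org.apache.tika.");
            if (t1 == t2) {
                return n1.compareTo(n2); // ascending by name within a group
            }
            return t1 ? -1 : 1;          // org.apache.tika.* sorts first
        }
    };

    /** Same loop shape as the quoted getParsers: the last put() for a type wins. */
    static Map<String, String> select(List<FakeParser> parsers) {
        List<FakeParser> sorted = new ArrayList<FakeParser>(parsers);
        Collections.sort(sorted, QUOTED_ORDER);
        Map<String, String> map = new HashMap<String, String>();
        for (FakeParser parser : sorted) {
            for (String type : parser.types) {
                map.put(type, parser.className);
            }
        }
        return map;
    }

    /** Illustrative parser set; the supported-type lists are assumptions. */
    static List<FakeParser> sample(boolean withThirdParty) {
        List<FakeParser> parsers = new ArrayList<FakeParser>();
        parsers.add(new FakeParser("org.apache.tika.parser.jpeg.JpegParser", "image/jpeg"));
        parsers.add(new FakeParser("org.apache.tika.parser.gdal.GDALParser", "image/jpeg", "image/gif"));
        parsers.add(new FakeParser("org.apache.tika.parser.image.ImageParser", "image/gif"));
        if (withThirdParty) {
            // Hypothetical non-o.a.t. parser; sorts last, so its put() happens last.
            parsers.add(new FakeParser("com.example.CustomJpegParser", "image/jpeg"));
        }
        return parsers;
    }

    public static void main(String[] args) {
        Map<String, String> chosen = select(sample(false));
        System.out.println(chosen.get("image/jpeg")); // org.apache.tika.parser.jpeg.JpegParser
        System.out.println(chosen.get("image/gif"));  // org.apache.tika.parser.image.ImageParser
        System.out.println(select(sample(true)).get("image/jpeg")); // com.example.CustomJpegParser
    }
}
```

In this sketch the GDAL stand-in never wins for jpeg or gif only because "gdal" sorts before "image" and "jpeg" by name. Renaming a package would change the outcome, which is the fragility the comment describes.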