[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222510#comment-14222510 ]
Nick Burch commented on TIKA-1445:
----------------------------------

I quite like Tim's idea. We can have things like {{TikaConfig.getDefaultConfig()}}, {{TikaConfig.getMaximumMetadataConfig()}}, {{TikaConfig.getTryEachInTurnConfig()}} etc. People with specific needs can either pass those in as options to a TikaConfig constructor, or they can provide a Tika config XML file that lists their preferences, perhaps with an expanded syntax like

{code}
<parser class="composite">
  <childparser>org.apache.tika.parser.jpeg.JPegParser</childparser>
  <childparser>...</childparser>
  <childparser>...</childparser>
  <childparser>org.apache.tika.parser.ocr.TesseractOCR</childparser>
</parser>
<parser class="tryinturn">
  <childparser>org.apache.tika.text</childparser>
  <childparser>org.apache.tika.text.findtextstrings</childparser>
</parser>
<parser class="defaultparser">
  <exclude>org.apache.tika.netcdf</exclude>
</parser>
{code}

The above slightly pseudocode example would do three things. For images, it would run all the image parsers in turn and merge their output. For plain text, it would try the normal parser, then fall back to the talked-about "bit like strings" parser if that failed. For everything else, it would use the default parser, excluding the netcdf parser.

> Figure out how to add Image metadata extraction to Tesseract parser
> -------------------------------------------------------------------
>
>                 Key: TIKA-1445
>                 URL: https://issues.apache.org/jira/browse/TIKA-1445
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>             Fix For: 1.8
>
>         Attachments: TIKA-1445.Mattmann.101214.patch.txt, 
> TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, 
> TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types, 
> consider how to add back in the metadata extraction capabilities by the other 
> Image parsers.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
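
To make the "composite" and "tryinturn" behaviours from the comment above a bit more concrete, here is a minimal Java sketch. Everything in it is hypothetical: {{SimpleParser}}, {{composite}}, {{tryInTurn}}, {{STRICT_TEXT}} and {{STRINGS_LIKE}} are illustrative names, not Tika's {{Parser}} API, and the rule that the earliest child wins when keys clash is an assumption rather than anything decided on the issue.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ParserStrategiesSketch {

    /** Hypothetical stand-in for a parser; this is NOT Tika's Parser interface. */
    public interface SimpleParser {
        Map<String, String> parse(byte[] input);
    }

    /** "composite": run every child in order and merge their output. */
    public static SimpleParser composite(List<SimpleParser> children) {
        return input -> {
            Map<String, String> merged = new LinkedHashMap<>();
            for (SimpleParser child : children) {
                // Assumption: on a key clash, the earlier child's value is kept.
                child.parse(input).forEach(merged::putIfAbsent);
            }
            return merged;
        };
    }

    /** "tryinturn": return the result of the first child that does not fail. */
    public static SimpleParser tryInTurn(List<SimpleParser> children) {
        return input -> {
            RuntimeException last = null;
            for (SimpleParser child : children) {
                try {
                    return child.parse(input);
                } catch (RuntimeException e) {
                    last = e; // remember the failure and move on to the next child
                }
            }
            throw new IllegalStateException("all child parsers failed", last);
        };
    }

    /** Toy "normal" text parser: rejects input that contains a NUL byte. */
    public static final SimpleParser STRICT_TEXT = input -> {
        for (byte b : input) {
            if (b == 0) {
                throw new IllegalArgumentException("NUL byte: not plain text");
            }
        }
        return Collections.singletonMap("text", new String(input, StandardCharsets.UTF_8));
    };

    /** Toy "bit like strings" fallback: keeps printable ASCII runs of 4+ characters. */
    public static final SimpleParser STRINGS_LIKE = input -> {
        StringBuilder out = new StringBuilder();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i <= input.length; i++) {
            boolean printable = i < input.length && input[i] >= 0x20 && input[i] < 0x7f;
            if (printable) {
                run.append((char) input[i]);
            } else {
                if (run.length() >= 4) {
                    if (out.length() > 0) {
                        out.append('\n');
                    }
                    out.append(run);
                }
                run.setLength(0);
            }
        }
        return Collections.singletonMap("text", out.toString());
    };

    public static void main(String[] args) {
        SimpleParser text = tryInTurn(Arrays.asList(STRICT_TEXT, STRINGS_LIKE));
        byte[] binary = "abc\0hello\0world!".getBytes(StandardCharsets.US_ASCII);
        // STRICT_TEXT fails on the NUL bytes, so STRINGS_LIKE supplies the result.
        System.out.println(text.parse(binary).get("text"));
    }
}
```

In this sketch a failing child aborts the whole composite, whereas try-in-turn swallows failures until one child succeeds. Whether a real composite should tolerate a failing child is one of the things the config syntax would need to spell out.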