[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

Nick Burch (JIRA) Sun, 23 Nov 2014 13:35:39 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222510#comment-14222510
 ]


Nick Burch commented on TIKA-1445:
----------------------------------

I quite like Tim's idea. We could have things like 
{{TikaConfig.getDefaultConfig()}}, {{TikaConfig.getMaximumMetadataConfig()}}, 
{{TikaConfig.getTryEachInTurnConfig()}} etc. People with specific needs can 
either pass those in as options to a TikaConfig constructor, or provide a 
Tika config XML file that lists their preferences, perhaps with an expanded 
syntax like:
{code}
<parser class="composite">
  <childparser>org.apache.tika.parser.jpeg.JPegParser</childparser>
  <childparser>...</childparser>
  <childparser>...</childparser>
  <childparser>org.apache.tika.parser.ocr.TesseractOCR</childparser>
</parser>
<parser class="tryinturn">
  <childparser>org.apache.tika.text</childparser>
  <childparser>org.apache.tika.text.findtextstrings</childparser>
</parser>
<parser class="defaultparser">
  <exclude>org.apache.tika.netcdf</exclude>
</parser>
{code}

The above, slightly pseudocode, example would run each of the image parsers 
in turn and merge their output. For plain text, it would try the normal 
parser, then fall back to the talked-about "bit like strings" parser if that 
failed. For everything else, it would use the default parser, but with the 
netcdf parser excluded.
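
To make the difference between the "composite" and "tryinturn" behaviours a bit more concrete, here is a rough, self-contained Java sketch. It does not use any real Tika classes: {{SimpleParser}}, {{mergeAll}} and {{tryInTurn}} are hypothetical names invented for illustration, and the rule that earlier children win on a metadata key clash is purely an assumption, since nothing above settles how conflicts should be resolved.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class ParserStrategies {
    /** Hypothetical stand-in for a parser: returns metadata, or throws on failure. */
    interface SimpleParser extends Function<String, Map<String, String>> {}

    /** "composite": run every child and merge the results. */
    static Map<String, String> mergeAll(List<SimpleParser> children, String input) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (SimpleParser child : children) {
            try {
                // Assumption: on a key clash, the earlier child's value is kept.
                child.apply(input).forEach(merged::putIfAbsent);
            } catch (RuntimeException e) {
                // A failing child does not discard what the others found.
            }
        }
        return merged;
    }

    /** "tryinturn": return the result of the first child that succeeds. */
    static Map<String, String> tryInTurn(List<SimpleParser> children, String input) {
        RuntimeException last = null;
        for (SimpleParser child : children) {
            try {
                return child.apply(input);
            } catch (RuntimeException e) {
                last = e; // remember the failure and move on to the next child
            }
        }
        throw new IllegalStateException("no parser succeeded", last);
    }

    public static void main(String[] args) {
        // Toy parsers standing in for an image metadata parser and an OCR parser.
        SimpleParser jpeg = in -> Map.of("Image Width", "640");
        SimpleParser ocr = in -> Map.of("text", "hello");
        System.out.println(mergeAll(List.of(jpeg, ocr), "photo.jpg"));

        // Toy parsers standing in for the normal text parser and the "strings" fallback.
        SimpleParser strict = in -> { throw new IllegalArgumentException("not plain text"); };
        SimpleParser strings = in -> Map.of("text", in.trim());
        System.out.println(tryInTurn(List.of(strict, strings), " raw bytes "));
    }
}
```

The point is that the composite case always calls every child, whereas the try-in-turn case stops at the first success, so the order of the {{<childparser>}} entries would matter in both, just for different reasons.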

> Figure out how to add Image metadata extraction to Tesseract parser
> -------------------------------------------------------------------
>
>                 Key: TIKA-1445
>                 URL: https://issues.apache.org/jira/browse/TIKA-1445
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>             Fix For: 1.8
>
>         Attachments: TIKA-1445.Mattmann.101214.patch.txt, 
> TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, 
> TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types, 
> consider how to add back in the metadata extraction capabilities by the other 
> Image parsers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
