[jira] [Created] (TIKA-3384) Convert new transcribe package to a Parser along the lines of OCR?

Tim Allison (Jira) Tue, 04 May 2021 14:26:09 -0700

Tim Allison created TIKA-3384:
---------------------------------

             Summary: Convert new transcribe package to a Parser along the lines of OCR?
                 Key: TIKA-3384
                 URL: https://issues.apache.org/jira/browse/TIKA-3384
             Project: Tika
          Issue Type: Task
            Reporter: Tim Allison



This is a proposal to convert [~lewismc] et al.'s awesome new transcribe code 
into a parser along the lines of the Tesseract OCR parser.

In 2.x, I inverted the call order from 1.x.  The image parsers now check 
whether a parser is registered for a pseudo mime type, such as 
{{image/ocr-jpeg}}; if there is one, they apply that parser to the stream.  
We could do the same thing for the media files that the new transcription 
package supports.
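
A minimal sketch of that lookup, to make the call order concrete. It does not use Tika's real classes: {{PseudoMimeDispatch}}, its registry, and the pseudo type {{audio/transcribe-mpeg}} are hypothetical names chosen for illustration; only {{image/ocr-jpeg}} comes from the description above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative only: a stand-in for the "media parser looks up a pseudo-mime parser" idea.
final class PseudoMimeDispatch {
    // Hypothetical registry: media type string -> function that extracts text from bytes.
    private final Map<String, Function<byte[], String>> parsers = new HashMap<>();

    void register(String mediaType, Function<byte[], String> parser) {
        parsers.put(mediaType, parser);
    }

    // Builds the pseudo type, e.g. ("image/jpeg", "ocr") -> "image/ocr-jpeg".
    static String pseudoType(String mediaType, String marker) {
        int slash = mediaType.indexOf('/');
        return mediaType.substring(0, slash + 1) + marker + "-" + mediaType.substring(slash + 1);
    }

    // The "outer" media parser: it does its own work first, then delegates
    // to a pseudo-mime parser only if one has been registered.
    String parse(String mediaType, String marker, byte[] data) {
        StringBuilder out = new StringBuilder("[metadata for " + mediaType + "]");
        Function<byte[], String> delegate = parsers.get(pseudoType(mediaType, marker));
        if (delegate != null) {
            out.append(' ').append(delegate.apply(data));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        PseudoMimeDispatch dispatch = new PseudoMimeDispatch();
        // Assumed pseudo type for a transcription parser; the real name is undecided.
        dispatch.register("audio/transcribe-mpeg", bytes -> "transcript of " + bytes.length + " bytes");
        System.out.println(dispatch.parse("audio/mpeg", "transcribe", new byte[4]));
        // No OCR parser registered here, so only the outer parser's output appears.
        System.out.println(dispatch.parse("image/jpeg", "ocr", new byte[4]));
    }
}
```

If no parser is registered for the pseudo type, the outer parser simply returns its own output, which is what lets transcription stay optional.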

Those who want only OCR/transcription can turn off the image parsers and 
then decorate the OCR parser, for example, with {{supports "image/jpeg"}}, 
so that the OCR parser is called directly.
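
The decoration step could look roughly like the following sketch. It is written against a hypothetical {{SimpleParser}} interface rather than Tika's actual Parser API, and {{withTypes}} here is a local helper defined in the sketch, so treat every name as an assumption.

```java
import java.util.Set;

// Illustrative only: wrapping a parser so that it claims a different set of media types.
final class SupportsDecoratorSketch {
    // Hypothetical minimal parser contract, not Tika's Parser interface.
    interface SimpleParser {
        Set<String> supportedTypes();
        String parse(byte[] data);
    }

    // Returns a parser that reports the given types but delegates the actual parsing.
    static SimpleParser withTypes(SimpleParser delegate, Set<String> types) {
        return new SimpleParser() {
            @Override public Set<String> supportedTypes() { return types; }
            @Override public String parse(byte[] data) { return delegate.parse(data); }
        };
    }

    // Stand-in OCR parser that by default only claims the pseudo type.
    static final SimpleParser OCR = new SimpleParser() {
        @Override public Set<String> supportedTypes() { return Set.of("image/ocr-jpeg"); }
        @Override public String parse(byte[] data) { return "ocr text (" + data.length + " bytes)"; }
    };

    public static void main(String[] args) {
        // With the image parsers turned off, the decorated OCR parser claims image/jpeg itself.
        SimpleParser direct = withTypes(OCR, Set.of("image/jpeg"));
        System.out.println(direct.supportedTypes().contains("image/jpeg"));
        System.out.println(direct.parse(new byte[3]));
    }
}
```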

What do you think?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
