[jira] [Commented] (TIKA-3384) Convert new transcribe package to a Parser along the lines of OCR?

Tim Allison (Jira) Tue, 18 May 2021 04:49:04 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-3384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346837#comment-17346837
 ]


Tim Allison commented on TIKA-3384:
-----------------------------------

ARGH... I noticed after I hit commit that the commit history was lost.  What 
looked like a git move in my editor became a git delete+add when I went to make 
the commits.  I reverted and redid the work from the command line with 
{{git mv}} and got the same results. 

[~lewismc], please take a look and let me know what you think.  I can revert 
this and try a third time... a single commit for the {{git mv}} and then 
another for the updates?

> Convert new transcribe package to a Parser along the lines of OCR?
> ------------------------------------------------------------------
>
>                 Key: TIKA-3384
>                 URL: https://issues.apache.org/jira/browse/TIKA-3384
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> This is a proposal to convert [~lewismc] et al's awesome new transcribe code 
> into a parser along the lines of Tesseract.  
> In 2.x, I inverted the call order from 1.x.  The image parsers now look to 
> see if there's a parser that supports a pseudo mime, like {{image/ocr-jpeg}}; 
> if there is, they apply that parser to the stream.  We could do the same 
> thing with media files that the new transcription package supports.  
> For those who want only OCR/transcription, they can turn off the image 
> parsers and then decorate the OCR parser, for example, with 
> {{supports "image/jpeg"}}, and that parser will be called directly.
> What do you think?
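
The pseudo-mime lookup described in the issue can be sketched roughly as
follows.  This is a minimal, self-contained illustration and not Tika's actual
API: SimpleParser, ParserRegistry, pseudoMime, and parseImage are hypothetical
names invented here, and the only detail taken from the issue is the
{{image/ocr-jpeg}} pseudo mime.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

final class PseudoMimeDispatchSketch {

    /** Hypothetical stand-in for a parser: raw bytes in, extracted text out. */
    interface SimpleParser extends Function<byte[], String> {}

    /** Hypothetical registry keyed by MIME type string (not Tika's real registry). */
    static final class ParserRegistry {
        private final Map<String, SimpleParser> parsers = new HashMap<>();

        void register(String mimeType, SimpleParser parser) {
            parsers.put(mimeType, parser);
        }

        Optional<SimpleParser> lookup(String mimeType) {
            return Optional.ofNullable(parsers.get(mimeType));
        }
    }

    /** Builds a pseudo mime, e.g. ("image/jpeg", "ocr") becomes "image/ocr-jpeg". */
    static String pseudoMime(String mimeType, String prefix) {
        int slash = mimeType.indexOf('/');
        return mimeType.substring(0, slash + 1) + prefix + "-" + mimeType.substring(slash + 1);
    }

    /**
     * Stand-in for an image parser: it always produces its own output, then
     * checks whether some parser claims the pseudo mime and, if so, applies
     * that parser to the same data.
     */
    static String parseImage(ParserRegistry registry, String mimeType, byte[] data) {
        String metadata = "[metadata for " + mimeType + "]";
        return registry.lookup(pseudoMime(mimeType, "ocr"))
                .map(ocr -> metadata + " " + ocr.apply(data))
                .orElse(metadata);
    }

    public static void main(String[] args) {
        ParserRegistry registry = new ParserRegistry();
        // Assumption: the OCR parser advertises support for the pseudo mime only.
        registry.register("image/ocr-jpeg", data -> "ocr-text(" + data.length + " bytes)");

        byte[] fakeImage = {1, 2, 3};
        System.out.println(parseImage(registry, "image/jpeg", fakeImage));
        System.out.println(parseImage(registry, "image/png", fakeImage));
    }
}
```

In this sketch the JPEG call picks up the OCR parser through the pseudo mime,
while the PNG call falls back to the image parser's own output because nothing
is registered for its pseudo mime.  Registering the same parser under
"image/jpeg" instead would loosely correspond to the "called directly" case
mentioned in the issue.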



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
