Hi,

On Wed, 2014-11-26 at 10:36 -0500, Matthew Roy wrote:
> I'm interested in developing some simple support for tesseract OCR of
> free-response blocks (principally numbers). Currently I'm using
> tesseract for post-processing CSV export images, but it would be useful
> to move this into the SDAPS python code that recognizes textboxes using
> python-tesseract.
> 
> Does anyone have thoughts regarding the best way to approach this so
> that it can be merged back into SDAPS, rather than just being a private
> fork?

My initial thought was that it would make sense to create a
special OCR "box" type. It would be only slightly different from a normal
textbox:
 * It might have some marks to guide the user (there is even LaTeX code
for this in the "ocr" branch, if you want to look). These are just slight
ticks that force block writing.
 * It is just a subclass of the normal textbox. This is only required to
enable the OCR step. At some later point it might also store
wordlists, etc. for heuristics (e.g. a list of languages for language
fields).
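
To make the subclass idea concrete, here is a minimal sketch. The names
Textbox, OCRTextbox, wordlist and needs_ocr are all made up for
illustration; they are not the actual SDAPS classes, and the real
textbox type will look different.

```python
class Textbox:
    """Stand-in for the existing textbox type (hypothetical, not SDAPS code)."""

    def __init__(self, name):
        self.name = name
        self.text = ""


class OCRTextbox(Textbox):
    """Textbox variant that exists mainly to opt in to the OCR step."""

    def __init__(self, name, wordlist=None):
        super().__init__(name)
        # Optional hints for later heuristics, e.g. a list of language names.
        self.wordlist = list(wordlist or [])


def needs_ocr(box):
    """Return True only for boxes that asked for OCR by being an OCRTextbox."""
    return isinstance(box, OCRTextbox)
```

The point of the sketch is that the recognition code only has to check
the type of the box; normal textboxes stay untouched.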

I am not sure how much of this is already implemented in the "ocr"
branch. I think the LaTeX code is pretty much there, and adding the
subclass is relatively straightforward.


On the technical side, I am currently thinking that it might make sense
to split this out into a new package. I am mostly concerned about two things:
 1. dependencies
 2. further development

I don't want python-tesseract to be a hard dependency for "recognize" if
the user does not have any OCR fields. So it should only be loaded if an
OCR field exists; otherwise, a graceful fallback needs to be in place.
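
A rough sketch of what I mean by loading it only on demand. Everything
here is an assumption for illustration: the function name
load_ocr_module, the wants_ocr attribute, and the default module name
"tesseract" (I have not checked what python-tesseract is actually
imported as).

```python
import importlib
import warnings


def load_ocr_module(boxes, module_name="tesseract"):
    """Import the OCR module only when at least one box wants OCR.

    Returns the imported module, or None when no box needs OCR or the
    module is not installed (in which case a warning is emitted).
    """
    # "wants_ocr" is a hypothetical marker attribute on the box objects.
    if not any(getattr(box, "wants_ocr", False) for box in boxes):
        return None
    try:
        return importlib.import_module(module_name)
    except ImportError:
        warnings.warn(
            "OCR fields found but %r is not installed; skipping OCR" % module_name
        )
        return None
```

With something like this, a survey without OCR fields never touches the
import, and a missing library degrades to a warning instead of a crash.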

The second point is that in general it seems to me that tesseract (while
maybe pretty good) is not an ideal solution for handwriting recognition.
This means that we might want to add support for other OCR methods at a
later point. Putting the OCR code into a separate package and loading
tesseract based on an attribute of the OCR box might work nicely.
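
As a sketch of how the backend could be selected by an attribute of the
box: the registry, the ocr_backend attribute and the "dummy" backend
below are invented for illustration, and a real tesseract backend would
be registered in the same way from the separate package.

```python
OCR_BACKENDS = {}


def register_backend(name):
    """Decorator that stores an OCR function under the given backend name."""
    def decorator(func):
        OCR_BACKENDS[name] = func
        return func
    return decorator


@register_backend("dummy")
def dummy_backend(image):
    # Placeholder backend: recognizes nothing and returns an empty string.
    return ""


def recognize(box, image):
    """Run the backend named by box.ocr_backend, or return None if unknown."""
    backend = OCR_BACKENDS.get(getattr(box, "ocr_backend", None))
    if backend is None:
        return None
    return backend(image)
```

Adding another OCR method later would then only mean registering one
more function, without changing the box type or the calling code.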


I hope this clears things up a bit,
Benjamin
