Hi,

On Wed, 2014-11-26 at 10:36 -0500, Matthew Roy wrote:
> I'm interested in developing some simple support for tesseract OCR of
> free-response blocks (principally numbers). Currently I'm using
> tesseract for post-processing CSV export images, but it would be useful
> to move this into the SDAPS python code that recognizes textboxes using
> python-tesseract.
>
> Does anyone have thoughts regarding the best way to approach this so
> that it can be merged back into SDAPS, rather than just being a private
> fork?

My initial thought was that it would make sense to create a special OCR
"box" type. It would differ only slightly from a normal textbox:

 * It might have some marks to guide the user (there is even LaTeX code
   for this in the "ocr" branch, if you want to look): just slight ticks
   to force block writing.
 * It would just be a subclass of the normal textbox. This is only
   required to enable the OCR step. At some later point it might also
   store wordlists, etc. for heuristics (e.g. a list of languages for
   language fields).

I am not sure how much of this is already implemented in the "ocr"
branch. I think the LaTeX code is pretty much there, and adding the
subclass is relatively straightforward.

Technically, I am currently thinking that it might make sense to split
this out into a new package. I am mostly concerned with two things:

 1. dependencies
 2. further development

I don't want python-tesseract to be a hard dependency for "recognize" if
the user does not have any OCR fields. So it should only be loaded if an
OCR field exists, or a graceful fallback needs to be in place.

The second point is that, in general, it seems to me that tesseract
(while maybe pretty good) is not an ideal solution for handwriting
recognition. This means that we might want to add support for other OCR
methods at a later point.

Putting the OCR code into a separate package and loading tesseract based
on an attribute of the OCR box might work nicely.

I hope this clears things up a bit,
Benjamin
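
For illustration, the subclass and on-demand loading idea described above could look roughly like the sketch below. It is not taken from the SDAPS code or the "ocr" branch: the class names `Textbox` and `OCRTextbox`, the `engine` attribute, the `BACKEND_MODULES` table and the helper functions are all hypothetical, and the module name "pytesseract" is only an assumed placeholder for whichever tesseract binding ends up being used.

```python
import importlib

# Hypothetical mapping from an engine name to the module implementing it.
# "pytesseract" is an assumed example; any importable module would do.
BACKEND_MODULES = {"tesseract": "pytesseract"}


class Textbox:
    """Stand-in for a normal free-text box (not the actual SDAPS class)."""

    def __init__(self, name):
        self.name = name


class OCRTextbox(Textbox):
    """Textbox subclass whose only job is to enable the OCR step."""

    def __init__(self, name, engine="tesseract", wordlist=None):
        super().__init__(name)
        self.engine = engine            # selects which backend to load
        self.wordlist = wordlist or []  # optional hints for later heuristics


def load_backend(engine):
    """Import the backend for `engine` on demand; return None if unavailable."""
    module_name = BACKEND_MODULES.get(engine)
    if module_name is None:
        return None
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Graceful fallback: the caller can skip OCR for these boxes.
        return None


def engines_needed(boxes):
    """Return the set of engines requested by the OCR boxes, if any."""
    return {box.engine for box in boxes if isinstance(box, OCRTextbox)}


def load_backends(boxes):
    """Load only the backends that the given boxes actually request."""
    return {engine: load_backend(engine) for engine in engines_needed(boxes)}
```

With this layout, a questionnaire without any OCR box never triggers an import, and a missing backend yields `None` instead of an exception, so "recognize" could carry on without OCR for those boxes. Other OCR methods would only need a further entry in the table.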
