Hi all, I am fairly new to OCRopus and OCR in general. My recent work has been in language modeling for automatic speech recognition (ASR).
We recently started a new project in which we'll be applying various language modeling techniques from our ASR work to OCR. OCRopus seems best suited for this, since it already supports language modeling through OpenFst.

The first thing I'm interested in experimenting with is post-processing of OCR recognition lattices (e.g. re-scoring those lattices with other language models). I've managed to get OCRopus to output lattices in the latest development HG checkout, using the ocropy command "ocropus-linerec" (which seems to output them by default).

However, as far as I can tell, the binary OpenFst files do not contain embedded symbol tables. That is, if I convert the binary FST to a text FST, I get something like this:

    0 1 65537 33 16.7413559
    0 24 65537 33 11.7413559
    0 1 65537 49 17.3978558
    0 25 65537 49 12.3978558

I believe the number 65537 is an "input label", and 33 and 49 are "output labels". I am guessing that the input labels are image segment IDs, and the output labels are the IDs of letters, or sequences of letters?

If that's correct, what I am most interested in is how to access the letters corresponding to each "output" ID. Is there any way to do this, and/or to add this feature?

Best,
Ben

--
Benjamin Lambert
Ph.D. Student of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~belamber
Mobile: 617-869-1844
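
To make the question concrete, here is a minimal Python sketch of how the text-format arcs above could be parsed and how the guess about the labels could be tested. It relies on two unconfirmed assumptions: that output labels are Unicode code points (so 33 would be "!" and 49 would be "1"), and that an input label packs two 16-bit segment indices (65537 = 0x10001, i.e. 1 and 1). The names `Arc`, `parse_arcs`, `decode_olabel`, and `split_ilabel` are hypothetical helpers, not part of OCRopus or OpenFst.

```python
from typing import List, NamedTuple, Tuple

# The four arcs quoted in the message, in OpenFst text format:
# source state, destination state, input label, output label, weight.
SAMPLE = """\
0 1 65537 33 16.7413559
0 24 65537 33 11.7413559
0 1 65537 49 17.3978558
0 25 65537 49 12.3978558
"""


class Arc(NamedTuple):
    src: int
    dst: int
    ilabel: int
    olabel: int
    weight: float


def parse_arcs(text: str) -> List[Arc]:
    """Parse arc lines of a text-format FST, skipping final-state lines."""
    arcs = []
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) not in (4, 5):
            continue  # final-state lines have only 1 or 2 fields
        src, dst, ilabel, olabel = (int(f) for f in fields[:4])
        # An omitted weight means the semiring "one" (0.0 for tropical).
        weight = float(fields[4]) if len(fields) == 5 else 0.0
        arcs.append(Arc(src, dst, ilabel, olabel, weight))
    return arcs


def decode_olabel(olabel: int) -> str:
    """ASSUMPTION: output labels are Unicode code points; 0 is epsilon."""
    return chr(olabel) if olabel > 0 else ""


def split_ilabel(ilabel: int) -> Tuple[int, int]:
    """ASSUMPTION: input labels pack two 16-bit segment indices."""
    return ilabel >> 16, ilabel & 0xFFFF


if __name__ == "__main__":
    for arc in parse_arcs(SAMPLE):
        print(arc.src, arc.dst, split_ilabel(arc.ilabel),
              repr(decode_olabel(arc.olabel)), arc.weight)
```

If the decoded characters look plausible for the recognized line, the code-point assumption is probably right. If they do not, the labels most likely index into a separate symbol table that would have to be exported alongside the lattice.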
