> > Where does this version of the code stand relative to production
> > quality code? Is it getting close or still a long way or ...?
Right now, preprocessing and layout analysis are the least reliable parts: they work well on the kinds of documents they were designed for (books and journal articles scanned at 300-600 dpi on a flatbed scanner), but noise, distortions, and other resolutions make them fail fairly easily. Commercial OCR systems achieve their performance through careful engineering and testing, but we don't have the engineers to do that. Instead, we're looking to machine learning to solve these problems; that makes the problem harder, but hopefully leads to better solutions in the long run. In any case, a number of much-improved modules are already in the pipeline for this year that should help make OCRopus more robust and suitable for more applications.

> The FST stage of the processing, in particular, seems incredibly
> resource heavy without really doing much improvement of the raw text
> generated by earlier stages.

That's one of the reasons the language modeling has been refactored. OCRopus 0.5 now writes its recognition lattices to a simple text file format, which makes it easy to replace the FST-based language modeling entirely with other approaches.
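To make the idea concrete, here is a rough sketch of what plugging a different language model into such a text lattice could look like. The line format ("start stop char cost", with additive negative-log-probability costs and segment indices increasing left to right), the file name "line.lattice", and the toy bigram model are all assumptions made for illustration; they are not the actual OCRopus 0.5 format or API, so the parser would need to be adapted to whatever the real files contain:

# Sketch: decode a plain-text recognition lattice with a custom
# character-level language model instead of the FST stage.
# The lattice format below is an ASSUMPTION for illustration only.

from collections import defaultdict

def read_lattice(path):
    """Parse the assumed text lattice into edges keyed by their start node."""
    edges = defaultdict(list)
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 4:
                continue  # skip blank or malformed lines
            start, stop, char, cost = int(parts[0]), int(parts[1]), parts[2], float(parts[3])
            edges[start].append((stop, char, cost))
    return edges

def best_path(edges, bigram_cost, last_node, first_node=0):
    """Cheapest path through the lattice, combining recognizer costs with a
    caller-supplied character bigram cost.  Nodes are processed in index
    order, which is a valid topological order under the assumption that
    every edge goes from a lower to a higher segment index."""
    # best[node] = (total cost, text so far, last character on the path)
    best = {first_node: (0.0, "", "")}
    for start in sorted(edges):
        if start not in best:
            continue  # node not reachable from the start
        total, text, prev = best[start]
        for stop, char, cost in edges[start]:
            new_total = total + cost + bigram_cost(prev, char)
            if stop not in best or new_total < best[stop][0]:
                best[stop] = (new_total, text + char, char)
    return best.get(last_node, (float("inf"), "", ""))[1]

def toy_bigram_cost(prev, char):
    """Placeholder model that penalizes consonant clusters.  A real
    replacement would plug in an n-gram model or dictionary lookup here."""
    if prev and prev.lower() not in "aeiou " and char.lower() not in "aeiou ":
        return 1.0
    return 0.0

if __name__ == "__main__":
    edges = read_lattice("line.lattice")  # hypothetical file name
    last = max(stop for out in edges.values() for stop, _, _ in out)
    print(best_path(edges, toy_bigram_cost, last))

Because the recognizer costs are additive, anything that can score a character given its left context (an n-gram model, a dictionary, a neural model) can slot into the same dynamic-programming search in place of toy_bigram_cost.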
Tom