Re: Image pre-processing for good OCR results
On Sun, Feb 20, 2011 at 6:02 PM, Jon Andersen jande...@gmail.com wrote:

Hi,

My project at http://RecordAGrave.com is about recording headstones from graves and posting the text and images on the Net so that people can research their family history. I would appreciate some advice on how to pre-process these headstone images to get the best results from Tesseract OCR. I have thousands of 1-2 MB jpg images of headstones to process.

Example images:
http://freepages.genealogy.rootsweb.ancestry.com/~janderse/cemeteries/Star%20of%20David%20Memorial%20Gardens/Garden%20of%20Haifa%20-%20Raw/IMG_28215.jpg
http://freepages.genealogy.rootsweb.ancestry.com/~janderse/cemeteries/Star%20of%20David%20Memorial%20Gardens/Garden%20of%20Haifa%20-%20Raw/IMG_28216.jpg
http://freepages.genealogy.rootsweb.ancestry.com/~janderse/cemeteries/Star%20of%20David%20Memorial%20Gardens/Garden%20of%20Haifa%20-%20Raw/IMG_28217.jpg

I am a software developer so I can script up pre-processing steps to prepare the input for Tesseract. Any advice on improving OCR accuracy through pre-processing steps?

Thanks so much,
-Jon

--
You received this message because you are subscribed to the Google Groups tesseract-ocr group. To post to this group, send email to tesseract-ocr@googlegroups.com. To unsubscribe from this group, send email to tesseract-ocr+unsubscr...@googlegroups.com. For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en.

I guess I'm a bit surprised that no one has yet mentioned the fact that the Leptonica C Image Processing Library (http://www.leptonica.com) is now required to build tesseract-ocr -- or soon will be... the current state of tesseract-ocr is a bit hazy. My understanding is that eventually (not in the near future though) tesseract-ocr will only use Leptonica PIXs as its in-memory image representation.

A still unofficial, easier to read, Sphinx-generated version of the Leptonica documentation is at http://tpgit.github.com/UnOfficialLeptDocs/. Dan is currently hammering away at v1.68 and it should be out soon (this week?), at which point I'll also update my unofficial version of the documentation.

My admittedly quick/biased opinion was that OpenCV focused on Computer Vision and that Leptonica has more pure Image Processing routines. I also find Leptonica's source code fairly easy to read because one of the purposes of the library is to try to teach image processing concepts.

In any case, if you're planning on using tesseract-ocr 3.x, then you already must have liblept, so you might as well try it out.

-- TP
Re: Image pre-processing for good OCR results
On Feb 20, 9:02 pm, Jon Andersen jande...@gmail.com wrote:

My project at http://RecordAGrave.com is about recording headstones from graves and posting the text and images on the Net so that people can research their family history. I would appreciate some advice on how to pre-process these headstone images to get the best results from Tesseract OCR. I have thousands of 1-2 MB jpg images of headstones to process.

Post-image capture is too late for one of the most important enhancements, namely high-contrast lighting. It's not really an issue with stones that have the carving painted or are otherwise naturally high contrast, but for many stones sharp oblique lighting is important to get an image that's readable by humans, let alone OCR software.

Once you've got the best quality image capture you can manage, you'll probably find that you need to use different image processing pipelines for different types of stones and carving, so the first step will be to categorize the stone and figure out which pipeline to run it through (or run it through them all and compare the results).

In addition to image processing, you may also be able to improve results by making use of the fact that the vocabulary and layout of the text are much more constrained than free text.

It'll be interesting to see what kind of results you get. I suspect it's going to be a fairly challenging project for the general case, but you may be able to pick off the low-hanging fruit and gradually expand the types of stones you can handle.

Tom
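Tom's point about the constrained vocabulary could be applied as a correction pass over the OCR output. The sketch below is only an illustration and was not posted in the thread: the word list, the names `snap_to_lexicon` and `correct_line`, and the 0.75 similarity cutoff are all assumptions, and it relies on Python's standard `difflib` module rather than on any Tesseract feature.

```python
import difflib

# Hypothetical lexicon of words that commonly appear on headstones.
# A real list would be built from already-transcribed stones.
LEXICON = ["BELOVED", "FATHER", "MOTHER", "HUSBAND", "WIFE",
           "BORN", "DIED", "MEMORY"]


def snap_to_lexicon(token, lexicon=LEXICON, cutoff=0.75):
    """Return the closest lexicon word, or the token unchanged if none is close.

    The cutoff is an assumed value; too low a cutoff would rewrite
    surnames into vocabulary words.
    """
    matches = difflib.get_close_matches(token.upper(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_line(text):
    """Apply the lexicon correction to every whitespace-separated token."""
    return " ".join(snap_to_lexicon(token) for token in text.split())


if __name__ == "__main__":
    print(correct_line("BEL0VED M0THER OF SMITH"))
```

Names and dates would need their own handling (for example, a date pattern check), since a general word list cannot validate them.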
Re: Image pre-processing for good OCR results
Vicky,

I may be able to convert your local-minima code to OpenCV code; can you send me the result files as well as the filter?

I wrote some Python code that uses OpenCV to crop the headstone images to show just the stone. It's not perfect, but it works OK. The Hough algorithm and the other corner-detection algorithms weren't working at all for me. So I just thresholded based on the average saturation value, row by row and column by column, to find a rectangle that was saturated enough, and then cropped to that rectangle. It is overly simple and dumb; however, it does somewhat work, whereas the other algorithms just gave me insane corners and didn't detect the headstone at all.

Reference images:
http://freepages.genealogy.rootsweb.ancestry.com/~janderse/cemeteries/Star%20of%20David%20Memorial%20Gardens/Garden%20of%20Haifa/

Thanks!!
-Jon Andersen
Software engineer
Citrix Systems, Inc
954-973-4908 (home)

On Mon, Feb 21, 2011 at 11:47 PM, Vicky Budhiraja vicky.vi...@gmail.com wrote:

Hi Jon,

The code I have written is in MATLAB. Will you be able to convert it into OpenCV code? Lemme know.

In OpenCV, simple thresholding should work. My method (local minima) is a little more complicated (and more accurate) than simple thresholding, and is therefore hard to implement in C++ because of the interpolation step. I think OpenCV can do this, but we need to take a closer look at this step.

Best Regards,
Vicky

-Original Message-
From: Jon Andersen [mailto:jande...@gmail.com]
Sent: Monday, February 21, 2011 23:42
To: Vicky Budhiraja
Subject: Re: Image pre-processing for good OCR results

Vicky,

Thank you so much for responding! I appreciate your help with this project. I have taken thousands of photos of headstones, and am trying to use Tesseract on them. I will make the results available through findagrave.com, so that people can search for their relatives.

Here is a whole directory of sample images:
http://freepages.genealogy.rootsweb.ancestry.com/~janderse/cemeteries/Star%20of%20David%20Memorial%20Gardens/Garden%20of%20Haifa/

Could you send me the code or results that you found? I am trying to use OpenCV to do the image pre-processing.

Thanks!!!
-Jon
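The row-and-column saturation crop Jon describes in this message might look roughly like the following. This is a reconstruction from his description, not his actual script (which was not posted): it uses plain Python lists of (r, g, b) tuples instead of OpenCV images so that it runs without dependencies, and the function names and the fixed 0.3 threshold are assumptions.

```python
import colorsys


def saturation(pixel):
    """HSV saturation (0.0 to 1.0) of an (r, g, b) pixel with 0-255 channels."""
    r, g, b = pixel
    return colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)[1]


def saturated_span(means, threshold):
    """Return (start, stop) covering the first to last index whose mean passes."""
    hits = [i for i, mean in enumerate(means) if mean >= threshold]
    if not hits:
        return None
    return hits[0], hits[-1] + 1


def crop_to_saturated(image, threshold=0.3):
    """Crop to the rectangle whose rows and columns are saturated enough.

    `image` is a list of rows, each a list of (r, g, b) tuples.
    The threshold is an assumed value and would need tuning per data set.
    """
    sat = [[saturation(pixel) for pixel in row] for row in image]
    row_means = [sum(row) / len(row) for row in sat]
    col_means = [sum(col) / len(col) for col in zip(*sat)]
    rows = saturated_span(row_means, threshold)
    cols = saturated_span(col_means, threshold)
    if rows is None or cols is None:
        return image  # nothing saturated enough: leave the image untouched
    (top, bottom), (left, right) = rows, cols
    return [row[left:right] for row in image[top:bottom]]


if __name__ == "__main__":
    grey, red = (128, 128, 128), (200, 50, 50)
    demo = [[red if 1 <= x <= 4 and 1 <= y <= 4 else grey for x in range(6)]
            for y in range(6)]
    cropped = crop_to_saturated(demo)
    print(len(cropped), len(cropped[0]))
```

Because the span runs from the first to the last qualifying row or column, a single saturated object elsewhere in the frame (grass, flowers) would widen the crop, which may be one reason the approach is "not perfect".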
RE: Image pre-processing for good OCR results
Dear Jon,

To begin the analysis, I also tried to detect lines and corners, but the results were not good. I think this is because the images are low contrast. Please try analyzing the images with some line profiles of the data:

ROI-left-profile: https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576706091073985362
ROI-top-profile: https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576706094761082706
ROI-right-profile: https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576706102033630978
ROI-bottom-profile: https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576706106389606898

After ROI detection, you may need to align the image. My solution for this step is:
- Detect all lines (Hough transform approach), then keep only the lines whose slopes are close to horizontal.
- Estimate the base slope as the mean of those slopes.
- Align the image.

Here are the detected lines: https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576709473940745778

Hope it's helpful to you!

Good luck,
Cong.
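The slope-estimation part of Cong's alignment steps can be sketched as below. It assumes the line segments have already been found by some Hough transform implementation (for example OpenCV's probabilistic Hough transform) and are passed in as (x1, y1, x2, y2) tuples. The function name and the 15-degree tolerance for "close to horizontal" are assumptions, since Cong's own code and parameters were not posted.

```python
import math


def estimate_skew_degrees(segments, max_tilt_degrees=15.0):
    """Mean angle, in degrees, of the near-horizontal segments.

    `segments` is an iterable of (x1, y1, x2, y2) tuples. Segments tilted
    more than `max_tilt_degrees` from horizontal are ignored.
    Returns 0.0 when no segment qualifies.
    """
    angles = []
    for x1, y1, x2, y2 in segments:
        angle = math.degrees(math.atan2(y2 - y1, x2 - x1))
        # Fold the direction into (-90, 90] so that a segment drawn
        # right-to-left gives the same angle as one drawn left-to-right.
        if angle > 90.0:
            angle -= 180.0
        elif angle <= -90.0:
            angle += 180.0
        if abs(angle) <= max_tilt_degrees:
            angles.append(angle)
    if not angles:
        return 0.0
    return sum(angles) / len(angles)


if __name__ == "__main__":
    lines = [(0, 0, 100, 5), (100, 15, 0, 10), (50, 0, 50, 100)]
    print(round(estimate_skew_degrees(lines), 3))
```

The image would then be rotated by this angle in whichever direction levels the lines; the sign depends on the coordinate convention of the library doing the rotation, so it is worth checking on one test image first.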
Re: Image pre-processing for good OCR results
Whoops, sorry - links were broken for a bit. I just fixed the image links, they should work now.

Thanks!!
-Jon
Re: Image pre-processing for good OCR results
Hi Vicky,

I have an interest in theology and just wanted to know which of the god(s) are you god fearing of? In my experience, the phrase "god fearing" has been used predominantly by Christians. I checked your LinkedIn profile and confirmed you are from India. Less than 3% of Indians are Christians -- so, based on this statistic, I would guess you are not a Christian. Over 80% of Indians are Hindus -- and if I had to make a guess about any Indian's religion, I would go with that one. Are you a Hindu? Hinduism is a polytheistic religion, isn't it? Why would you only be a God fearing person versus a gods fearing person?

Finally, is there some significance that headstones have in your religion (whatever it may be) that made you unable to ignore Jon's email?

Hope you don't mind the questions. They are really just due to my interest in world religions and world views.

Thanks,
KIP
RE: Image pre-processing for good OCR results
Dear Jon,

Try analyzing the images with the following preprocessing steps:

Step 1: Detect the ROI.
https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576366756516993234

Step 2: Apply a low-pass FFT filter, with parameters:
- intensity threshold: 130
- FFT cutoff: 15%
https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576366759922523650

Step 3: Scale the image by a scale factor.
https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576366756371708834

Step 4: Try recognition with Tesseract or other engines.
https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576366764338605922

Step 5: Post-processing is required.

Good luck,
Cong.
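Step 2 above could be approximated as follows. This is a sketch under several assumptions, because Cong's implementation was not posted: the "15%" cutoff is interpreted here as a fraction of the Nyquist frequency, the intensity threshold of 130 is applied after filtering, and the function names are invented. To stay dependency-free it uses a naive discrete Fourier transform that is only practical for tiny images; a real pipeline would use an FFT library.

```python
import cmath


def dft(seq, inverse=False):
    """Naive O(n^2) discrete Fourier transform of a 1-D sequence."""
    n = len(seq)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t, value in enumerate(seq):
            acc += value * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def dft2(grid, inverse=False):
    """2-D transform: transform every row, then every column."""
    rows = [dft(row, inverse) for row in grid]
    cols = [dft(list(col), inverse) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def low_pass(image, cutoff=0.15):
    """Zero every frequency whose radius exceeds `cutoff` times Nyquist.

    `image` is a list of rows of grey levels. Treating the cutoff as a
    fraction of the Nyquist frequency (0.5 cycles per pixel) is an assumption.
    """
    height, width = len(image), len(image[0])
    spectrum = dft2(image)
    for v in range(height):
        fv = min(v, height - v) / height  # cycles per pixel, 0 to 0.5
        for u in range(width):
            fu = min(u, width - u) / width
            if (fu * fu + fv * fv) ** 0.5 > cutoff * 0.5:
                spectrum[v][u] = 0j
    return [[value.real for value in row] for row in dft2(spectrum, inverse=True)]


def binarize(image, threshold=130):
    """Map grey levels above the threshold to 255 and the rest to 0."""
    return [[255 if value > threshold else 0 for value in row] for row in image]


if __name__ == "__main__":
    checker = [[200 if (x + y) % 2 == 0 else 0 for x in range(8)] for y in range(8)]
    smoothed = low_pass(checker)
    print(round(smoothed[0][0], 3), binarize(smoothed)[0][0])
```

On the 8x8 checkerboard in the demo, the only non-zero frequencies are the mean and the highest frequency, so the filter flattens the pattern to its mean grey level.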
RE: Image pre-processing for good OCR results
Hi Jon,

Like each morning, I checked my emails and saw those headstone images from graves. I am a God fearing person, so I was not able to ignore your email.

Regarding the preprocessing step, I suggest applying the Local Minima method for background removal. However, you may need to adjust your window size in order to achieve the best results. I did some experiments with the MATLAB code, and I got some good results. Testing on a larger sample set may improve the step.

Please tell me what project you are working on; maybe I will be able to contribute better? Just lemme know if you need any type of help!

Best Regards,
Vicky
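Vicky's MATLAB code was not posted to the list, so the following is not the Local Minima method itself. It is a much simpler stand-in that shows why the window size matters: it treats the minimum grey level inside a sliding window as the local background and subtracts it, which suits bright lettering on a darker, unevenly lit surface. It omits the interpolation step mentioned elsewhere in the thread, and the function names and default window are assumptions.

```python
def local_minimum(image, window=3):
    """Minimum grey level in a window x window neighbourhood of each pixel.

    `image` is a list of rows of grey levels; the window is clipped at
    the image borders.
    """
    half = window // 2
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            neighbourhood = [
                image[j][i]
                for j in range(max(0, y - half), min(height, y + half + 1))
                for i in range(max(0, x - half), min(width, x + half + 1))
            ]
            row.append(min(neighbourhood))
        result.append(row)
    return result


def remove_background(image, window=3):
    """Subtract the local-minimum background estimate from every pixel."""
    background = local_minimum(image, window)
    return [[pixel - back for pixel, back in zip(row, back_row)]
            for row, back_row in zip(image, background)]


if __name__ == "__main__":
    # Horizontal brightness ramp with one bright "stroke" pixel at (2, 2).
    demo = [[10 * x for x in range(5)] for _ in range(5)]
    demo[2][2] += 100
    for row in remove_background(demo):
        print(row)
```

The window has to be wider than the strokes being kept; otherwise the minimum inside the window lands on the stroke itself and the lettering is subtracted along with the background.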
Re: Image pre-processing for good OCR results
Jon,

I don't know if it's intended, but all your links to images report "We're sorry. The page you tried to access is not available." Without working links, nobody can advise you on your issue...

Warm regards,
Dmitry Silaev