Re: Image pre-processing for good OCR results

2011-02-23 Thread TP
On Sun, Feb 20, 2011 at 6:02 PM, Jon Andersen jande...@gmail.com wrote:
 Hi,
 My project at http://RecordAGrave.com is about recording headstones from
 graves and posting the text and images on the Net so that people can
 research their family history.  I would appreciate some advice on how to
 pre-process these headstone images to get the best results from Tesseract
 OCR.  I have thousands of 1-2 MB jpg images of headstones to process.
 Example images:
 http://freepages.genealogy.rootsweb.ancestry.com/~janderse/cemeteries/Star%20of%20David%20Memorial%20Gardens/Garden%20of%20Haifa%20-%20Raw/IMG_28215.jpg
 http://freepages.genealogy.rootsweb.ancestry.com/~janderse/cemeteries/Star%20of%20David%20Memorial%20Gardens/Garden%20of%20Haifa%20-%20Raw/IMG_28216.jpg
 http://freepages.genealogy.rootsweb.ancestry.com/~janderse/cemeteries/Star%20of%20David%20Memorial%20Gardens/Garden%20of%20Haifa%20-%20Raw/IMG_28217.jpg
 I am a software developer so I can script up pre-processing steps to prepare
 the input for Tesseract.
 Any advice on improving OCR accuracy through pre-processing steps?
 Thanks so much,

 -Jon

 --
 You received this message because you are subscribed to the Google Groups
 tesseract-ocr group.
 To post to this group, send email to tesseract-ocr@googlegroups.com.
 To unsubscribe from this group, send email to
 tesseract-ocr+unsubscr...@googlegroups.com.
 For more options, visit this group at
 http://groups.google.com/group/tesseract-ocr?hl=en.


I guess I'm a bit surprised that no one has yet mentioned the fact
that the Leptonica C Image Processing Library
(http://www.leptonica.com) is now required to build tesseract-ocr --
or soon will be... the current state of tesseract-ocr is a bit hazy.
My understanding is that eventually (not in the near future though)
tesseract-ocr will only use Leptonica PIXs as its in-memory image
representation.

A still unofficial, easier-to-read, Sphinx-generated version of the
Leptonica documentation is at
http://tpgit.github.com/UnOfficialLeptDocs/. Dan is currently
hammering away at v1.68 and it should be out soon (this week?), at
which point I'll also update my unofficial version of the
documentation.

My admittedly quick/biased opinion was that OpenCV focused on Computer
Vision and that Leptonica has more pure Image Processing routines. I
also find Leptonica's source code fairly easy to read because one of
the purposes of the library is to try to teach image processing
concepts.

In any case, if you're planning on using tesseract-ocr 3.x, then you
already must have liblept, so you might as well try it out.

-- TP




Re: Image pre-processing for good OCR results

2011-02-22 Thread Tom Morris
On Feb 20, 9:02 pm, Jon Andersen jande...@gmail.com wrote:

 My project at http://RecordAGrave.com is about recording headstones from
 graves and posting the text and images on the Net so that people can
 research their family history.  I would appreciate some advice on how to
 pre-process these headstone images to get the best results from Tesseract
 OCR.  I have thousands of 1-2 MB jpg images of headstones to process.

Post-image capture is too late for one of the most important
enhancements, namely high contrast lighting.  It's not really an issue
with stones that have the carving painted or are otherwise naturally
high contrast, but for many stones sharp oblique lighting is important
to get an image that's readable by humans, let alone OCR software.

Once you've got the best quality image capture you can manage, you'll
probably find that you need to use different image processing
pipelines for different types of stones and carving, so the first step
will be to categorize the stone and figure out which pipeline to run
it through (or run it through them all and compare the results).
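The "run it through them all and compare the results" idea can be sketched in plain Python as below. This is only an illustration: the three pipelines and the `contrast_score` function are hypothetical stand-ins. In practice each pipeline would be a real image-processing chain and the score would more likely come from the OCR output (for example a confidence value or a dictionary hit rate).

```python
from typing import Callable, Dict, List, Tuple

Image = List[List[int]]  # grayscale image: rows of 0-255 intensities


def identity(img: Image) -> Image:
    """Baseline pipeline: return an unchanged copy."""
    return [row[:] for row in img]


def invert(img: Image) -> Image:
    """Swap light and dark, e.g. for light lettering on dark stone."""
    return [[255 - p for p in row] for row in img]


def stretch_contrast(img: Image) -> Image:
    """Linearly rescale intensities to span the full 0-255 range."""
    flat = [p for row in img for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [row[:] for row in img]
    return [[(p - lo) * 255 // (hi - lo) for p in row] for row in img]


def contrast_score(img: Image) -> float:
    """Stand-in score (assumption): intensity spread of the result."""
    flat = [p for row in img for p in row]
    return float(max(flat) - min(flat))


def best_pipeline(
    img: Image,
    pipelines: Dict[str, Callable[[Image], Image]],
    score: Callable[[Image], float],
) -> Tuple[str, Image]:
    """Run every pipeline and return (name, output) of the best-scoring one."""
    results = {name: fn(img) for name, fn in pipelines.items()}
    best = max(results, key=lambda name: score(results[name]))
    return best, results[best]


PIPELINES = {"identity": identity, "invert": invert, "stretch": stretch_contrast}
```

Calling `best_pipeline(img, PIPELINES, contrast_score)` returns the name of the winning pipeline together with its output, so the choice can be logged per stone type.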

In addition to image processing, you may also be able to improve
results by making use of the fact that the vocabulary and layout of
the text is much more constrained than free text.
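One way to exploit the constrained vocabulary is to snap each OCR token to the closest word in a small word list. The sketch below uses Python's standard `difflib`. The word list and the 0.75 cutoff are assumptions for illustration, not values from this thread. Tokens with no close match (names, years) are left untouched.

```python
import difflib
from typing import List

# Hypothetical vocabulary; a real list would be built from actual inscriptions.
VOCABULARY: List[str] = [
    "BELOVED", "HUSBAND", "WIFE", "FATHER", "MOTHER", "BORN", "DIED",
    "JANUARY", "FEBRUARY", "MARCH", "APRIL", "MAY", "JUNE", "JULY",
    "AUGUST", "SEPTEMBER", "OCTOBER", "NOVEMBER", "DECEMBER",
]


def correct_token(token: str, cutoff: float = 0.75) -> str:
    """Return the closest vocabulary word, or the token itself if none is close."""
    matches = difflib.get_close_matches(token.upper(), VOCABULARY, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_line(line: str) -> str:
    """Apply correct_token to every whitespace-separated token in a line."""
    return " ".join(correct_token(tok) for tok in line.split())
```

For example, `correct_line("BEL0VED M0THER 1923")` repairs the two words where OCR confused the letter O with a zero and keeps the year as it is.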

It'll be interesting to see what kind of results you get.  I suspect
it's going to be a fairly challenging project for the general case,
but you may be able to pick off the low-hanging fruit and gradually
expand the types of stones you can handle.

Tom




Re: Image pre-processing for good OCR results

2011-02-22 Thread Jon Andersen
Vicky,

I may be able to convert your local-minima code to OpenCV code; can you send
me the result files as well as the filter?

I wrote some Python code that uses OpenCV to crop the headstone images to
show just the stone.  It's not perfect, but it works OK.  The Hough algorithm
and the other corner-detection algorithms weren't working at all for me.  So
I just thresholded based on the average saturation value, row-by-row,
column-by-column, to find a rectangle that was saturated enough.  Then crop
to that rectangle.  Overly simple and dumb; however, it does somewhat work,
whereas the other algorithms just gave me insane corners and didn't detect
the headstone at all.
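A minimal sketch of the row-by-row, column-by-column saturation crop described above, in plain Python rather than OpenCV. It assumes the saturation channel is already available as a 2D list (for instance the S channel after an HSV conversion) and that the threshold is the overall mean saturation. Both are assumptions; the original script is not shown in this thread.

```python
from typing import List, Optional, Tuple

Channel = List[List[float]]  # one value per pixel, rows of equal length


def saturated_bounds(sat: Channel) -> Optional[Tuple[int, int, int, int]]:
    """Return inclusive (top, bottom, left, right) of rows and columns whose
    mean saturation exceeds the overall mean, or None if nothing qualifies."""
    n_rows, n_cols = len(sat), len(sat[0])
    row_means = [sum(row) / n_cols for row in sat]
    col_means = [sum(sat[y][x] for y in range(n_rows)) / n_rows for x in range(n_cols)]
    threshold = sum(row_means) / n_rows  # overall mean saturation
    rows = [i for i, m in enumerate(row_means) if m > threshold]
    cols = [i for i, m in enumerate(col_means) if m > threshold]
    if not rows or not cols:
        return None
    return rows[0], rows[-1], cols[0], cols[-1]


def crop(img: List[list], bounds: Tuple[int, int, int, int]) -> List[list]:
    """Cut the rectangle given by inclusive bounds out of a 2D image."""
    top, bottom, left, right = bounds
    return [row[left:right + 1] for row in img[top:bottom + 1]]
```

The rectangle spans the first to the last qualifying row and column, so a few stray saturated rows far from the stone would widen the crop. That matches the "overly simple" caveat above.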

Reference images:
http://freepages.genealogy.rootsweb.ancestry.com/~janderse/cemeteries/Star%20of%20David%20Memorial%20Gardens/Garden%20of%20Haifa/

Thanks!!

-Jon Andersen
Software engineer
Citrix Systems, Inc
954-973-4908 (home)


On Mon, Feb 21, 2011 at 11:47 PM, Vicky Budhiraja vicky.vi...@gmail.com wrote:

 Hi Jon,

 The code I have written is in MATLAB. Will you be able to convert it into
 OpenCV code? Lemme know.

 In OpenCV, simple thresholding should work if you apply it. My method
 (local-minima) is a little more complicated (and more accurate) than simple
 thresholding, so it is hard to implement in C++ because of the interpolation
 step. I think OpenCV can do this, but we need to take a closer look at this
 step.

 Best Regards,
 Vicky


 -Original Message-
 From: Jon Andersen [mailto:jande...@gmail.com]
 Sent: Monday, February 21, 2011 23:42
 To: Vicky Budhiraja
 Subject: Re: Image pre-processing for good OCR results

 Vicky,

 Thank you so much for responding!  I appreciate your help with this
 project.

 I have taken thousands of photos of headstones, and am trying to use
 Tesseract on them.  I will make the results available through
 findagrave.com, so that people can search for their relatives.

 Here is a whole directory of sample images:

 http://freepages.genealogy.rootsweb.ancestry.com/~janderse/cemeteries/Star%20of%20David%20Memorial%20Gardens/Garden%20of%20Haifa/

 Could you send me the code or results that you found?  I am trying to
 use OpenCV to do the image pre-processing.

 Thanks!!!

 -Jon





RE: Image pre-processing for good OCR results

2011-02-22 Thread Cong Nguyen
Dear Jon,

 

To begin the analysis, I also tried to detect lines and corners, but the
results were not good. I think this is because the images are low contrast.

 

Please try analyzing some line profiles of the data:

ROI-left-profile:
https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576706091073985362

ROI-top-profile:
https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576706094761082706

ROI-right-profile:
https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576706102033630978

ROI-bottom-profile:
https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576706106389606898
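As a rough illustration of what a line profile can give you, the sketch below samples intensities along one row or column and reports where the largest jump occurs, which is a crude guess at an ROI edge. The function names are hypothetical and this is only one possible reading of the profiles linked above, not the method used to produce them.

```python
from typing import List

Image = List[List[int]]  # grayscale image: rows of 0-255 intensities


def row_profile(img: Image, y: int) -> List[int]:
    """Intensities along horizontal line y (left to right)."""
    return list(img[y])


def column_profile(img: Image, x: int) -> List[int]:
    """Intensities along vertical line x (top to bottom)."""
    return [row[x] for row in img]


def strongest_edge(profile: List[int]) -> int:
    """Index i such that the jump between profile[i] and profile[i + 1]
    is the largest absolute change along the profile."""
    diffs = [abs(b - a) for a, b in zip(profile, profile[1:])]
    return max(range(len(diffs)), key=diffs.__getitem__)
```

Scanning several rows for the left and right edges, and several columns for the top and bottom edges, then taking a median of the indices, would make the estimate less sensitive to a single noisy line.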

 

After ROI detection, you may need to align the image.

My solution for this step is:

-  Detect all lines (Hough transform approach), then keep only the
lines whose slopes are close to horizontal.

-  Estimate the base slope as the mean of those slopes.

-  Align the image.

Here are the detected lines:

https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576709473940745778
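The slope-filtering and mean-slope steps above can be sketched as follows. The sketch assumes the line detector has already produced segments as `(x1, y1, x2, y2)` tuples (OpenCV's `cv2.HoughLinesP` returns segments in that form). The `max_abs_slope` tolerance is a made-up default, not a value from this thread.

```python
import math
from typing import Iterable, Tuple

Segment = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def estimate_skew_degrees(segments: Iterable[Segment], max_abs_slope: float = 0.25) -> float:
    """Mean slope of the near-horizontal segments, converted to degrees.
    Returns 0.0 when no segment is close enough to horizontal."""
    slopes = []
    for x1, y1, x2, y2 in segments:
        dx, dy = x2 - x1, y2 - y1
        if dx == 0:
            continue  # vertical segment: slope undefined, certainly not horizontal
        slope = dy / dx
        if abs(slope) <= max_abs_slope:
            slopes.append(slope)
    if not slopes:
        return 0.0
    mean_slope = sum(slopes) / len(slopes)
    return math.degrees(math.atan(mean_slope))
```

The image would then be rotated by the negative of the returned angle, for example with `cv2.getRotationMatrix2D` and `cv2.warpAffine`. Check the sign on a test image first, since image coordinates have the y axis pointing down.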

 

Hope it's helpful to you!

 

Good luck,

Cong.

 




Re: Image pre-processing for good OCR results

2011-02-21 Thread Jon Andersen
Whoops, sorry - links were broken for a bit.  I just fixed the image links, 
they should work now.

Thanks!!

-Jon




Re: Image pre-processing for good OCR results

2011-02-21 Thread Kip Hughes
Hi Vicky,

I have an interest in theology and just wanted to know which god(s) you are
"God fearing" of. In my experience, the phrase "God fearing" has been used
predominantly by Christians. I checked your LinkedIn profile and
confirmed you are from India.

Less than 3% of Indians are Christians -- so, based on this statistic, I
would guess you are not a Christian. Over 80% of Indians are Hindus -- and
if I had to make a guess about any Indian's religion, I would go with that
one. Are you a Hindu? Hinduism is a polytheistic religion, isn't it? Why would
you only be a "God fearing" person rather than a "gods fearing" person?

Finally, is there some significance that headstones have in your religion
(whatever it may be) that made you unable to ignore Jon's email?

Hope you don't mind the questions. They are really just due to my interest
in world religions and world views.

Thanks,
KIP






RE: Image pre-processing for good OCR results

2011-02-21 Thread Cong Nguyen
Dear Jon,

 

Try analyzing the images with the following preprocessing steps:

Step1: Detect ROI
https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576366756516993234

Step2: Apply low-pass FFT filter, with parameters:
- intensity threshold is 130
- FFT cutoff: 15%
https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576366759922523650

Step3: Scale image with scale factor
https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576366756371708834

Step4: Try recognition using Tesseract or another OCR engine
https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576366764338605922

Step5: Post-processing is required
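To make the low-pass filter step concrete, here is a small sketch on a single row of pixels using a naive DFT from the standard library. Two interpretations are assumptions, since the message does not spell them out: "cutoff 15%" is read as keeping only the lowest 15% of the frequency range, and "intensity threshold 130" is read as a binarization applied after filtering. For real images a 2D FFT (for example `numpy.fft.fft2`) would be the practical choice.

```python
import cmath
from typing import List


def lowpass(signal: List[float], cutoff: float = 0.15) -> List[float]:
    """Zero every DFT bin above cutoff * Nyquist and transform back."""
    n = len(signal)
    # Forward DFT (O(n^2), fine for a short demonstration row).
    spectrum = [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    max_k = int(cutoff * (n // 2))  # highest frequency index that is kept
    for k in range(n):
        if min(k, n - k) > max_k:  # bins k and n - k carry the same frequency
            spectrum[k] = 0
    # Inverse DFT; the imaginary parts are only rounding noise.
    return [
        (sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def binarize(signal: List[float], threshold: float = 130) -> List[int]:
    """1 where the filtered intensity reaches the threshold, else 0."""
    return [1 if v >= threshold else 0 for v in signal]
```

A row that alternates between 0 and 255 on every pixel is pure high-frequency content, so the filter flattens it to its mean. A uniformly bright row passes through unchanged.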

 

Good luck,

Cong.

 





RE: Image pre-processing for good OCR results

2011-02-20 Thread Vicky Budhiraja
Hi Jon,

 

Like every morning, I checked my email and saw those headstone images from
graves. I am a God-fearing person, so I was not able to ignore your email.

 

Regarding the preprocessing step, I suggest applying a local-minima method for
background removal. However, you may need to adjust the window size in
order to achieve the best results. I did some experiments with MATLAB
code and got some good results. Testing on a larger sample set may
improve this step.
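The MATLAB code is not included in this thread, so the following is only one plausible reading of a local-minima background removal with a window size and an interpolation step. It takes the minimum of each window along a row, linearly interpolates between the window centres to get a smooth background, and subtracts it. It assumes the lettering is brighter than the stone. For dark lettering the same idea would use local maxima instead.

```python
from typing import List


def local_minima_background(row: List[float], window: int) -> List[float]:
    """Background estimate: per-window minima, linearly interpolated
    between window centres and held constant beyond the outer centres."""
    n = len(row)
    centers, minima = [], []
    for start in range(0, n, window):
        block = row[start:start + window]
        centers.append(start + (len(block) - 1) / 2)
        minima.append(min(block))
    background = []
    for x in range(n):
        if x <= centers[0]:
            background.append(float(minima[0]))
        elif x >= centers[-1]:
            background.append(float(minima[-1]))
        else:
            # Last centre at or left of x; the next centre is strictly to the right.
            j = max(i for i, c in enumerate(centers) if c <= x)
            w = (x - centers[j]) / (centers[j + 1] - centers[j])
            background.append(minima[j] * (1 - w) + minima[j + 1] * w)
    return background


def remove_background(row: List[float], window: int) -> List[float]:
    """Subtract the estimated background, clipping negative values to 0."""
    bg = local_minima_background(row, window)
    return [max(0.0, v - b) for v, b in zip(row, bg)]
```

The window has to be wider than a character stroke, otherwise the minimum lands inside the lettering and the text is removed along with the background. That is presumably why the window size needs tuning.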

 

Please tell me what project you are working on; maybe I will be able to
contribute better. Just lemme know if you need any type of help!

 

Best Regards,

Vicky

 

 

 





Re: Image pre-processing for good OCR results

2011-02-20 Thread Dmitry Silaev
Jon,

I don't know if it's intended, but all your links to images report
"We're sorry. The page you tried to access is not available." As things
stand, nothing can be advised on your issue...

Warm regards,
Dmitry Silaev






