Re: [Ankur-core] Bangla OCR progress

2009-05-09 Thread Deepayan Sarkar
On 5/9/09, Debayan Banerjee debaya...@gmail.com wrote:
> 2009/5/9 Deepayan Sarkar deepayan.sar...@gmail.com:
>
>> Debayan,
>>
>> I have been meaning to ask you: is your character segmentation
>> algorithm in a form that could be easily separated out?
>
> The segmentation algorithm can be found here
> (http://tesseractindic.googlecode.com/files/clipmatra_pseudocode.pdf)

But this is your original algorithm which segmented গ etc (at least
for some fonts), isn't it? I thought you had an improved algorithm
which works around some of those problems (or maybe I misunderstood
your mail).

>> If it could be
>> easily done, I would like to try it out in BOCRA. Unfortunately, I
>> don't think I will have enough time in the near future to figure out
>> how ocropus/tesseract does things.
>
> Kindly read the paragraph in this post
> (http://hacking-tesseract.blogspot.com/2009/05/bengali-stats.html)
> regarding reducing the number of character classes to be trained. I
> want to know if this is possible using BOCRA.

No, it's not. From the beginning, my design for BOCRA was based on the
idea of on-the-fly training, because that's the only approach I
thought was feasible given the combination of non-standard fonts and
so many potential conjuncts. In most realistic examples, the number of
conjuncts is actually quite limited. After accounting for the most
common ones, the frequency of the rest is probably lower than the normal
OCR error rate anyway.
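
The point about conjunct frequencies can be checked on any Unicode
corpus. Below is a minimal Python sketch, assuming a conjunct can be
approximated as two or more consonants joined by hasanta (U+09CD). The
names conjunct_counts and coverage are illustrative only and are not
part of BOCRA or Tesseract.

```python
import re
from collections import Counter

HASANTA = "\u09cd"
# Bengali consonants: the main range U+0995..U+09B9 plus U+09DC, U+09DD, U+09DF.
CONSONANT = "[\u0995-\u09b9\u09dc\u09dd\u09df]"
# A consonant may carry a nukta (U+09BC) when the text is stored decomposed.
UNIT = CONSONANT + "\u09bc?"
# Assumption: a conjunct is two or more consonants joined by hasanta.
CONJUNCT_RE = re.compile(f"{UNIT}(?:{HASANTA}{UNIT})+")


def conjunct_counts(text):
    """Count every consonant conjunct found in text."""
    return Counter(CONJUNCT_RE.findall(text))


def coverage(counts, top_n):
    """Fraction of all conjunct occurrences covered by the top_n most common."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for _, n in counts.most_common(top_n)) / total


if __name__ == "__main__":
    sample = "পঞ্জাব সিন্ধু গুজরাট মরাঠা দ্রাবিড় উত্কল বঙ্গ বিন্ধ্য হিমাচল যমুনা গঙ্গা"
    counts = conjunct_counts(sample)
    for conjunct, n in counts.most_common():
        print(conjunct, n)
    print("share covered by the single most common conjunct:", coverage(counts, 1))
```

Running coverage() with increasing top_n on a large corpus would show
how quickly the common conjuncts account for most occurrences.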

-Deepayan

--
The NEW KODAK i700 Series Scanners deliver under ANY circumstances! Your
production scanning environment may not be a perfect world - but thanks to
Kodak, there's a perfect scanner to get the job done! With the NEW KODAK i700
Series Scanner you'll get full speed at 300 dpi even with all image 
processing features enabled. http://p.sf.net/sfu/kodak-com
___
Bengalinux-core mailing list
Bengalinux-core@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bengalinux-core


Re: [Ankur-core] Bangla OCR progress

2009-05-08 Thread Debayan Banerjee
2009/4/20 srhaque srha...@theiet.org:
> BTW, if you still need my test file with conjunct samples, here it is...

Thank you very much. They have proved *very helpful* :)
I prepared this
(http://hacking-tesseract.blogspot.com/2009/05/bengali-stats.html)
post with the help of your document.


-- 
Regards,
Debayan Banerjee

Support Free Software
http://deeproot.in



Re: [Ankur-core] Bangla OCR progress

2009-05-08 Thread srhaque
On Friday 08 May 2009, Debayan Banerjee wrote:
> 2009/4/20 srhaque srha...@theiet.org:
>> BTW, if you still need my test file with conjunct samples, here it is...
>
> Thank you very much. They have proved *very helpful* :)
> I prepared this
> (http://hacking-tesseract.blogspot.com/2009/05/bengali-stats.html)
> post with the help of your document.

Cool. If it is of any use, then note that my Raga font also has glyphs for all
the conjuncts (though I've not done anything with the advanced tables to refine
the font generally).

I've been thinking about OCR for a little while too, and am doing some little 
experiments here and there based on trying to apply brute force to simple 
algorithms for deskewing/text-block extraction/segmentation. However, I'm a 
bit stuck for inspiration on that front for now, so if there is anything I can 
do to help *you*, please let me know.

Thanks, Shaheed



Re: [Ankur-core] Bangla OCR progress

2009-05-08 Thread Deepayan Sarkar
Debayan,

I have been meaning to ask you: is your character segmentation
algorithm in a form that could be easily separated out? If it could be
easily done, I would like to try it out in BOCRA. Unfortunately, I
don't think I will have enough time in the near future to figure out
how ocropus/tesseract does things.

-Deepayan



Re: [Ankur-core] Bangla OCR progress

2009-04-20 Thread srhaque
BTW, if you still need my test file with conjunct samples, here it is...


Copyright (c) 2007, 2008 S.R.Haque (srha...@theiet.org).
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.2
or any later version published by the Free Software Foundation;
with no Invariant Sections, no Front-Cover Texts, and no Back-Cover
Texts.  A copy of the license is included in the section entitled GNU
Free Documentation License.

CHANGE LOG
2007-04-16  S.R.Haque   First released.
2008-01-06  S.R.Haque   Adding missing conjuncts of ঈ.

SAMPLE TEXT
জনগণমন-অধিনায়ক জয় হে ভারতভাগ্যবিধাতা!
পঞ্জাব সিন্ধু গুজরাট মরাঠা দ্রাবিড় উত্কল বঙ্গ
বিন্ধ্য হিমাচল যমুনা গঙ্গা উচ্ছলজলধিতরঙ্গ
তব শুভ নামে জাগে, তব শুভ আশিস মাগে,
গাহে তব জয়গাথা।
জনগণমঙ্গলদায়ক জয় হে ভারতভাগ্যবিধাতা!
জয় হে, জয় হে, জয় হে, জয় জয় জয়, জয় হে॥
জনগণমন-অধিনায়ক জয় হে ভারতভাগ্যবিধাতা!

UNICODE 5.0 BENGALI CHARACTER CODES
098x ঁ ং ঃ অ আ ই ঈ উ ঊ ঋ ঌ এ
099x ঐ ও ঔ ক খ গ ঘ ঙ চ ছ জ ঝ ঞ ট
09Ax ঠ ড ঢ ণ ত থ দ ধ ন প ফ ব ভ ম য
09Bx র ল শ ষ স হ ় ঽ া ি
09Cx ী ু ূ ৃ ৄ ে ৈ ো ৌ ্ ৎ
09Dx ৗ ড় ঢ় য়
09Ex ৠ ৡ ০ ১ ২ ৩ ৪ ৫ ৬ ৭ ৮ ৯
09Fx ৰ ৱ ৲ ৳ ৴ ৵ ৶ ৷ ৸ ৹ ৺

ASCII
0020   ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
0040 @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _
0060 ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~

PUNCTUATION AND SPECIAL SYMBOLS
ৢ ৣ ‘ ’ “ ”

VOWEL CONJUNCTS
Conjuncts of আ (া)
কা  খা  গা  ঘা  ঙা  চা  ছা  জা  ঝা  ঞা
টা  ঠা  ডা  ঢা  ণা  তা  থা  দা  ধা  না
পা  ফা  বা  ভা  মা  যা  রা  লা  শা  ষা
সা  হা  ড়া  ঢ়া  য়া  ৰা  ৱা
Conjuncts of ই (ি)
কি  খি  গি  ঘি  ঙি  চি  ছি  জি  ঝি  ঞি
টি  ঠি  ডি  ঢি  ণি  তি  থি  দি  ধি  নি
পি  ফি  বি  ভি  মি  যি  রি  লি  শি  ষি
সি  হি  ড়ি  ঢ়ি  য়ি  ৰি  ৱি
Conjuncts of ঈ (ী)
কী  খী  গী  ঘী  ঙী  চী  ছী  জী  ঝী  ঞী
টী  ঠী  ডী  ঢী  ণী  তী  থী  দী  ধী  নী
পী  ফী  বী  ভী  মী  যী  রী  লী  শী  ষী
সী  হী  ড়ী  ঢ়ী  য়ী  ৰী  ৱী
Conjuncts of উ (ু)
কু  খু  গু  ঘু  ঙু  চু  ছু  জু  ঝু  ঞু
টু  ঠু  ডু  ঢু  ণু  তু  থু  দু  ধু  নু
পু  ফু  বু  ভু  মু  যু  রু  লু  শু  ষু
সু  হু  ড়ু  ঢ়ু  য়ু  ৰু  ৱু
Conjuncts of ঊ (ূ)
কূ  খূ  গূ  ঘূ  ঙূ  চূ  ছূ  জূ  ঝূ  ঞূ
টূ  ঠূ  ডূ  ঢূ  ণূ  তূ  থূ  দূ  ধূ  নূ
পূ  ফূ  বূ  ভূ  মূ  যূ  রূ  লূ  শূ  ষূ
সূ  হূ  ড়ূ  ঢ়ূ  য়ূ  ৰূ  ৱূ
Conjuncts of ঋ (ৃ)
কৃ  খৃ  গৃ  ঘৃ  ঙৃ  চৃ  ছৃ  জৃ  ঝৃ  ঞৃ
টৃ  ঠৃ  ডৃ  ঢৃ  ণৃ  তৃ  থৃ  দৃ  ধৃ  নৃ
পৃ  ফৃ  বৃ  ভৃ  মৃ  যৃ  রৃ  লৃ  শৃ  ষৃ
সৃ  হৃ  ড়ৃ  ঢ়ৃ  য়ৃ  ৰৃ  ৱৃ
Conjuncts of ৠ (ৄ)
কৄ  খৄ  গৄ  ঘৄ  ঙৄ  চৄ  ছৄ  জৄ  ঝৄ  ঞৄ
টৄ  ঠৄ  ডৄ  ঢৄ  ণৄ  তৄ  থৄ  দৄ  ধৄ  নৄ
পৄ  ফৄ  বৄ  ভৄ  মৄ  যৄ  রৄ  লৄ  শৄ  ষৄ
সৄ  হৄ  ড়ৄ  ঢ়ৄ  য়ৄ  ৰৄ  ৱৄ
Conjuncts of এ (ে)
কে  খে  গে  ঘে  ঙে  চে  ছে  জে  ঝে  ঞে
টে  ঠে  ডে  ঢে  ণে  তে  থে  দে  ধে  নে
পে  ফে  বে  ভে  মে  যে  রে  লে  শে  ষে
সে  হে  ড়ে  ঢ়ে  য়ে  ৰে  ৱে
Conjuncts of ঐ (ৈ)
কৈ  খৈ  গৈ  ঘৈ  ঙৈ  চৈ  ছৈ  জৈ  ঝৈ  ঞৈ
টৈ  ঠৈ  ডৈ  ঢৈ  ণৈ  তৈ  থৈ  দৈ  ধৈ  নৈ
পৈ  ফৈ  বৈ  ভৈ  মৈ  যৈ  রৈ  লৈ  শৈ  ষৈ
সৈ  হৈ  ড়ৈ  ঢ়ৈ  য়ৈ  ৰৈ  ৱৈ
Conjuncts of ও (ো)
কো  খো  গো  ঘো  ঙো  চো  ছো  জো  ঝো  ঞো
টো  ঠো  ডো  ঢো  ণো  তো  থো  দো  ধো  নো
পো  ফো  বো  ভো  মো  যো  রো  লো  শো  ষো
সো  হো  ড়ো  ঢ়ো  য়ো  ৰো  ৱো
Conjuncts of ঔ (ৌ)
কৌ  খৌ  গৌ  ঘৌ  ঙৌ  চৌ  ছৌ  জৌ  ঝৌ  ঞৌ
টৌ  ঠৌ  ডৌ  ঢৌ  ণৌ  তৌ  থৌ  দৌ  ধৌ  নৌ
পৌ  ফৌ  বৌ  ভৌ  মৌ  যৌ  রৌ  লৌ  শৌ  ষৌ
সৌ  হৌ  ড়ৌ  ঢ়ৌ  য়ৌ  ৰৌ  ৱৌ
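
The vowel-sign tables above are regular enough to be generated rather
than typed. Below is a minimal Python sketch, assuming the precomposed
code points U+09DC, U+09DD and U+09DF for ড়, ঢ় and য় (the test file
may store them decomposed). The helper names are hypothetical.

```python
import unicodedata

# The eleven dependent vowel signs used in the tables above.
VOWEL_SIGNS = [chr(cp) for cp in (0x09BE, 0x09BF, 0x09C0, 0x09C1, 0x09C2,
                                  0x09C3, 0x09C4, 0x09C7, 0x09C8, 0x09CB,
                                  0x09CC)]


def bengali_consonants():
    """Consonants in the order used by the tables: U+0995..U+09B9, then extras."""
    # unicodedata.name() returns the default "" for unassigned code points,
    # which filters the gaps in the range (for example U+09A9 and U+09B1).
    base = [chr(cp) for cp in range(0x0995, 0x09BA)
            if unicodedata.name(chr(cp), "")]
    # Assumption: precomposed RRA, RHA, YYA, followed by the two Assamese letters.
    extra = ["\u09dc", "\u09dd", "\u09df", "\u09f0", "\u09f1"]
    return base + extra


def vowel_sign_rows(sign, per_row=10):
    """Rows of consonant+sign cells, per_row cells each, separated by two spaces."""
    cells = [c + sign for c in bengali_consonants()]
    return ["  ".join(cells[i:i + per_row])
            for i in range(0, len(cells), per_row)]


if __name__ == "__main__":
    for sign in VOWEL_SIGNS:
        print("Vowel sign U+%04X" % ord(sign))
        for row in vowel_sign_rows(sign):
            print(row)
```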

CONSONANT CONJUNCTS
With Hasanta (্)
ক্  খ্  গ্  ঘ্  ঙ্  চ্  ছ্  জ্  ঝ্  ঞ্
ট্  ঠ্  ড্  ঢ্  ণ্  ত্  থ্  দ্  ধ্  ন্
প্  ফ্  ব্  ভ্  ম্  য্  র্  ল্  শ্  ষ্
স্  হ্  ড়্  ঢ়্  য়্  ৰ্ 

Re: [Ankur-core] Bangla OCR progress

2009-04-18 Thread Debayan Banerjee
I take the liberty of top-posting since I copied the mail's contents
from the archives, and bottom-posting would require messing with the text
below too much. In reply to this particular line:
> It takes the old matra removal approach, and he's
> facing the same problems I did (notice in his first example that গ is
> segmented into 2 parts, and শু is not).

Kindly see 
http://picasaweb.google.com/debayanin/TesseractIndicOCR#5325782929614608690.

Below is the original conversation.

On 7/2/08, Golam Mortuza Hossain [EMAIL PROTECTED] wrote:
> On Wed, Jul 2, 2008 at 9:32 AM, Sayamindu Dasgupta [EMAIL PROTECTED]
>
>> This guy seems to be doing some interesting progress for a Bangla OCR
>> - or more precisely, enabling Bangla in Tesseract.
>> http://debayanin.googlepages.com/hackingtesseract

Cool. I had some interaction with the tesseract/ocropus folks, and it
sounded like a good base. It's nice that someone's actually doing
something with it. It takes the old matra removal approach, and he's
facing the same problems I did (notice in his first example that গ is
segmented into 2 parts, and শু is not). On the other hand, having
something that works even partly is a good start.

> Yes, it looks definitely interesting.
>
>> Looks like he needs some more training data - can we provide him with some?
>
> If I remember correctly, there was a sample file for testing completeness
> of Bengali fonts. Since it has all letters and conjuncts typed-in, the
> file might be useful for training Tesseract as well.
>
> Deepayan should be able to give some input here. He has working experience
> with R and may have some training sample as well.

Well, we have a bunch of unicode documents. For some of them, I have
print versions too, and can scan them if needed. A simpler approach
would be to render them using different fonts and take screenshots.

Apparently he also needs some box-files, whatever they are, which need
to be produced using tesseract. I haven't installed tesseract yet, and
will try, but let me know if anyone else manages.

-Deepayan




-- 
Be Intelligent, Use GNU/Linux

http://debayanin.googlepages.com/
http://debayan.wordpress.com
http://lug.nitdgp.ac.in



Re: [Ankur-core] Bangla OCR progress

2009-04-18 Thread Salahuddin Pasha

Dear all,

I have been working on OCR for my university. I took most of the ideas
from bocra.sourceforge.net.

It is written in C++ using the GraphicsMagick library. I would welcome
any suggestions about matching the alphabet.


Here is my progress
http://picasaweb.google.com/salahuddin66/OCR#


regards
salahuddin

salahuddin66.blogspot.com





Re: [Ankur-core] Bangla OCR progress

2008-07-02 Thread Sankarshan (সঙ্কর্ষণ)
Sayamindu Dasgupta wrote:

| This guy seems to be doing some interesting progress for a Bangla OCR
| - or more precisely, enabling Bangla in Tesseract.
| http://debayanin.googlepages.com/hackingtesseract
| Looks like he needs some more training data - can we provide him with
| some?

As an aside, he is working with the Swatantra Malayalam Computing group
to fix OCR issues in ml_IN too.

And, I'd request someone to validate how much progress he is making in
terms of attaining accuracy.



--



You see things; and you say 'Why?';
But I dream things that never were;
and I say 'Why not?' - George Bernard Shaw
www.linkedin.com/in/sankarshan





Re: [Ankur-core] Bangla OCR progress

2008-07-02 Thread Golam Mortuza Hossain
On Wed, Jul 2, 2008 at 9:32 AM, Sayamindu Dasgupta [EMAIL PROTECTED]
> This guy seems to be doing some interesting progress for a Bangla OCR
> - or more precisely, enabling Bangla in Tesseract.
> http://debayanin.googlepages.com/hackingtesseract

Yes, it looks definitely interesting.

> Looks like he needs some more training data - can we provide him with some?

If I remember correctly, there was a sample file for testing completeness
of Bengali fonts. Since it has all letters and conjuncts typed in, the
file might be useful for training Tesseract as well.

Deepayan should be able to give some input here. He has working experience
with R and may have some training sample as well.

Cheers,
Golam

--
http://gravity.psu.edu/~hossain/



Re: [Ankur-core] Bangla OCR progress

2008-07-02 Thread Deepayan Sarkar
On 7/2/08, Golam Mortuza Hossain [EMAIL PROTECTED] wrote:
> On Wed, Jul 2, 2008 at 9:32 AM, Sayamindu Dasgupta [EMAIL PROTECTED]
>
>> This guy seems to be doing some interesting progress for a Bangla OCR
>> - or more precisely, enabling Bangla in Tesseract.
>> http://debayanin.googlepages.com/hackingtesseract

Cool. I had some interaction with the tesseract/ocropus folks, and it
sounded like a good base. It's nice that someone's actually doing
something with it. It takes the old matra removal approach, and he's
facing the same problems I did (notice in his first example that গ is
segmented into 2 parts, and শু is not). On the other hand, having
something that works even partly is a good start.
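
For readers unfamiliar with matra removal, here is a minimal Python
sketch of the general idea: find the headline row by its ink count,
erase it, and cut the word at empty columns. It is not the Tesseract
patch or the BOCRA code; the toy bitmap and the function names are
invented for illustration. The first glyph in the bitmap has two
strokes joined only by the matra, so it is split in two, which is the
same failure as with গ.

```python
def to_image(lines):
    """Convert rows of '0'/'1' characters to a list of integer rows."""
    return [[int(ch) for ch in line] for line in lines]


def find_matra_row(img):
    """Index of the row with the most ink, taken to be the matra."""
    sums = [sum(row) for row in img]
    return max(range(len(img)), key=lambda r: sums[r])


def remove_matra(img, thickness=1):
    """Return a copy of img with the matra rows blanked out."""
    top = find_matra_row(img)
    return [[0] * len(row) if top <= r < top + thickness else list(row)
            for r, row in enumerate(img)]


def split_columns(img):
    """Half-open (start, end) column spans of ink separated by empty columns."""
    width = len(img[0])
    has_ink = [any(row[c] for row in img) for c in range(width)]
    spans, start = [], None
    for c, ink in enumerate(has_ink):
        if ink and start is None:
            start = c
        elif not ink and start is not None:
            spans.append((start, c))
            start = None
    if start is not None:
        spans.append((start, width))
    return spans


# Toy word: a matra on row 0 over two glyphs (columns 0-3 and 6-8).
WORD = [
    "1111111111",
    "1001001110",
    "1001001010",
    "1001001110",
]

if __name__ == "__main__":
    img = to_image(WORD)
    print(split_columns(img))                # one span while the matra is present
    print(split_columns(remove_matra(img)))  # three spans for two glyphs
```

The opposite failure (শু not being separated) arises when neighbouring
glyphs still touch below the matra, so no empty column appears between
them.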

> Yes, it looks definitely interesting.
>
>> Looks like he needs some more training data - can we provide him with some?
>
> If I remember correctly, there was a sample file for testing completeness
> of Bengali fonts. Since it has all letters and conjuncts typed-in, the
> file might be useful for training Tesseract as well.
>
> Deepayan should be able to give some input here. He has working experience
> with R and may have some training sample as well.

Well, we have a bunch of unicode documents. For some of them, I have
print versions too, and can scan them if needed. A simpler approach
would be to render them using different fonts and take screenshots.

Apparently he also needs some box-files, whatever they are, which need
to be produced using tesseract. I haven't installed tesseract yet, and
will try, but let me know if anyone else manages.

-Deepayan