Hi, I think I responded to this already in part. Let me explain a bit more, though.

OCRopus assumes correctly thresholded 300 dpi documents. If you give it images at substantially lower resolution (e.g., screenshots), it's going to output junk. If the document threshold is incorrect, it's going to output junk as well. Also, the text/image segmentation code isn't integrated into the release version yet, so if you give it documents with images that look a bit like text when binarized (but aren't), it's going to output junk for those image regions (in addition to the text).

We've developed tools in the group to improve many of those steps, including text/image segmentation and better thresholding. We couldn't integrate those into OCRopus in the past because the tools were written in Python and OCRopus was written in C++ and Lua. However, we have largely completed a Python frontend to OCRopus now (in 0.4.4), and with that, we can start integrating some of these smarter document image analysis techniques.

For now, the new recognition commands (ocropy/ocropus-binarize, ocropy/ocropus-pseg, ocropy/ocropus-linerec) will give you more diagnostic information about threshold and document resolution and warn you if there is something iffy about your images. They also give you a better preview of what's going on inside the recognizer. This is new code, and I'll be documenting it over the next few weeks.

Tom

On Mar 12, 12:56 pm, "Thomas L. Packer" <[email protected]> wrote:
> I'm having similar issues and hope you get an answer to this question.
>
> Thomas L. Packer
> ~~~~~~~~~~~~~~~~~~~~
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of groupmeister
> Sent: Tuesday, March 09, 2010 9:30 PM
> To: ocropus
> Subject: introduction and request for help getting up and running
>
> Howdy,
>
> I posted a message a few days earlier but haven't seen it show up,
> asking for some help getting up and running.
> I notice now that there
> is a note about posting an introduction of yourself first before
> posting to the group. Maybe that's why it hasn't shown up. Sorry I
> didnt notice that before, I've been using usenet for 21 years and am
> not used to things working that way. Moderated groups have existed
> forever, but I hadn't come across one that requires a personal
> introduction first. I'm happy to do that though.
>
> I'm a physician, and used to be a computer programmer. I have a
> medical services software company that may have a use for OCR in
> medical records management. We are interested in OCRopus. I have a
> small development team I lead, I want to try out OCRopus myself
> personally and see what it's capable of. If it looks good enough to
> help us, I will start some projects on our end to make use of it,
> which would almost certainly lead to code contributions from us. We
> have made significant contributions to a number of other open source
> projects already in the course of our work.
>
> The developer I was going to assign this to looked into things and he
> said he thought the project was in an unstable alpha phase and was not
> likely to be useful to us. Looking through things myself, I don't
> think that's true. Also I see that the next release (0.5) will be
> easier to integrate into other software products. Anyway, i want to
> see for myself if this can be helpful to us or not.
>
> At any rate, I have been trying to do a proof of concept install of
> OCRopus to evaluate what it can do, and I can't get it to produce non-
> garbage output. I believe I have followed the step-by-step install
> instructions for ubuntu properly. I'm wondering if there's something
> I'm missing - do I need to train the system for a while before using
> it? It seems to come with pre-formed models. Are there settings I can
> tweak for processor speed, etc? Maybe the fact that I'm running it in
> a VM affects things.
>
> If anyone could help me get up and running, I'd appreciate it. The
> full details of my installation/get up and running problem is pasted
> below, from my first post attempt.
>
> Thanks very much-
>
> Jack
>
> Howdy.... I am wondering if anyone can help me get up and running on
> ubuntu 9.1. I have a company that has some customized IT, we run some
> custom-made software packages for my company on our servers, I'm
> wondering if there's a chance this OCR package could be useful to us,
> if so I may ask our IT people to look at integrating it into some
> tools we use. There is definitely potential for us to contribute back
> to the project if we wind up using it. I myself used to be a hobbyist
> computer programmer until around 1995, after which I didnt have any
> time for it.
>
> Anyway, sorry for all the irrelevant babble. I have a VMware "virtual
> applicance" image of ubuntu 9.1 desktop version which I downloaded
> from the operating systems section of the virtual appliance
> marketplace section of VMware's website. I installed ocropus on that
> virtual Ubuntu applicance which appeared to go without a hitch, though
> frankly there are one or two errors which appeared to be non-fatal and
> which I shall profusely apologize, I didn't copy down.
>
> At any rate, I have ocropus fully up and running, but it always comes
> up with garbage whatever I throw at it, whether with the one-line page
> command or the book2lines,etc., sequence of commands which is
> recommended.
>
> My questions:
>
> (1) Does it come out-of-the-box trained on a recognition model, or is
> it necessary to train it on a model to get it to work?
>
> (2) When I test it out-of-the-box on a sample document that contains
> what I think is easily recognizable test, I get this:
>
> d...@ubuntu:~/ocropus$ ocropus page test.png
> [info] got 895 bboxes
> [info] all = 0
> -~-- -. .
> c|T 5G
> [error] beam search failed
> r2#,:-Y//>Ni| o :
> %d : 6!5## ## #a Bk|v|
> %25` ?Yl,?67o7 ao`|2/#|/e
> 9"q .
> "|`'+,oo<1|n|c
> %;: .e;::|:rt: |--- e(< |`42 ?6: 5a:
> tt+c:~ EMn-- //. ~#
> ;#; ..%::;:::- "-"##i
> o,OB< y2~/<X#<, #|e.#--|
> [error] beam search failed
> #;;;. .;:o:::
> ew7.$;:L#]i:I,r5t{- ..s7AT -- o3/||/|q ot24..
> [error] beam search failed
> ## a: |;;A,%|2:2:# t
> Tn%=c~
> g;V;#U%-|r:#Y6:" ## #
> ;aa2<5J"| ::Li%Y:Ic <
> f? ;;z%;;a-. #`-""9## ` ` @`|h "boo| | RR| su|
> ##V-.-.
> ;Jg
> l e|i ' ` ,"d "||V,,#| |,| ,,t / Z ~ J 4#/ s
> #g;a :+. c
> |osr Acct Ac7 ` 1a:#/ ! .Z -2 # / l ~ lll]yl l!l[g# l
> [error] beam search failed
> .---- #9" "66.` ` eer|#arent a-aa|ohe
> # i 5`- | c7 R|An wi7Rout covro #j ]l;l+l#l[Ill:l
> [error] beam search failed
> [error] beam search failed
> or. ="n--
> '#4=7#%:G;|;7-f:|::b
> `o|< #~#), >/<, "##,. #-,'~
> [error] beam search failed
> ;<.';:L:#i:|-r5t|-
> [error] beam search failed
> g;#<;U|%; :uc::-# e ge. ~
> [error] beam search failed
> ,c-.- .
> .
> -~-.
> ---. 9. ..- "` .=9&` ##o$o#>|
> vad...@ubuntu:~/ocropus$
>
> (2) When I execute the command to check out the default model, I see:
>
> vad...@ubuntu:~/ocropus$ ocropus cin
> model'
> Linerec
> linerec_verbose=0
> linerec_grouper=SimpleGrouper
> linerec_use_reject=1
> linerec_use_priors=0
> linerec_invert=1
> linerec_space_fractile=0.5
> linerec_space_min=0.2
> linerec_minheight=10
> linerec_maxheight=300
> linerec_space_max=1.1
> linerec_space_yes=1
> linerec_maxaspect=1
> linerec_segmenter=DpSegmenter
> linerec_classifier=latin
> linerec_space_multiplier=2
> linerec_extractor=scaledfe
> linerec_cpreload=none
> linerec_space_no=5
> linerec_minclass=32
> linerec_maxcost=20
> linerec_maxrange=5
> linerec_minprob=1e-06
> segmenter: curved cut segmenter
> grouper: SimpleGrouper
> counts: 126 2309208
> CHARCLASS MODEL
> MLP
> mlp_normalization=-1
> mlp_hidden_hi=80
> mlp_noopt=0
> mlp_hidden_lo=20
> mlp_cv_max=5000
> mlp_cds=rowdataset8
> mlp_eta=0.5
> mlp_miters=8
> mlp_hidden_varlog=1.2
> mlp_sparse=-1
> mlp_hidden_min=5
> mlp_hidden_max=300
> mlp_rounds=8
> mlp_nensemble=4
> mlp_%error=0.0267507
> mlp_eta_varlog=1.5
> mlp_eta_init=0.5
> mlp_crossvalidate=1
> mlp_extractor=none
> mlp_cv_split=0.8
> mlp_%nsamples=2.30921e+06
> ninput 900 nhidden 90 noutput 93
> w1 [-15.0474,25.0991] b1 [-24.2937,
> w2 [-24.3254,14.7334] b2 [-5.92876,
> JUNKCLASS MODEL
> MLP
> mlp_normalization=-1
> mlp_hidden_hi=80
> mlp_noopt=0
> mlp_hidden_lo=20
> mlp_cv_max=5000
> mlp_cds=rowdataset8
> mlp_eta=0.5
> mlp_miters=8
> mlp_hidden_varlog=1.2
> mlp_sparse=-1
> mlp_hidden_min=5
> mlp_hidden_max=300
> mlp_rounds=8
> mlp_nensemble=4
> mlp_%error=0.0232608
> mlp_eta_varlog=1.5
> mlp_eta_init=0.5
> mlp_crossvalidate=1
> mlp_extractor=none
> mlp_cv_split=0.8
> mlp_%nsamples=5.27385e+06
> ninput 900 nhidden 103 noutput 2
> w1 [-34.5243,30.0879] b1 [-29.1538,
> w2 [-12.044,12.044] b2 [-0.0241522,
> ULCLASS MODEL
> MLP
> mlp_normalization=-1
> mlp_hidden_hi=80
> mlp_noopt=0
> mlp_hidden_lo=20
> mlp_cv_max=5000
> mlp_cds=rowdataset8
> mlp_eta=0.5
> mlp_miters=8
> mlp_hidden_varlog=1.2
> mlp_sparse=-1
> mlp_hidden_min=5
> mlp_hidden_max=300
> mlp_rounds=8
> mlp_nensemble=4
> mlp_eta_varlog=1.5
> mlp_eta_init=0.5
> mlp_crossvalidate=1
> mlp_extractor=none
> mlp_cv_split=0.8
> ninput 0 nhidden 0 noutput 0
> vad...@ubuntu:~/ocropus$
>
> Question: Am I doing something wrong? I dont understand why the
> results of ocropus page are so garbagey. I see a few places where
> there are 3 characters in a row I suspect may have been correctly
> identified.
>
> Should I be able to run it right out of the box, following the install
> instructions, and running it using the page option immediately?
>
> The VMware appliance is running on a 2.4 ghz core 2 duo mac running os
> 10.6.2, vmare v 3.02
>
> --
> You received this message because you are subscribed to the Google Groups
> "ocropus" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group
> at http://groups.google.com/group/ocropus?hl=en.
