I'm testing out OCRing PDF tables using Tesseract OCR. I'm borrowing the
concept from here:
http://craiget.com/extracting-table-data-from-pdfs-with-ocr/

I've made good progress using the PPM examples on Rosetta Code and am
having fun with it. (NOTE: I am starting off with an image resized down to
157x158. The original image is 5100x6601 -- 600 dpi)

Here's an image of my progress.
http://imgur.com/a/3fcKK

I'm determining a "line" by checking whether the rolling sum of the
previous 10 points is zero. I don't want all black pixels, just the ones
that constitute a line. I'm stuck because this simple approach is, I
think, shrinking the matrix: the infix returns fewer items than it was
given. I'm not yet savvy enough with matrices to figure out what to do
from here.
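
To make the shrinking concrete, here is a minimal sketch on a made-up
10-pixel row with a window of 4 instead of 10. The names row, sums and run
are hypothetical and not part of the script below.

```j
NB. toy row, same convention as xb: 1 = light pixel, 0 = dark pixel
row  =: 1 1 0 0 0 0 1 0 0 1
NB. a window of width 4 yields (#row) - 3 sums, one per window start
sums =: 4 +/\ row          NB. 2 1 0 1 1 1 2
NB. 1 where a window is entirely dark
run  =: sums = 0           NB. 0 0 1 0 0 0 0
NB. the result has 7 items although the row has 10
(# row) , # run            NB. 10 7
```

With a window of 10 the result is 9 items shorter along the scanned axis,
which matches the 148 and 49 in the shapes shown next.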

   $ xb
58 157
   $ hlines
58 148
   $ vlines
49 157

My next logical step (assuming the matrices were equal) was to essentially
AND them together so that I had a combined image/matrix of black/white for
the vertical and horizontal lines.
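
One possible way to get there, sketched on a made-up 8x8 bitmap with a
window of 4: pad each mask back to the image shape with overtake ({.) and
then combine. This assumes 1 means "line here", in which case OR (+.)
keeps pixels from either mask, while AND (*.) would keep only the
crossings. The names img, w, hl, vl, hp, vp and both are hypothetical.

```j
NB. toy 8x8 bitmap: 1 = light, 0 = dark; row 2 and column 5 are dark lines
img  =: (2 ~: i.8) *./ (5 ~: i.8)
w    =: 4                    NB. window width (the script below uses 10)
hl   =: 0 = w +/\"1 img      NB. horizontal runs, shape 8 5
vl   =: 0 = w +/\ img        NB. vertical runs, shape 5 8
NB. overtake pads with zeros on the right / at the bottom
hp   =: ($ img) {. hl        NB. shape 8 8
vp   =: ($ img) {. vl        NB. shape 8 8
both =: hp +. vp             NB. union of the two line masks
```

Note that each 1 in the padded masks marks where an all-dark window
starts, so the last w-1 pixels of every line are not flagged.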

I was then going to attempt to chop up the image as in the Python blog
post and feed it to Tesseract.

Any tips on taking it further would be great. Thanks for the help.

You can get the PPM here:
https://www.dropbox.com/s/qoi1glkqs0tfezs/small.ppm

require 'files'

readppm=: monad define
  dat=. fread y                                           NB. read from file
  msk=. 1 ,~ (*. 3 >: +/\) (LF&=@}: *. '#'&~:@}.) dat     NB. mark field ends
  't wbyh maxval dat'=. msk <;._2 dat                     NB. parse
  'wbyh maxval'=. 2 1([ {. [: _99&". (LF,' ')&charsub)&.> wbyh;maxval  NB. convert to numeric
  if. (_99 0 +./@e. wbyh,maxval) +. 'P6' -.@-: 2{.t do. _1 return. end.
  (a. i. dat) makeRGB |.wbyh                              NB. convert to basic bitmap format
)

makeRGB=: 0&$: : (($,)~ ,&3)
fillRGB=: makeRGB }:@$
setPixels=: (1&{::@[)`(<"1@(0&{::@[))`]}
getPixels=: <"1@[ { ]

NB. viewmat _50 (+ / % #) \ _50 (+ / % #)\"1 x2

z=:readppm 'c:/temp/small.ppm'

NB. compress the RGB into a single number
x2=:+/"1 z

NB. threshold to boolean: 1 where the RGB sum is at least 500 (light), 0 where dark
xb =: 500 <: x2

hlines=:(10 (+/)\"1 xb) = 0
vlines=:(10 (+/)\ xb) = 0