converting a location in an image to pdf coordinates

fb Fri, 05 Dec 2008 06:15:21 -0800

Hello,
I'm losing my hair over coordinate conversion and image extraction.


Here is what I'm trying to do:

I want to perform keyword search on non-searchable PDFs, or on PDFs where the text layer is not well positioned behind the images (and then underline the results using annotations), using PDFBox and an OCR engine.


I've extended printImageLocation the following way:

On a given page I extract all images and generate PNG images with JAI for better quality. (I tried getting a single image for the whole page, but the results are not good enough for the OCR, due to layout issues I think. With JAI I expect to be able to posterize, reduce noise if necessary, etc., to make the OCR happy.)

I run an OCR on them externally (ocropus/tesseract; it's C++, so I have some "Process p = Runtime.getRuntime().exec(cmd);" code), which produces hOCR files giving the text and the coordinates of each character.

I'm then able to determine the coordinates of a keyword by parsing the hOCR file.

At this point, I have the coordinates of the keyword in the image, the position of the image on the page and the size of the image.

I then try to translate the coordinates I got from the parsed image into coordinates in the PDF page.

First I invert the bounding box, as the OCR gives me an upper-left/lower-right pair of points.

Then... I'm stuck: I expected the origin of a PDF page to be the lower-left corner, but it seems to be the upper-left one here. And to be honest, I can hardly figure out which corner of the image is used to determine its location, or which unit is used.

Inside the image, I retrieve the coordinates in dots.
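For reference, hOCR stores each box in a title attribute as "bbox x0 y0 x1 y1", measured in dots from the top-left corner of the image. A minimal sketch of pulling those four numbers out of such an attribute value follows; the class name HocrBox and the method parse are made up for illustration, and a real parser would walk the hOCR markup rather than match a single string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the four bbox numbers from an hOCR title value.
class HocrBox {
    // Matches "bbox x0 y0 x1 y1" anywhere in the attribute value.
    private static final Pattern BBOX =
            Pattern.compile("bbox\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)");

    /**
     * Returns {left, top, right, bottom} in image dots (origin top-left),
     * or null if the title carries no bbox property.
     */
    static int[] parse(String title) {
        Matcher m = BBOX.matcher(title);
        if (!m.find()) {
            return null;
        }
        return new int[] {
            Integer.parseInt(m.group(1)),
            Integer.parseInt(m.group(2)),
            Integer.parseInt(m.group(3)),
            Integer.parseInt(m.group(4))
        };
    }
}
```

Calling HocrBox.parse("bbox 2056 2484 2193 2501") would give the upper-left point (2056, 2484) and the lower-right point (2193, 2501).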

For example, here are the images I've found:
[I0] at 571.26746,71.80139 size=796.0658,93.23215 (small logo)
[I1] at 368.0984,85.12024 size=92.90973,196.4537 (small logo)
[I2] at 583.11694,707.5416 size=12841.42,15587.612 (the scanned article)
[I3] at 176.53192,341.2494 size=402.6675,1046.7035 (image attached to the article)

Visually, [I0] is in the upper left, [I1] is to the right of [I0], and [I3] is in the upper right but below the line of [I0] and [I1]. [I2] is the "body" of the page, actually a press article, where I find the keyword's occurrences.

here is a set of coordinates retrieved from the ocr processing (upperleft / lower right):

keyword: (2056.0/2484.0) (2193.0/2501.0)

which gives (lower left / upper right):
(2056.0/2501.0) (2193.0/2484.0)
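Note that swapping the two y values as above only relabels the corners: the numbers still count down from the top of the image. A minimal sketch of an actual flip to a bottom-left origin is shown here, assuming the height of the image in dots is known. The class name OcrBoxFlipper is made up, and the height of 3000 used in the usage note is an invented value, not the real height of [I2].

```java
// Hypothetical helper: re-expresses an OCR box relative to the bottom-left corner.
class OcrBoxFlipper {
    /**
     * Takes a box given as upper-left (left, top) and lower-right (right, bottom)
     * with the origin at the top-left of the image and y growing downwards.
     * Returns {llx, lly, urx, ury} with the origin at the bottom-left of the
     * image and y growing upwards. Units stay image dots.
     */
    static double[] flip(double left, double top, double right, double bottom,
                         double imageHeight) {
        double lowerLeftY = imageHeight - bottom;  // bottom edge is nearest the new origin
        double upperRightY = imageHeight - top;    // top edge is farthest from it
        return new double[] { left, lowerLeftY, right, upperRightY };
    }
}
```

With an assumed image height of 3000 dots, OcrBoxFlipper.flip(2056, 2484, 2193, 2501, 3000) gives the lower-left point (2056, 499) and the upper-right point (2193, 516).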

Here are the coordinates of the same occurrence in the PDF (the result I would expect to find after a conversion to lower left / upper right; provided here by parsing the text layer, which is hopefully well positioned):

START : String[ xy=511.5022,665.7338 fontsize=33.0 xscale=0.24686399 yscale=0.225744 height=5.579715 space=2.2647307 width=128.40302] = keyword
END : String[ xy=511.5022,665.7338 fontsize=33.0 xscale=0.24686399 yscale=0.225744 height=5.579715 space=2.2647307 width=128.40302] = keyword
--------->keyword : 1790.4882,2322.2808, 1881.345,2348.561 (the bounding box converted into a suitable metric system to put annotations on it)

I guess I have to set up a transformation matrix, but I don't know which parameters I have to take into account (or whether they are available in one way or another!).
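For illustration, here is a minimal sketch of what such a matrix mapping could look like. It relies on the PDF convention that an image is painted into the unit square, which the current transformation matrix (CTM) in effect at the "Do" operator maps onto the page. The sketch assumes the six CTM values [a b c d e f] can be read from the graphics state at that point and that the image size in dots is known. The class name ImageToPageMapper is made up, and the numbers in the usage note are invented, not taken from the page above.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

// Hypothetical helper: maps an OCR box in image dots into PDF user space.
class ImageToPageMapper {
    /**
     * The box is given as upper-left (left, top) and lower-right (right, bottom)
     * in image dots, origin top-left. The result is the axis-aligned bounding
     * rectangle in user space (origin bottom-left of the page, y upwards).
     */
    static Rectangle2D toPageSpace(double left, double top, double right, double bottom,
                                   int imageWidth, int imageHeight, AffineTransform ctm) {
        // Normalise to the unit square and flip y so that v grows upwards.
        double u0 = left / imageWidth;
        double u1 = right / imageWidth;
        double v0 = 1.0 - bottom / imageHeight;
        double v1 = 1.0 - top / imageHeight;

        // Transform all four corners so a rotated or skewed CTM is handled too.
        double[] pts = { u0, v0, u1, v0, u1, v1, u0, v1 };
        ctm.transform(pts, 0, pts, 0, 4);

        double minX = pts[0], maxX = pts[0], minY = pts[1], maxY = pts[1];
        for (int i = 2; i < pts.length; i += 2) {
            minX = Math.min(minX, pts[i]);
            maxX = Math.max(maxX, pts[i]);
            minY = Math.min(minY, pts[i + 1]);
            maxY = Math.max(maxY, pts[i + 1]);
        }
        return new Rectangle2D.Double(minX, minY, maxX - minX, maxY - minY);
    }
}
```

The constructor new AffineTransform(a, b, c, d, e, f) takes its arguments in the same order as the PDF matrix [a b c d e f]. For an invented CTM of (200, 0, 0, 100, 50, 300) and a 1000 x 500 dot image, the box (100, 50) / (300, 100) would land at x=70, y=380 with a width of 40 and a height of 10 in user space. Page rotation and any crop-box offset are not handled here.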

Could someone provide some advice?
Thanks
fb.
