Hello,
I'm losing my hair over coordinate conversion and image extraction.

Here is what I'm trying to do :

I want to perform keyword search on non-searchable PDFs, or on PDFs where the text layer is not well positioned behind the images (and then underline the results using annotations), using PDFBox and an OCR:

I've extended printImageLocation in the following way:

On a given page I extract all images and generate PNG images with JAI for better quality. (I tried rendering a single image for the whole page, but the OCR results were not good enough, I think because of layout issues. With JAI I expect to be able to posterize, reduce noise if necessary, etc. to make the OCR happy.)

I then run an OCR on them externally (ocropus/tesseract; it's C++, so I have some "Process p = Runtime.getRuntime().exec(cmd);" code), which produces hOCR files giving the text and the coordinates of each character. By parsing the hOCR file I can determine the coordinates of a keyword.

At this point I have the coordinates of the keyword in the image, the position of the image on the page, and the size of the image. I then try to translate the coordinates I got from the parsed image into coordinates on the PDF page. First I invert the bounding box, as the OCR gives me an upper-left / lower-right pair of points. Then... I'm stuck: I expected the origin to be lower left in a PDF page, but it seems to be upper left here. And to be honest, I can hardly figure out which corner of the image is used to determine its location, or what unit is used.
Inside the image, I retrieve coordinates in dots.
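To make the hOCR step concrete, here is a minimal sketch of how the boxes could be pulled out of the OCR output. It assumes the usual hOCR convention of a "bbox x0 y0 x1 y1" entry in the title attribute, in pixels measured from the upper-left corner of the image. The class name HocrBoxes and the regex approach are only illustrative; a real HTML parser would be more robust.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects every "bbox x0 y0 x1 y1" entry found in an hOCR string.
final class HocrBoxes {
    // Assumption: boxes are written as four non-negative integers after the word "bbox".
    private static final Pattern BBOX =
            Pattern.compile("bbox\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)");

    /** Returns one {x0, y0, x1, y1} array per bbox, in document order. */
    static List<int[]> parse(String hocr) {
        List<int[]> boxes = new ArrayList<>();
        Matcher m = BBOX.matcher(hocr);
        while (m.find()) {
            boxes.add(new int[] {
                Integer.parseInt(m.group(1)),  // x0: left
                Integer.parseInt(m.group(2)),  // y0: top
                Integer.parseInt(m.group(3)),  // x1: right
                Integer.parseInt(m.group(4))   // y1: bottom
            });
        }
        return boxes;
    }
}
```

Calling HocrBoxes.parse on a fragment such as <span title='bbox 2056 2484 2193 2501'>keyword</span> would yield a single box with those four values.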

For example, here are the images I've found :
[I0] at 571.26746,71.80139 size=796.0658,93.23215 (small logo)
[I1] at 368.0984,85.12024 size=92.90973,196.4537 (small logo)
[I2] at 583.11694,707.5416 size=12841.42,15587.612 (the scanned article)
[I3] at 176.53192,341.2494 size=402.6675,1046.7035 (image attached to the article)

Visually, [I0] is in the upper left, [I1] is to the right of [I0], and [I3] is in the upper right but below the line of [I0] and [I1]. [I2] is the "body" of the page, actually a press article, where I find the keyword's occurrences.

here is a set of coordinates retrieved from the ocr processing (upper left / lower right):
keyword: (2056.0/2484.0) (2193.0/2501.0)

which gives (lower left / upper right):
(2056.0/2501.0) (2193.0/2484.0)
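If I understand the usual convention correctly, pairing the corners differently only relabels them; the y values are still measured from the top of the image. Moving the origin to the bottom would also need the image height in pixels, which I have not listed above. A small sketch of that flip, where BoxFlip and the imageHeight parameter are hypothetical names:

```java
// Hypothetical helper: re-expresses an OCR box in a bottom-left-origin system.
final class BoxFlip {
    /**
     * Input:  {x0, yTop, x1, yBottom}, y measured downward from the top edge.
     * Output: {x0, yLow, x1, yHigh},  y measured upward from the bottom edge.
     * imageHeight is the pixel height of the OCR'd image (an assumed input).
     */
    static double[] toBottomLeftOrigin(double[] box, double imageHeight) {
        return new double[] {
            box[0],                 // left edge is unchanged
            imageHeight - box[3],   // old bottom becomes the lower y
            box[2],                 // right edge is unchanged
            imageHeight - box[1]    // old top becomes the upper y
        };
    }
}
```

With a made-up image height of 3000 pixels, the box above would become (2056.0/499.0) (2193.0/516.0).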

Here are the coordinates of the same occurrence in the PDF (the result I would expect to find after a lower-left / upper-right conversion; obtained here by parsing the text layer, which is hopefully well positioned):
START : String[ xy=511.5022,665.7338 fontsize=33.0 xscale=0.24686399 yscale=0.225744 height=5.579715 space=2.2647307 width=128.40302] = keyword
END : String[ xy=511.5022,665.7338 fontsize=33.0 xscale=0.24686399 yscale=0.225744 height=5.579715 space=2.2647307 width=128.40302] = keyword
--------->keyword : 1790.4882,2322.2808, 1881.345,2348.561 (the bounding box converted into a metric system suitable for putting annotations on it)

I guess I have to set up a transformation matrix, but I don't know which parameters I have to take into account (or whether they are available in one way or another!).
Could someone provide some advice?
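To show the kind of transformation I have in mind, here is a rough sketch using java.awt.geom.AffineTransform. It relies on my understanding that a PDF image is painted into a unit square (top row of pixels at y = 1) which is then mapped onto the page by the transformation matrix in effect when the image is drawn. The class ImageToPage, its parameters, and the idea that this matrix is the only thing needed (no page rotation, no crop box offset) are assumptions on my side.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

// Hypothetical helper: maps a pixel position inside an image to page coordinates.
final class ImageToPage {
    /**
     * px, py   : pixel position, origin at the upper-left corner of the image
     * imgW/imgH: image size in pixels
     * ctm      : matrix that places the image's unit square on the page (assumed known)
     */
    static Point2D.Double pixelToPage(double px, double py,
                                      double imgW, double imgH,
                                      AffineTransform ctm) {
        // Step 1: pixels -> unit square, flipping y because pixel rows run downward.
        Point2D.Double unit = new Point2D.Double(px / imgW, 1.0 - py / imgH);
        // Step 2: unit square -> page space through the placement matrix.
        Point2D.Double page = new Point2D.Double();
        ctm.transform(unit, page);
        return page;
    }
}
```

For example, with a made-up matrix that scales by 200 x 100 and translates by (50, 60), the upper-left pixel of a 400 x 200 image would land at (50, 160) and the lower-right corner at (250, 60). Both corners of the keyword box would be mapped this way to get the rectangle for the annotation.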
Thanks
fb.

