Hi Dimuthu,

> 1 Print those data into PDDocument again and pass through TextStripper
> of PDFBox. This could reduce the performance of overall process.

This was what I had in mind, but rather than printing the text into the
PDDocument you can inject it directly into PDFTextStripper as TextPosition
instances. I mentioned something like this a while ago:

> You could subclass PDFTextStripper and override the startDocument method and
> use it to create a PDFRenderer and store it in a field. Then override the
> processPage method and use the previously created PDFRenderer to render the
> current page to a buffered image and perform OCR on the image. Once you have
> the OCR text + positions, instead of calling processStream you can call
> processTextPosition once for each character + position.

Let’s see how well it works and then re-evaluate.

-- John
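
For the "once for each character + position" step, here is a rough, PDFBox-free
sketch of how OCR output might be broken down into per-character boxes. It
assumes the OCR engine reports one bounding box per word and simply divides
that box evenly among the word's characters, which is only an approximation.
OcrCharBoxes, CharBox and splitWord are hypothetical names for illustration,
not part of PDFBox. Wrapping each box in a TextPosition and passing it to
processTextPosition is left out here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of PDFBox. */
class OcrCharBoxes {

    /** One character with its (approximate) bounding box in page coordinates. */
    static final class CharBox {
        final String unicode;
        final float x;
        final float y;
        final float width;
        final float height;

        CharBox(String unicode, float x, float y, float width, float height) {
            this.unicode = unicode;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /**
     * Splits a word-level OCR box into equal-width character boxes.
     * Assumption: the OCR engine only gives word boxes, so every character
     * gets the same share of the word's width.
     */
    static List<CharBox> splitWord(String word, float x, float y,
                                   float width, float height) {
        List<CharBox> result = new ArrayList<>();
        int count = word.codePointCount(0, word.length());
        if (count == 0) {
            return result; // nothing recognised, nothing to emit
        }
        float charWidth = width / count;
        int index = 0;
        // Iterate by code point so surrogate pairs stay together.
        for (int offset = 0; offset < word.length(); ) {
            int codePoint = word.codePointAt(offset);
            String ch = new String(Character.toChars(codePoint));
            result.add(new CharBox(ch, x + index * charWidth, y, charWidth, height));
            offset += Character.charCount(codePoint);
            index++;
        }
        return result;
    }

    public static void main(String[] args) {
        // In the real subclass, each box would become one TextPosition.
        for (CharBox box : splitWord("PDF", 100f, 700f, 30f, 12f)) {
            System.out.println(box.unicode + " x=" + box.x + " width=" + box.width);
        }
    }
}
```

If the OCR engine can report character-level boxes directly, those should be
more accurate than an even split and this step could be dropped.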