Hi Dimuthu

> 1 Print those data into PDDocument again and pass through TextStripper
> of PDFBox. This could reduce the performance of overall process.

This was what I had in mind, but rather than printing the text into the
PDDocument, you can inject it directly into PDFTextStripper as
TextPosition instances. I mentioned something like this a while ago:

> You could subclass PDFTextStripper and override the startDocument method and 
> use it to create a PDFRenderer and store it in a field. Then override the 
> processPage method and use the previously created PDFRenderer to render the 
> current page to a buffered image and perform OCR on the image. Once you have 
> the OCR text + positions, instead of calling processStream you can call 
> processTextPosition once for each character + position.
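
One detail that comes up with this approach: the OCR engine reports
positions in pixels of the rendered image, while PDF page coordinates are
in points (72 per inch), so the boxes need scaling by 72/dpi before they
can be turned into positions. Below is a minimal sketch of that step only.
It does not use PDFBox at all; OcrBoxScaler and Box are hypothetical names
made up for illustration, and it assumes the page was rendered unrotated at
a known DPI with the origin left at the top-left corner of the image.

```java
// Hypothetical helper, not part of PDFBox: converts OCR boxes measured in
// image pixels into PDF points (1 point = 1/72 inch).
final class OcrBoxScaler {
    private static final float POINTS_PER_INCH = 72f;

    /** Hypothetical holder for one recognised character and its box. */
    static final class Box {
        final String text;
        final float x, y, width, height;

        Box(String text, float x, float y, float width, float height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** Scales a single length from pixels at the given DPI to points. */
    static float toPoints(float pixels, float dpi) {
        if (dpi <= 0f) {
            throw new IllegalArgumentException("dpi must be positive");
        }
        return pixels * POINTS_PER_INCH / dpi;
    }

    /**
     * Scales a whole box from pixels to points. The origin is not moved,
     * so y is still measured from the top of the rendered image.
     */
    static Box scaleBox(Box pixelBox, float dpi) {
        return new Box(pixelBox.text,
                toPoints(pixelBox.x, dpi),
                toPoints(pixelBox.y, dpi),
                toPoints(pixelBox.width, dpi),
                toPoints(pixelBox.height, dpi));
    }
}
```

For example, a character found 300 pixels from the left edge of a page
rendered at 300 DPI sits 72 points from the left edge of the page. Page
rotation and any crop box offset are ignored here and would have to be
handled separately.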

Let’s see how well it works and then re-evaluate.

-- John
