Maybe this could be done with the ExtractTextByArea example. However, IIRC its coordinates are AWT-like (y = 0 at the top), so the PDF coordinates (y = 0 at the bottom) would have to be mapped to that somehow.
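
A minimal sketch of that mapping, assuming the region is an unrotated rectangle in PDF user space and that the text extraction region is measured from the top-left corner of the page box (which matches the IIRC above but is not verified here). The class and method names are hypothetical; only java.awt.geom.Rectangle2D is a real API.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox.
class PdfToAwtRegion {

    /**
     * Converts a rectangle from PDF user space (origin at the lower left,
     * y grows upward) to a top-left-origin rectangle (y grows downward)
     * relative to a page box such as the crop box.
     *
     * Assumption: the page is not rotated.
     */
    static Rectangle2D.Double toTopLeft(
            double boxLowerLeftX, double boxLowerLeftY, double boxHeight,
            double x, double y, double width, double height) {
        // Shift x so that it is measured from the left edge of the box.
        double awtX = x - boxLowerLeftX;
        // The top edge of the PDF rectangle sits at (y + height);
        // measure its distance down from the top edge of the box.
        double awtY = boxHeight - ((y - boxLowerLeftY) + height);
        return new Rectangle2D.Double(awtX, awtY, width, height);
    }

    public static void main(String[] args) {
        // A 612 x 792 pt box with its origin at (0, 0), mapped as a whole.
        System.out.println(toTopLeft(0, 0, 792, 0, 0, 612, 792));
    }
}
```

With PDFBox, the box values would presumably come from page.getCropBox() (getLowerLeftX(), getLowerLeftY(), getHeight()), and the result could be passed to PDFTextStripperByArea.addRegion(String, Rectangle2D). Mapping the crop box itself this way gives a region that covers just the visible page. Page rotation would need extra handling.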

Tilman

On 21.07.2021 at 18:21, Tim Allison wrote:
https://stackoverflow.com/questions/68402058/tika-isnt-reading-pdf-properly

Not sure there's much we should do on the Tika side.

How hard would it be to add an "extract only text that is on the page" feature?

Best,

        Tim

