On 01.04.2023 11:41, Tilman Hausherr wrote:
On 30.03.2023 16:27, Tim Allison wrote:
Reports are here:
https://corpora.tika.apache.org/base/reports/pdfbox-2.0.27-v-2.0.28-SNAPSHOT.tgz
Thank you Tim!
What I see is:
1) Text missing in TOP_10_MORE_IN_B; these might (all?) be related to
the issue that Andreas reopened
2) Different Arabic text => PDFBOX-4531; hopefully these are improvements
3) Misc improvements; I'll add two of them to my own extraction
regression tests
Tilman
There is also some improved ligature text extraction, which might also be
related to the PDFBOX-4531 changes. It can be seen in govdocs file
433525.pdf, on the first page in "Neutron radiation offers" (the "ff"
now appears correctly).
Tilman