On 11.03.21 at 07:24, Tilman Hausherr wrote:
new report
http://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.22_vs_2.0.23_5.tar.xz
The content differences part is now the smallest ever, likely due to my change
in tika-eval (TIKA-3314) and restoring a PDFBox code segment I accidentally
deleted (PDFBOX-5115).
Cool!!
There are three new exceptions. Two are in jempbox and one is in tika itself, so
I suspect PDFBox isn't to blame. I'll look at it too if I have the time.
As far as I remember, the jempbox issue isn't new; Tim mentioned it some time
ago. Just out of curiosity, does it make sense to use an old library to extract
metadata? Is there anything missing in xmpbox but available in jempbox?
Andreas
Tilman
On 08.03.2021 at 11:17, Tilman Hausherr wrote:
new report:
http://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.22_vs_2.0.23_3.tar.xz
Tilman
On 08.03.2021 at 10:35, Tilman Hausherr wrote:
I think we're good (despite the differences, most of which are because of the
soft hyphen), but I'm now experimenting with a modified version of tika-eval
to see what happens.
Tilman
On 07.03.2021 at 19:47, Tilman Hausherr wrote:
new report at
http://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.22_vs_2.0.23_2.tar.xz
Tilman
On 07.03.2021 at 11:43, Tilman Hausherr wrote:
On 07.03.2021 at 06:04, Tilman Hausherr wrote:
Report is here:
http://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.22_vs_2.0.23.tar.xz
Not much has changed. No new exceptions. Regarding content, the changes that
seem important are all related to "soft hyphen".
https://issues.apache.org/jira/browse/PDFBOX-5115
I am currently fixing this, and then I'll run the tests again. The text
extraction differences will likely stay. It's possible that a change in
tika-eval is needed too.
Tilman
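To illustrate why a soft hyphen (U+00AD) can show up as a content difference between two extraction runs, here is a minimal sketch. It assumes a comparison that splits text on non-word characters; the tokens() and strip_soft_hyphens() helpers are hypothetical and are not taken from tika-eval or PDFBox.

```python
import re

SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN, normally invisible when rendered


def tokens(text):
    # Hypothetical tokenizer: split on any run of non-word characters.
    # U+00AD is a format character, so it counts as a separator here.
    return [t for t in re.split(r"\W+", text) if t]


def strip_soft_hyphens(text):
    # Assumed normalization step: drop soft hyphens before comparing.
    return text.replace(SOFT_HYPHEN, "")


without_shy = "hyphenation"
with_shy = "hyphen" + SOFT_HYPHEN + "ation"

# The word is split in two when the soft hyphen is kept ...
print(tokens(with_shy))
# ... and matches the other run again once it is stripped.
print(tokens(strip_soft_hyphens(with_shy)) == tokens(without_shy))
```

Whether such a normalization belongs in the extractor or in the comparison tool is exactly the open question in the message above; the sketch only shows the effect.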
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org