On 11.03.21 at 07:24, Tilman Hausherr wrote:
New report:
http://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.22_vs_2.0.23_5.tar.xz

The content differences section is now the smallest ever, likely due to my change in tika-eval (TIKA-3314) and to restoring a PDFBox code segment I accidentally deleted (PDFBOX-5115).
Cool!!

There are three new exceptions. Two are in jempbox and one is in Tika itself, so I suspect PDFBox isn't to blame. I'll look at it too if I have the time.
As far as I remember, the jempbox issue isn't new; Tim mentioned it some time ago. Just out of curiosity, does it make sense to use an old library to extract metadata? Is there anything missing in xmpbox that is available in jempbox?


Andreas


Tilman


On 08.03.2021 at 11:17, Tilman Hausherr wrote:
new report:
http://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.22_vs_2.0.23_3.tar.xz

Tilman

On 08.03.2021 at 10:35, Tilman Hausherr wrote:
I think we're good (despite the differences, most of which are because of the soft hyphen), but I'm now experimenting with a modified version of tika-eval to see what happens.

Tilman

On 07.03.2021 at 19:47, Tilman Hausherr wrote:
new report at

http://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.22_vs_2.0.23_2.tar.xz

Tilman

On 07.03.2021 at 11:43, Tilman Hausherr wrote:
On 07.03.2021 at 06:04, Tilman Hausherr wrote:
Report is here:

http://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.22_vs_2.0.23.tar.xz


Not much has changed. No new exceptions. Regarding content, the changes that seem important are all related to "soft hyphen".

https://issues.apache.org/jira/browse/PDFBOX-5115

I am currently fixing this, and then I'll run the tests again. The text extraction differences will likely stay. It's possible that a change in tika-eval is needed too.

Tilman


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



