Hi,
Thanks, I think this is OK and we should go ahead with the release. The
improvements are mostly from fixing the problem that was on the "east
european" docs (PDFBOX-4720).
Some of the reduction in common words is because "XXX" counts as a common
word but "XXX's" doesn't, and the bug missed that apostrophe.
Could you, when time permits, run the test between 2.0.17 and 2.0.19,
and with the rotationMagic option for both? This way we wouldn't see
the "east european docs" improvements.
(I would have missed the changes in 718808.pdf. But I don't think these
are important because it is rotated text. The extraction looks OK when
using the "rotationMagic" option. Same for slight differences in
573977.pdf)
Thanks
Tilman
On 19.02.2020 at 21:27, Tim Allison wrote:
Sorry for the delays!
Reports are available here:
http://162.242.228.174/reports/reports_pdfbox_2.0.19-prerc1.tgz
There's a small decrease in "common words", but in looking at specifics, it
doesn't look meaningful...it looks like junk before and junk after.
The one negligible exception to this is: govdocs1/718/718808.pdf, but
that's not a showstopper, especially given the others where the text
appears to be better:
commoncrawl2/K2/K2VSDCR4CKNZAS35EUAVUNPUDE2UJXE7
govdocs1/287/287741.pdf
There's something wrong with the attachment by mime comparison
SQL/extraction that I need to look into, but the same number of attachments
were extracted. I'm not concerned.
In short, all looks good to me. Let me know what you think.
Best,
Tim
On Wed, Feb 19, 2020 at 6:05 AM Tim Allison <[email protected]> wrote:
Results in a few hours...
On Mon, Feb 17, 2020 at 1:50 PM Andreas Lehmkuehler <[email protected]>
wrote:
As Tim's test results aren't available yet, I'm going to postpone the
release for another day or three (I'm busy on Wednesday).
Andreas
On 11.02.20 at 21:13, Andreas Lehmkühler wrote:
I'm planning to cut the release next Monday.
@Tim please run the regression tests if possible
Thanks in advance
Andreas
On 7 February 2020 at 01:22:34 CET, Tim Allison <[email protected]> wrote:
If you’re up for it, that’d be great! Let me know when I should run the
regression tests.
Thank you!
On Thu, Feb 6, 2020 at 1:36 PM Andreas Lehmkuehler <[email protected]>
wrote:
On 06.02.20 at 13:14, Tim Allison wrote:
Hi All,
We're probably 3ish* weeks away from the next release cycle for Apache
Tika. I realize PDFBox 2.0.18 just came out at the end of December. Are
there any plans/desires for a 2.0.19 release that could make it in to the
next Tika?
I have no plans so far, but how about cutting a release in about 10 days
from now?
Andreas
Cheers,
Tim
*3ish weeks -- as measured by Open Source Standard Time :D
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]