Hi,

Thanks, I think this is OK and we should go ahead with the release. The improvements are mostly from fixing the problem that was on the "east european" docs (PDFBOX-4720).

Some of the reduction in common words is because "XXX" counts as a common word but "XXX's" doesn't, and the bug missed that apostrophe (').
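
To illustrate the apostrophe effect, here is a minimal sketch. It assumes a plain whitespace tokenizer, a made-up common-word list, and that the missed apostrophe left "XXX" standing as its own token; none of this is the actual evaluation code, and the names are hypothetical.

```python
# Hypothetical common-word list, for illustration only; the real
# evaluation uses its own word lists and tokenizer.
COMMON_WORDS = {"the", "agency", "report"}


def count_common_words(text):
    """Count whitespace-separated tokens found in COMMON_WORDS (case-insensitive)."""
    return sum(1 for token in text.lower().split() if token in COMMON_WORDS)


# Assumed output with the bug: the apostrophe is lost, so "agency" is a bare token.
with_bug = "the agency s report"
# Assumed output after the fix: "agency's" is one token and is not in the list.
fixed = "the agency's report"

print(count_common_words(with_bug), count_common_words(fixed))
```

Under these assumptions the better extraction scores one common word fewer, so a small drop in the count is not necessarily a regression.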

Could you, when time permits, run the test between 2.0.17 and 2.0.19, with the rotationMagic option for both? That way we wouldn't see the "east european docs" improvements.

(I would have missed the changes in 718808.pdf. But I don't think these are important because it is rotated text. The extraction looks OK when using the "rotationMagic" option. The same goes for the slight differences in 573977.pdf.)

Thanks
Tilman

On 19.02.2020 at 21:27, Tim Allison wrote:
Sorry for the delays!

Reports are available here:
http://162.242.228.174/reports/reports_pdfbox_2.0.19-prerc1.tgz

There's a small decrease in "common words", but in looking at specifics, it
doesn't look meaningful...it looks like junk before and junk after.

The one negligible exception to this is: govdocs1/718/718808.pdf, but
that's not a showstopper, especially given the others where the text
appears to be better:
commoncrawl2/K2/K2VSDCR4CKNZAS35EUAVUNPUDE2UJXE7
govdocs1/287/287741.pdf

There's something wrong with the attachment by mime comparison
SQL/extraction that I need to look into, but the same number of attachments
were extracted.  I'm not concerned.

In short, all looks good to me.  Let me know what you think.

Best,

      Tim

On Wed, Feb 19, 2020 at 6:05 AM Tim Allison <[email protected]> wrote:

Results in a few hours...

On Mon, Feb 17, 2020 at 1:50 PM Andreas Lehmkuehler <[email protected]>
wrote:

As Tim's test results aren't available yet, I'm going to postpone the
release for another day or three (I'm busy on Wednesday).

Andreas

On 11.02.20 at 21:13, Andreas Lehmkühler wrote:
I'm planning to cut the release next Monday.

@Tim please run the regression tests if possible

Thanks in advance
Andreas

On 7 February 2020 01:22:34 CET, Tim Allison <[email protected]> wrote:
If you’re up for it, that’d be great! Let me know when I should run the
regression tests.

Thank you!

On Thu, Feb 6, 2020 at 1:36 PM Andreas Lehmkuehler <[email protected]>
wrote:

On 06.02.20 at 13:14, Tim Allison wrote:
Hi All,

     We're probably 3ish* weeks away from the next release cycle for Apache
Tika.  I realize PDFBox 2.0.18 just came out at the end of December.  Are
there any plans/desires for a 2.0.19 release that could make it into the
next Tika?

I have no plans so far, but how about cutting a release in about 10 days
from now?

Andreas

        Cheers,

                 Tim

*3ish weeks -- as measured by Open Source Standard Time :D


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


