I have created two issues on parsing exceptions, and it's not
PDFBOX-5283. Maybe it's the same, maybe not. Re text extraction, I
looked at one of the files (414724.pdf) and there's also a parsing
warning, so maybe that is related too. Let's just wait.
Tilman
On 22.03.2022 at 18:21, Tilman Hausherr wrote:
I don't have much time right now, but I just tested 077867.pdf and
392443.pdf and it's definitely a regression. I wonder if it was
PDFBOX-5283.
The files in content_diffs_no_exceptions.xls where the T column is
non-empty are suspicious and need more investigation.
Tilman
On 22.03.2022 at 13:29, Tim Allison wrote:
Reports are here:
https://corpora.tika.apache.org/base/reports/tika-2.3-vs-2.4-pdfs.tgz
It looks like there are no significant changes. There are some diffs on
a few files, but this was run on ~800k PDFs.
There are a couple of cases where a file is now being detected as
rfc822 instead of PDF. We have to fix that on the Tika side.
On Mon, Mar 21, 2022 at 12:53 PM Andreas Lehmkuehler
<andr...@lehmi.de> wrote:
On 21.03.22 at 12:21, Tim Allison wrote:
I'm happy to run the tests today if that would be of any interest.
Yes, please.
TIA
Andreas
On Sun, Mar 20, 2022 at 5:01 PM Andreas Lehmkuehler
<andr...@lehmi.de> wrote:
On 13.03.22 at 14:20, Tim Allison wrote:
From Tika's perspective, there's no rush. We're waiting for a bug fix
in POI (TIKA-3699).
Please let me know if/when I should run the regression tests.
Thanks for the offer. Do we need to run the tests before cutting
the release?
Most of the tickets aren't related to text extraction. Those which
are related
should decrease the number of exceptions and increase the accuracy.
WDYT?
Thank you, all!
Cheers,
Tim
On Sat, Mar 12, 2022 at 5:29 AM Andreas Lehmkuehler
<andr...@lehmi.de> wrote:
On 11.03.22 at 08:30, Tilman Hausherr wrote:
On 11.03.2022 at 08:19, Andreas Lehmkuehler wrote:
On 10.03.22 at 20:16, Tilman Hausherr wrote:
I'd agree but that might mean PDFBOX-5384 wouldn't be fixed.
It's been there for quite some time and it seems to be a rare
corner case. IMHO it can wait if we don't find a solution before Monday.
No, that one was created on March 2nd. Oliver has just posted a
suggestion so maybe that is a solution.
The ticket is quite new, but the issue itself was introduced in
2018 with 2.0.12. ;-)
However, I'll have a look at the proposed solution.
Andreas
Tilman
WDYT?
Andreas
Tilman
On 10.03.2022 at 19:05, Andreas Lehmkuehler wrote:
On 09.03.22 at 17:07, Tim Allison wrote:
All,
I've been out of the office for a bit and haven't caught up yet.
Apologies if I've missed the discussion.
Are there plans for a 2.0.26 release? We're probably a few weeks out
from starting our next 1.x and 2.x releases on Tika, and it would be
great to incorporate 2.0.26. No problem at all if 2.0.26 is slated
for later.
How about cutting the release next Monday?
Andreas
Thank you!
Cheers,
Tim
On Fri, Mar 4, 2022 at 10:46 PM Tilman Hausherr
<thaush...@t-online.de> wrote:
On 24.02.2022 at 07:41, Andreas Lehmkuehler wrote:
On 22.02.22 at 07:49, Andreas Lehmkuehler wrote:
Hi,
I'm planning to cut a new JBIG2 release next week. There aren't that
many changes, but I think the fixes are worth releasing. [1]
I'm going to cut the release next weekend, if nobody
objects.
Once it is done we should think about a 2.0.26 release of
PDFBox
Yes please!
Tilman
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org