On 23.03.22 at 05:28, Tilman Hausherr wrote:
I have created two issues on parsing exceptions, and it's not PDFBOX-5283. Maybe it's the same, maybe not. Re text extraction, I looked at one of the files (414724.pdf) and there's also a parsing warning, so maybe that is related too. Let's just wait.
Thanks for the quick analysis. I'm going to have a look.

Andreas


Tilman

On 22.03.2022 at 18:21, Tilman Hausherr wrote:
I don't have much time right now, but I just tested 077867.pdf and 392443.pdf and it's definitely a regression. I wonder if it was PDFBOX-5283.

The files in content_diffs_no_exceptions.xls where the T column is non-empty are suspicious and need more investigation.

Tilman


On 22.03.2022 at 13:29, Tim Allison wrote:
Reports are here:
https://corpora.tika.apache.org/base/reports/tika-2.3-vs-2.4-pdfs.tgz

It looks like there are no significant changes. There are some diffs on a
few files, but this was run on ~800k PDFs.

There are a couple of cases where a file is now being detected as
rfc822 instead of PDF.  We have to fix that on the Tika side.

On Mon, Mar 21, 2022 at 12:53 PM Andreas Lehmkuehler <andr...@lehmi.de> wrote:

On 21.03.22 at 12:21, Tim Allison wrote:
I'm happy to run the tests today if that would be of any interest.
Yes, please.

TIA
Andreas


On Sun, Mar 20, 2022 at 5:01 PM Andreas Lehmkuehler <andr...@lehmi.de> wrote:
On 13.03.22 at 14:20, Tim Allison wrote:
From Tika's perspective, there's no rush. We're waiting for a bug fix
in POI (TIKA-3699).

Please let me know if/when I should run the regression tests.
Thanks for the offer. Do we need to run the tests before cutting the release?

Most of the tickets aren't related to text extraction. Those which are related
should decrease the number of exceptions and increase the accuracy.

WDYT?


Thank you, all!

Cheers,

               Tim

On Sat, Mar 12, 2022 at 5:29 AM Andreas Lehmkuehler <andr...@lehmi.de> wrote:
On 11.03.22 at 08:30, Tilman Hausherr wrote:
On 11.03.2022 at 08:19, Andreas Lehmkuehler wrote:
On 10.03.22 at 20:16, Tilman Hausherr wrote:
I'd agree but that might mean PDFBOX-5384 wouldn't be fixed.
It's been there for quite some time and it seems to be a rare corner case. IMHO
it can wait if we don't find a solution before Monday.
No, that one was created on March 2nd. Oliver has just posted a suggestion so
maybe that is a solution.
The ticket is quite new, but the issue itself was introduced in 2018 with
2.0.12. ;-)

However, I'll have a look at the proposed solution.

Andreas
Tilman


WDYT?

Andreas

Tilman

On 10.03.2022 at 19:05, Andreas Lehmkuehler wrote:
On 09.03.22 at 17:07, Tim Allison wrote:
All,

I've been out of the office for a bit and haven't caught up yet.
Apologies if I've missed the discussion.

Are there plans for a 2.0.26 release?
How about cutting the release next Monday?

Andreas

We're probably a few weeks out from starting our next 1.x and 2.x
releases on Tika, and it would be great to incorporate 2.0.26.  No
problem at all if 2.0.26 is slated for later.

Thank you!

Cheers,

            Tim

On Fri, Mar 4, 2022 at 10:46 PM Tilman Hausherr <thaush...@t-online.de> wrote:
On 24.02.2022 at 07:41, Andreas Lehmkuehler wrote:
On 22.02.22 at 07:49, Andreas Lehmkuehler wrote:
Hi,

I'm planning to cut a new JBIG2 release next week. There aren't that
many changes, but I think the fixes are worth releasing. [1]
I'm going to cut the release next weekend, if nobody objects.

Once it is done, we should think about a 2.0.26 release of PDFBox.

Yes please!

Tilman


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



