On 15.06.22 at 13:07, Tim Allison wrote:
In "parse_time_millis_details.xlsx", there are some files that took much
longer in 3.x during the multithreaded run but do not show much of a
difference single-threaded...likely accidents of resources available at
parse time.

Overall, the sum of processing times across all files is very similar.

However, I did find two files that really do take up far more time
single-threaded in 3.x vs. 2.x.  Again, I'm not sure these need to be dealt with
immediately, and the time required may be a fault of Tika, not PDFBox.
I did some rendering tests and I can't see any significant difference, but I didn't do a scientific test with real figures ;-)
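
The comparison described above (similar totals overall, but a few files that are far slower single-threaded in 3.x) can be sketched roughly as follows. This is only an illustration, not the tooling behind the Tika reports: the function name `find_slowdowns`, both thresholds, and the sample timings are hypothetical.

```python
def find_slowdowns(times_2x, times_3x, min_ratio=2.0, min_delta_ms=1000):
    """Return (file, ms_2x, ms_3x) for files notably slower in 3.x, worst first.

    Both arguments map a file name to its parse time in milliseconds.
    A file is flagged only if it is slower by an absolute margin AND by a
    relative factor, so tiny files with noisy timings are not reported.
    """
    flagged = []
    # Only files present in both runs can be compared.
    for name in times_2x.keys() & times_3x.keys():
        old, new = times_2x[name], times_3x[name]
        if new - old >= min_delta_ms and new >= min_ratio * max(old, 1):
            flagged.append((name, old, new))
    # Largest absolute slowdown first.
    return sorted(flagged, key=lambda row: row[2] - row[1], reverse=True)


# Hypothetical single-threaded parse times in milliseconds.
times_2x = {"a.pdf": 120, "b.pdf": 900, "c.pdf": 4000}
times_3x = {"a.pdf": 130, "b.pdf": 15000, "c.pdf": 4100}
slow = find_slowdowns(times_2x, times_3x)
```

Requiring both an absolute and a relative difference is a design choice made for this sketch; running it on single-threaded timings avoids the resource contention noise mentioned for the multithreaded run.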

commoncrawl3_refetched/SO/SONYLMWCHDDEOC3D5OHEXDTOJ7NGVODV
The file looks like a PDF containing Arabic text, but most of the text isn't text at all. The PDF uses line graphics for the content. So the question is: what does Tika do in such cases, and why does 3.x seem to be slower than 2.x?

commoncrawl3_refetched/OL/OLZ5TAS53B4BDC673OFMWZE5DDZ7ZGIN
This file is similar to the other one. It contains a lot of graphics and not that much text.

Maybe something with the rendering code and/or default settings is different and leads to slower results in 3.x?

Andreas


On Wed, Jun 15, 2022 at 6:49 AM Tim Allison <[email protected]> wrote:

I had a chance to look at new_catastrophic_exceptions_in_b, and the three
files in there take roughly the same amount of time and resources.  I think
they failed on trunk only because of the whims of multithreading and
available resources at the time.

This file is admittedly quite large, but it was able to take up an
unhealthy amount of resources (both RAM and CPU):
bug_trackers/evince/evince-LINK-1250-0.pdf in both 2.x and 3.x (source:
https://gitlab.gnome.org/GNOME/evince/-/issues/1250).  I don't think
there's anything to be done for that one immediately.


On Wed, Jun 15, 2022 at 6:19 AM Tim Allison <[email protected]> wrote:

Reports are here:
https://corpora.tika.apache.org/base/reports/pdfbox-3-20220614.tgz

On Mon, Jun 13, 2022 at 4:54 PM Tim Allison <[email protected]> wrote:

Just seeing this now.  Y.  I'll kick off the tests tomorrow morning (ET).

On Sat, Jun 11, 2022 at 8:09 AM Andreas Lehmkuehler <[email protected]>
wrote:

I've fixed PDFBOX-5452 and found/fixed another one, see PDFBOX-5456

@Tim is there any chance to rerun the regression tests?

Thanks in advance
Andreas

On 07.06.22 at 08:06, Andreas Lehmkuehler wrote:
I've found another regression, see PDFBOX-5452

Andreas

On 29.05.22 at 18:37, Andreas Lehmkuehler wrote:
Thanks Tim,

looks like there are some regressions, see PDFBOX-5444 and
PDFBOX-5447.

Maybe there are more to come ....

Andreas


On 26.05.22 at 15:04, Tim Allison wrote:
Apologies for my delay.  I ran trunk/3.x on May 12 against 2.0.26.  The
reports are here:

https://corpora.tika.apache.org/base/reports/reports_pdfbox_3x_20220512.tgz

Happy to rerun with a more recent version of trunk.

Cheers,

        Tim

On Sun, May 8, 2022 at 1:21 PM Andreas Lehmkuehler <
[email protected]> wrote:

On 06.05.22 at 14:30, Tim Allison wrote:
All,
     Let me know when it makes sense to run the text extraction regression
tests for 3.x.  I regret I haven't been following our mailing list as
closely as I should be.
Yes, it'd be useful to have some updated results.

How about comparing 2.0.26 vs. 3.0.0-alpha3 and maybe 3.0.0-alpha2 vs.
3.0.0-alpha3?

No need to worry, everything is fine.

Andreas


              Cheers,

                          Tim


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



