In "parse_time_millis_details.xlsx", some files took much longer in 3.x
during the multithreaded run but show little difference single-threaded;
this is likely an accident of the resources available at parse time.

Overall, the sum of processing times across all files is very similar.

However, I did find two files that really do take far more time
single-threaded in 3.x than in 2.x.  Again, I'm not sure these need to be
dealt with immediately, and the time required may be a fault of Tika, not
PDFBox.

commoncrawl3_refetched/SO/SONYLMWCHDDEOC3D5OHEXDTOJ7NGVODV
commoncrawl3_refetched/OL/OLZ5TAS53B4BDC673OFMWZE5DDZ7ZGIN


On Wed, Jun 15, 2022 at 6:49 AM Tim Allison <talli...@apache.org> wrote:

> I had a chance to look at new_catastrophic_exceptions_in_b, and the three
> files in there take roughly the same amount of time and resources.  I think
> they failed on trunk only because of the whims of multithreading and
> available resources at the time.
>
> This file is admittedly quite large, but it was able to take up an
> unhealthy amount of resources (both RAM and CPU):
> bug_trackers/evince/evince-LINK-1250-0.pdf in both 2.x and 3.x (source:
> https://gitlab.gnome.org/GNOME/evince/-/issues/1250).  I don't think
> there's anything to be done for that one immediately.
>
>
> On Wed, Jun 15, 2022 at 6:19 AM Tim Allison <talli...@apache.org> wrote:
>
>> Reports are here:
>> https://corpora.tika.apache.org/base/reports/pdfbox-3-20220614.tgz
>>
>> On Mon, Jun 13, 2022 at 4:54 PM Tim Allison <talli...@apache.org> wrote:
>>
>>> Just seeing this now.  Yes.  I'll kick off the tests tomorrow morning (ET).
>>>
>>> On Sat, Jun 11, 2022 at 8:09 AM Andreas Lehmkuehler <andr...@lehmi.de>
>>> wrote:
>>>
>>>> I've fixed PDFBOX-5452 and found/fixed another one, see PDFBOX-5456
>>>>
>>>> @Tim is there any chance to rerun the regression tests?
>>>>
>>>> Thanks in advance
>>>> Andreas
>>>>
>>>> On 07.06.22 at 08:06, Andreas Lehmkuehler wrote:
>>>> > I've found another regression, see PDFBOX-5452
>>>> >
>>>> > Andreas
>>>> >
>>>> > On 29.05.22 at 18:37, Andreas Lehmkuehler wrote:
>>>> >> Thanks Tim,
>>>> >>
>>>> >> looks like there are some regressions, see PDFBOX-5444 and PDFBOX-5447.
>>>> >>
>>>> >> Maybe there are more to come ....
>>>> >>
>>>> >> Andreas
>>>> >>
>>>> >>
>>>> >> On 26.05.22 at 15:04, Tim Allison wrote:
>>>> >>> Apologies for my delay.  I ran trunk/3.x on May 12 against 2.0.26.  The
>>>> >>> reports are here:
>>>> >>>
>>>> >>> https://corpora.tika.apache.org/base/reports/reports_pdfbox_3x_20220512.tgz
>>>> >>>
>>>> >>> Happy to rerun with a more recent version of trunk.
>>>> >>>
>>>> >>> Cheers,
>>>> >>>
>>>> >>>        Tim
>>>> >>>
>>>> >>> On Sun, May 8, 2022 at 1:21 PM Andreas Lehmkuehler <andr...@lehmi.de> wrote:
>>>> >>>
>>>> >>>> On 06.05.22 at 14:30, Tim Allison wrote:
>>>> >>>>> All,
>>>> >>>>>     Let me know when it makes sense to run the text extraction regression
>>>> >>>> Yes, it'd be useful to have some updated results.
>>>> >>>>
>>>> >>>> How about comparing 2.0.26 vs. 3.0.0-alpha3 and maybe 3.0.0-alpha2 vs.
>>>> >>>> 3.0.0-alpha3?
>>>> >>>>
>>>> >>>>
>>>> >>>>> tests for 3.x.  I regret I haven't been following our mailing list as
>>>> >>>>> closely as I should be.
>>>> >>>> No need to worry, everything is fine.
>>>> >>>>
>>>> >>>> Andreas
>>>> >>>>
>>>> >>>>>
>>>> >>>>>              Cheers,
>>>> >>>>>
>>>> >>>>>                          Tim
>>>> >>>>>
>>>> >>>>>
>>>> ---------------------------------------------------------------------
>>>> >>>>> To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
>>>> >>>>> For additional commands, e-mail: dev-h...@pdfbox.apache.org
>>>> >>>>>
