Hi,

A quick follow-up: I traced class loading/serialization during a 47-second
test run. The three classes with the highest repeat counts were:

RDDLossFunction.class   3825 times
SparkContext.class      1437 times
RDD.class               1437 times

This seems to reinforce the idea that the same class-level metadata is
being analyzed repeatedly during closure cleaning.

Since Spark already uses Guava cache elsewhere, would a small bounded,
configurable cache make sense here? I assume lambda/dynamically generated
classes should be excluded or handled carefully.
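
To make the idea concrete, here is a minimal sketch of what I have in mind.
It is not a patch. It uses only the JDK (an access-ordered LinkedHashMap) so
it runs standalone; in Spark I would expect Guava's
CacheBuilder.newBuilder().maximumSize(n) to take its place. The names
ClassMetadataCache, isCacheable and loadCount are hypothetical, the loader
function merely stands in for whatever per-class analysis ClosureCleaner
performs, and the lambda check is my assumption about how generated classes
could be recognized.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * Hypothetical sketch of a small bounded LRU cache for per-class metadata.
 * Not Spark code: the loader is a placeholder for the real analysis step.
 */
final class ClassMetadataCache<V> {
    private final int maxEntries;
    private final Function<Class<?>, V> loader;
    private final Map<Class<?>, V> lru;
    private int loadCount = 0;

    ClassMetadataCache(int maxEntries, Function<Class<?>, V> loader) {
        this.maxEntries = maxEntries;
        this.loader = loader;
        // accessOrder = true keeps the least recently used entry first,
        // so removeEldestEntry evicts in LRU order once the bound is exceeded.
        this.lru = new LinkedHashMap<Class<?>, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Class<?>, V> eldest) {
                return size() > ClassMetadataCache.this.maxEntries;
            }
        };
    }

    /**
     * Assumption: lambda and other generated classes are skipped, since
     * caching them would mostly fill the cache with one-off entries.
     */
    static boolean isCacheable(Class<?> cls) {
        return !cls.isSynthetic()
            && !cls.isAnonymousClass()
            && !cls.getName().contains("$$Lambda");
    }

    /** Returns cached metadata, computing it on a miss or for excluded classes. */
    synchronized V get(Class<?> cls) {
        if (!isCacheable(cls)) {
            loadCount++;
            return loader.apply(cls); // bypass the cache entirely
        }
        V value = lru.get(cls);
        if (value == null) {
            loadCount++;
            value = loader.apply(cls);
            lru.put(cls, value); // may evict the least recently used entry
        }
        return value;
    }

    /** Number of times the loader actually ran (useful for before/after tracing). */
    synchronized int loadCount() {
        return loadCount;
    }

    synchronized int size() {
        return lru.size();
    }
}
```

With a bound of 2 and Class::getName as a dummy loader, two lookups of
String.class run the loader once, while a lambda's class runs it on every
lookup. One open design question: keying on Class<?> holds strong references
to the classes, so a real implementation would probably want weak keys
(Guava offers weakKeys()) to avoid pinning class loaders.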

Anything else worth considering before I open a JIRA ticket or propose a
patch?

Regards,
Ángel Álvarez



On Fri, 15 May 2026 at 22:44, Ángel Álvarez Pascua <
[email protected]> wrote:

> Hi,
>
> I'm fitting a Spark ML model using CrossValidator and noticed
> something unexpected while profiling a small dummy dataset with fewer than
> 100 rows.
>
> A significant part of the runtime appears to be spent repeatedly
> processing/serializing closures related to the same RDDLossFunction
> class. From what I can tell, this happens hundreds of times, likely once
> per serialized closure/task during the repeated loss-function evaluations.
>
> While looking into it, I noticed that ClosureCleaner.getClassReader does
> not seem to use any cache, so the same class metadata may be loaded and
> parsed repeatedly. I also saw the TODO in ClosureCleaner about caching
> outerClasses, innerClasses, and accessedFields.
>
> Do you think it would make sense to add a small bounded LRU cache for
> class-level analysis in ClosureCleaner, for example around getClassReader
> or the derived closure metadata?
>
> I’m interested in your opinion before opening a JIRA ticket or proposing a
> patch.
>
> Regards,
> Ángel Álvarez
>
