Hi,

A quick follow-up: I traced class loading/serialization during a 47-second test run and saw these top 3 repeated counts:

RDDLossFunction.class  3825 times
SparkContext.class     1437 times
RDD.class              1437 times

This seems to reinforce the idea that the same class-level metadata is being analyzed repeatedly during closure cleaning.

Since Spark already uses Guava cache elsewhere, would a small bounded, configurable cache make sense here? I assume lambda/dynamically generated classes should be excluded or handled carefully.

Anything else worth considering before I open a JIRA ticket or propose a patch?

Regards,
Ángel Álvarez

On Fri, 15 May 2026, 22:44, Ángel Álvarez Pascua <[email protected]> wrote:

> Hi,
>
> I'm fitting a Spark ML model using CrossValidatorModel and noticed
> something unexpected while profiling a small dummy dataset with fewer than
> 100 rows.
>
> A significant part of the runtime appears to be spent repeatedly
> processing/serializing closures related to the same RDDLossFunction
> class. From what I can tell, this happens hundreds of times, likely once
> per serialized closure/task during the repeated loss-function evaluations.
>
> While looking into it, I noticed that ClosureCleaner.getClassReader does
> not seem to use any cache, so the same class metadata may be loaded and
> parsed repeatedly. I also saw the TODO in ClosureCleaner about caching
> outerClasses, innerClasses, and accessedFields.
>
> Do you think it would make sense to add a small bounded LRU cache for
> class-level analysis in ClosureCleaner, for example around getClassReader
> or the derived closure metadata?
>
> I’m interested in your opinion before opening a JIRA ticket or proposing a
> patch.
>
> Regards,
> Ángel Álvarez
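
To make the proposal concrete, here is a minimal sketch of the kind of bounded LRU cache I have in mind. It is plain Java on top of LinkedHashMap rather than Guava, and it is not Spark code: the class name ClassAnalysisCache, the loader callback, and the "$$Lambda" name check are all hypothetical placeholders for illustration. It also keys on the class name only, which ignores class loaders, so a real patch would likely need a more careful key.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical bounded LRU cache for per-class analysis results (illustration only). */
final class ClassAnalysisCache<V> {
    private final Map<String, V> entries;
    private int loads = 0;

    ClassAnalysisCache(final int maxEntries) {
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.entries = new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                // Evict the least recently used entry once the bound is exceeded.
                return size() > maxEntries;
            }
        };
    }

    /** Assumed heuristic: skip classes whose name carries a "$$Lambda" marker. */
    static boolean isCacheable(String className) {
        return !className.contains("$$Lambda");
    }

    /** Returns the cached value, or runs the loader and caches its result. */
    synchronized V get(String className, Function<String, V> loader) {
        if (!isCacheable(className)) {
            loads++;
            return loader.apply(className); // never cached
        }
        V cached = entries.get(className);
        if (cached == null) {
            loads++;
            cached = loader.apply(className);
            entries.put(className, cached);
        }
        return cached;
    }

    /** Number of times the loader actually ran. */
    synchronized int loadCount() {
        return loads;
    }

    synchronized int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        ClassAnalysisCache<Integer> cache = new ClassAnalysisCache<>(2);
        // Stand-in loader: String::length replaces the real class analysis.
        for (int i = 0; i < 3825; i++) {
            cache.get("RDDLossFunction", String::length);
        }
        System.out.println("loader invocations: " + cache.loadCount());
    }
}
```

With a cache like this, the 3825 lookups of the same class in the trace above would run the expensive analysis once instead of every time, as long as the entry is not evicted in between. The bound would be the configurable part.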
