Hi,

I'm fitting a Spark ML model with CrossValidator and noticed
something unexpected while profiling a run on a small dummy dataset with
fewer than 100 rows.

A significant part of the runtime appears to be spent repeatedly
processing/serializing closures related to the same RDDLossFunction class.
From what I can tell, this happens hundreds of times, likely once per
serialized closure/task during the repeated loss-function evaluations.

While looking into it, I noticed that ClosureCleaner.getClassReader does
not seem to use any cache, so the same class metadata may be loaded and
parsed repeatedly. I also saw the TODO in ClosureCleaner about caching
outerClasses, innerClasses, and accessedFields.

Do you think it would make sense to add a small bounded LRU cache for
class-level analysis in ClosureCleaner, for example around getClassReader
or the derived closure metadata?
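
As a rough illustration of the idea, below is a minimal sketch of a small
bounded LRU cache. It is not taken from the Spark code base: the class name
BoundedLruCache, the getOrLoad method, and the choice of key are all
hypothetical. The loader would stand in for whatever expensive per-class work
is being repeated (for example reading the class bytes behind getClassReader).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper, not part of Spark: a bounded LRU cache that memoizes
// an expensive per-key computation, such as per-class closure analysis.
final class BoundedLruCache<K, V> {
    private final Map<K, V> entries;

    BoundedLruCache(final int maxEntries) {
        // accessOrder = true keeps entries ordered from least to most
        // recently used, so the "eldest" entry is the LRU one.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                // Evict the LRU entry once the bound is exceeded.
                return size() > maxEntries;
            }
        };
    }

    // Returns the cached value, computing and storing it on a miss.
    // Assumes the loader never returns null.
    synchronized V getOrLoad(K key, Function<K, V> loader) {
        V value = entries.get(key); // a hit also refreshes recency
        if (value == null) {
            value = loader.apply(key);
            entries.put(key, value);
        }
        return value;
    }

    // Does not affect recency ordering.
    synchronized boolean contains(K key) {
        return entries.containsKey(key);
    }

    synchronized int size() {
        return entries.size();
    }
}
```

A real version inside ClosureCleaner would probably need a key that accounts
for the class loader, not only the class name, and the bound keeps the memory
cost fixed no matter how many distinct closures a job serializes.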

I’m interested in your opinion before opening a JIRA ticket or proposing a
patch.

Regards,
Ángel Álvarez
