Hi, I'm fitting a Spark ML model using CrossValidator and noticed something unexpected while profiling a run on a small dummy dataset with fewer than 100 rows.
A significant part of the runtime appears to be spent repeatedly processing/serializing closures related to the same RDDLossFunction class. From what I can tell, this happens hundreds of times, likely once per serialized closure/task during the repeated loss-function evaluations.

While looking into it, I noticed that ClosureCleaner.getClassReader does not seem to use any cache, so the same class metadata may be loaded and parsed repeatedly. I also saw the TODO in ClosureCleaner about caching outerClasses, innerClasses, and accessedFields.

Do you think it would make sense to add a small bounded LRU cache for class-level analysis in ClosureCleaner, for example around getClassReader or the derived closure metadata? I'm interested in your opinion before opening a JIRA ticket or proposing a patch.

Regards,
Ángel Álvarez
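
P.S. To make the idea concrete, here is a minimal sketch of the kind of bounded LRU cache I have in mind. It is plain Java built on java.util.LinkedHashMap and is not taken from Spark: the name BoundedLruCache, the getOrCompute method, and the choice of key and value types are all hypothetical. In ClosureCleaner the key might be a class name and the value whatever the class-level analysis produces, but that mapping is an assumption on my side.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper, not part of Spark: a small, bounded, thread-safe LRU cache.
final class BoundedLruCache<K, V> {
    private final LinkedHashMap<K, V> entries;
    private int misses = 0;

    BoundedLruCache(final int capacity) {
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                // Called after each insertion; evicts the least recently used entry
                // once the map grows beyond the configured capacity.
                return size() > capacity;
            }
        };
    }

    // Returns the cached value for key, or computes, stores, and returns it.
    // A null result from the loader is treated as "not cached" and recomputed next time.
    synchronized V getOrCompute(K key, Function<K, V> loader) {
        V cached = entries.get(key);
        if (cached != null) {
            return cached;
        }
        misses++;
        V value = loader.apply(key);
        entries.put(key, value);
        return value;
    }

    // containsKey does not count as an access, so it leaves the LRU order unchanged.
    synchronized boolean contains(K key) {
        return entries.containsKey(key);
    }

    synchronized int size() {
        return entries.size();
    }

    // Number of times the loader had to run, i.e. the work the cache did not save.
    synchronized int misses() {
        return misses;
    }
}
```

With something like this, the expensive per-class parsing would run once per distinct class rather than once per cleaned closure, and the size bound would keep memory use predictable. One thing I am unsure about is whether a class name alone is a safe key when several class loaders are involved, so a real patch would probably need to account for that.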
