Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/17014
I'm trying to refresh my memory and clarify the goals for this topic.
Basically we want to achieve:
1. Avoid double caching. If the input Dataset is already cached, then we should
not cache the internal RDD.
2. If the input Dataset is not cached, some algorithms may need internal RDD
caching to avoid the MLlib warning and unnecessary re-computation. But I'm not
sure about the scope (should we add this for all the algorithms? That would be
a behavior change for many of them). I don't think we have a reliable way to
detect whether a Dataset should be cached (its parent may already be cached),
so I'm not sure we should take action based on such a speculative condition.
3. Avoid public API change.
Let me know if I missed anything. I'll check the code now.
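For point 1, a minimal sketch of what I have in mind (not the actual PR code): an
algorithm could skip internal caching when the caller has already persisted the
input. This assumes `Dataset.storageLevel` (available since Spark 2.1) is the
signal; note it does not detect a cached parent, which is exactly the open
question in point 2. The method name and storage level below are illustrative
only.

```scala
import org.apache.spark.sql.Dataset
import org.apache.spark.storage.StorageLevel

// Hypothetical training entry point, for illustration only.
def fitWithConditionalCaching[T](dataset: Dataset[T]): Unit = {
  // Only cache the internal RDD if the caller has not cached the Dataset.
  val handlePersistence = dataset.storageLevel == StorageLevel.NONE
  val instances = dataset.rdd  // internal representation used by the algorithm

  if (handlePersistence) {
    instances.persist(StorageLevel.MEMORY_AND_DISK)
  }
  try {
    // ... run the iterative training loop over `instances` ...
  } finally {
    if (handlePersistence) {
      instances.unpersist()
    }
  }
}
```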