GitHub user squito commented on the issue:
https://github.com/apache/spark/pull/21346
> is this effectively dead code at this point?
Yes, that's right. This PR by itself is not useful. It's a step towards
https://github.com/apache/spark/pull/21451
This is a good point to put in the PR summary -- I'll do that, and also
include your summary notes above, if you don't mind.
> what are the major risks of this change in terms of introducing
performance or correctness issues? If we identify risks (e.g. "this is a
historically tricky area of code?"), can we mitigate those risks through
correctness testing / load testing?
I've made an effort to make minimal modifications to all existing code
paths, to minimize the risk of introducing bugs in current functionality. My
intention is to initially turn it on by default only for cases we know would
fail with the old code -- when the data is larger than 2 GB
([SPARK-24297](https://issues.apache.org/jira/browse/SPARK-24297)). I've added
unit tests and shared the test I'm running on a cluster to find holes in
functionality (posted on the parent JIRA here:
https://issues.apache.org/jira/browse/SPARK-6235?focusedCommentId=16484069&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16484069).
I have not done load testing yet but plan to. Extra testing, of course,
would certainly be good.