GitHub user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19229
Oh. That's what I did in the old PR #18902, because the RDD version (not in
the master branch, only my personal implementation; sorry for posting the
wrong link earlier, the code is here:
https://github.com/apache/spark/pull/18902/commits/8daffc9007c65f04e005ffe5dcfbeca634480465)
is faster than the DataFrame version on current Spark. Now that your PR
improves the performance, I would like to compare them again. We hope to
track this performance gap and try to resolve it in the future. In another
similar case of mine, the DataFrame version is currently about 2-3x slower
than the RDD version when numCols==100. If you don't have time, I can help
run the comparison. Thanks!