Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/18902
+1 for using the DataFrame-based code.
@zhengruifeng One thing I want to confirm: I checked your testing code, and
both the RDD-based and the DataFrame-based versions pay the deserialization
cost:
```
...
val df = spark.createDataFrame(rows, struct)
df.persist()
df.count()
...
// do `imputer.fit`
```
When running `imputer.fit`, it extracts the required columns from the
cached input DataFrame, and then you compare the performance of
`RDD.aggregate` against the DataFrame `avg`. Both need to deserialize the
input data before doing the computation, and the DataFrame `avg` takes
advantage of codegen, so it should be faster. But the degree to which your
test shows the RDD version being slower than the DataFrame version does not
seem reasonable, so I want to confirm:
in your RDD-version test, do you cache again after getting the `RDD` from
the input `DataFrame`?
If not, then your test itself is fine, and I would guess there is some
other performance issue in the SQL layer; cc @cloud-fan to take a look.
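For reference, a rough sketch of what I mean by caching again (the column
name `value` is illustrative, and this assumes an active `SparkSession` and
the cached `df` from the setup above):

```
import org.apache.spark.sql.functions.{avg, col}

// DataFrame path: codegen-backed average over the cached input.
val dfMean = df.select(avg(col("value"))).first().getDouble(0)

// RDD path: extract the column, then persist the resulting RDD so that
// RDD.aggregate does not re-pay the extraction/deserialization cost from
// the DataFrame on every pass over the data.
val rdd = df.select("value").rdd.map(_.getDouble(0))
rdd.persist()
rdd.count()  // materialize the cache before timing the aggregate

val (sum, cnt) = rdd.aggregate((0.0, 0L))(
  (acc, v) => (acc._1 + v, acc._2 + 1L),
  (a, b)   => (a._1 + b._1, a._2 + b._2))
val rddMean = sum / cnt
```

Without the `rdd.persist()` + `rdd.count()` step, the RDD timing would also
include the column extraction, which would make the comparison unfair.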