Github user bersprockets commented on the issue:
https://github.com/apache/spark/pull/21043
@gatorsmile
On my laptop, running
<pre>
spark.sql("select * from hive_table").write.mode(SaveMode.Overwrite).csv("outputfile.csv")
</pre>
Input | master<br>runtime | branch<br>runtime
--- | --- | ---
6000 cols, 150k rows | 59 minutes | 2.6 minutes
3000 cols, 150k rows | 13.6 minutes | 1.2 minutes
20 cols, 150k rows | 7.6 seconds | 7.7 seconds
20 cols, 1m rows | 10 seconds | 8.6 seconds
The branch runtimes scale roughly linearly with the number of columns, so the branch is much faster than master when there are many columns and about the same when there are few.
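For reference, a minimal sketch of how I timed the write (the table name `hive_table` and output path are just the ones from the snippet above; this assumes a Spark build with Hive support and a metastore containing that table):

<pre>
import org.apache.spark.sql.{SaveMode, SparkSession}

object CsvWriteBenchmark {
  def main(args: Array[String]): Unit = {
    // Assumes a Hive metastore with a table named hive_table
    val spark = SparkSession.builder()
      .appName("CsvWriteBenchmark")
      .enableHiveSupport()
      .getOrCreate()

    val start = System.nanoTime()
    spark.sql("select * from hive_table")
      .write
      .mode(SaveMode.Overwrite)
      .csv("outputfile.csv")
    val elapsedSec = (System.nanoTime() - start) / 1e9
    println(f"write took $elapsedSec%.1f seconds")

    spark.stop()
  }
}
</pre>

Note that `.csv("outputfile.csv")` writes a directory of part files, not a single file, but the timing is unaffected either way.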