Github user bersprockets commented on the issue:
https://github.com/apache/spark/pull/21043
@gatorsmile
On my laptop, running
<pre>
spark.sql("select * from hive_table").write.mode(SaveMode.Overwrite).csv("outputfile.csv")
</pre>
Input | master<br>runtime | branch<br>runtime
--- | --- | ---
6000 cols, 150k rows | 59 minutes | 2.6 minutes
3000 cols, 150k rows | 13.6 minutes | 1.2 minutes
20 cols, 150k rows | 7.6 seconds | 7.7 seconds
20 cols, 1m rows | 10 seconds | 8.6 seconds
The branch runtimes scale roughly linearly with the number of columns, so the branch is much faster than master when there are many columns and about the same when there are few.
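For reference, a minimal sketch of how I timed the write (the table name `hive_table` and output path are just the ones from the snippet above; this assumes a Spark build with Hive support and a metastore containing that table):

<pre>
import org.apache.spark.sql.{SaveMode, SparkSession}

object CsvWriteBenchmark {
  def main(args: Array[String]): Unit = {
    // Assumes a Hive metastore with a table named hive_table
    val spark = SparkSession.builder()
      .appName("CsvWriteBenchmark")
      .enableHiveSupport()
      .getOrCreate()

    val start = System.nanoTime()
    spark.sql("select * from hive_table")
      .write
      .mode(SaveMode.Overwrite)
      .csv("outputfile.csv")
    val elapsedSec = (System.nanoTime() - start) / 1e9
    println(f"write took $elapsedSec%.1f seconds")

    spark.stop()
  }
}
</pre>

Note that `.csv("outputfile.csv")` writes a directory of part files, not a single file, but the timing is unaffected either way.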