[GitHub] spark pull request: [SPARK-14070] [SQL] Use ORC data source for SQ...

rxin Tue, 22 Mar 2016 14:00:13 -0700

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/11891#issuecomment-200022810
  
    This is pretty cool -- the Hive path is ridiculously slow. BTW, I tried comparing Parquet vs ORC on the current Spark master branch.
    
    I generated 100 million rows with one double column and one string column:
    
    ```
    // rand() comes from org.apache.spark.sql.functions
    import org.apache.spark.sql.functions.rand

    sqlContext.range(100 * 1000 * 1000)
      .select(rand().as("numeric"), rand().cast("string").as("string"))
      .write.parquet("testdata/random.parquet")
    ```
    
    Then I read back just the string column:
    
    ```
    // Times a full scan of the string column and prints the elapsed milliseconds.
    def measureParquet(): Unit = {
      val start = System.nanoTime
      sqlContext.read.parquet("testdata/random.parquet").selectExpr("count(string)").show()
      val end = System.nanoTime
      println((end - start) / 1000 / 1000)
    }
    measureParquet()
    ```
    
    Parquet with gzip compression takes ~12 secs. Parquet with snappy compression takes ~7 secs. ORC takes ~24 secs. We can definitely optimize the current ORC implementation more too.
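
    For anyone repeating the comparison, a rough sketch of a reusable timer might look like the following. `timeMs` is a hypothetical helper that is not part of the measurement above; it only generalizes the same `System.nanoTime` pattern so that the same block can be timed several times and for either format.

    ```scala
    // Hypothetical helper (an assumption, not from the original comment):
    // runs `body` `runs` times and returns the wall-clock milliseconds of each run.
    def timeMs[A](runs: Int)(body: => A): Seq[Long] =
      (1 to runs).map { _ =>
        val start = System.nanoTime
        body // the by-name block is evaluated once per run
        (System.nanoTime - start) / 1000 / 1000
      }

    // Stand-in workload so the sketch runs without Spark.
    val timings = timeMs(3) { (1 to 1000000).map(_.toLong).sum }
    println(timings.mkString(", "))

    // In spark-shell the Parquet read above could be wrapped the same way:
    //   timeMs(3) {
    //     sqlContext.read.parquet("testdata/random.parquet").selectExpr("count(string)").show()
    //   }
    // An ORC run would need the same data written out as ORC first; the path and
    // steps for that are not given in this comment.
    ```

    Taking more than one run helps separate first-read effects (JIT warm-up, OS file cache) from steady-state scan time, which matters when the numbers being compared differ by only a few seconds.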



