GitHub user rxin commented on the pull request:
https://github.com/apache/spark/pull/11891#issuecomment-200022810
This is pretty cool -- the Hive path is ridiculously slow. BTW, I tried
comparing Parquet vs. ORC on the current Spark master branch.
I generated 100 million rows with one double column and one string column:
```
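import org.apache.spark.sql.functions.rand

// 100M rows: one random double column and the same kind of value cast to a string
sqlContext.range(100 * 1000 * 1000)
  .select(rand().as("numeric"), rand().cast("string").as("string"))
  .write.parquet("testdata/random.parquet")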
```
Then I read back just the string column:
```
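// Time one full scan that counts the string column and print the elapsed milliseconds
def measureParquet(): Unit = {
  val start = System.nanoTime
  sqlContext.read.parquet("testdata/random.parquet").selectExpr("count(string)").show()
  val end = System.nanoTime
  println(s"${(end - start) / 1000 / 1000} ms")
}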
measureParquet()
```
Parquet with gzip compression takes ~12 secs. Parquet with snappy
compression takes ~7 secs. ORC takes ~24 secs. We can definitely optimize the
current ORC implementation more too.
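
The comment does not show how the ORC run or the codec switch was done, so the following is only a rough sketch of what the ORC side could look like. It assumes a spark-shell session where `sqlContext` is predefined (and is a HiveContext, which ORC needed in Spark 1.x), that the ORC data was generated the same way as the Parquet data, and that the codec was chosen through the `spark.sql.parquet.compression.codec` setting. The path `testdata/random.orc` and the helper `timeMs` are made up for illustration.

```scala
import org.apache.spark.sql.functions.rand

// Assumption: same data shape as the Parquet test, written as ORC.
// "testdata/random.orc" is a hypothetical path, not from the original comment.
sqlContext.range(100 * 1000 * 1000)
  .select(rand().as("numeric"), rand().cast("string").as("string"))
  .write.orc("testdata/random.orc")

// Hypothetical helper: runs `body` once and returns elapsed wall-clock milliseconds.
def timeMs(body: => Unit): Long = {
  val start = System.nanoTime
  body
  (System.nanoTime - start) / 1000 / 1000
}

// Same query as the Parquet measurement, but reading the ORC copy.
val orcMs = timeMs {
  sqlContext.read.orc("testdata/random.orc").selectExpr("count(string)").show()
}
println(s"ORC: $orcMs ms")

// Assumption: the Parquet codec was switched with this setting before
// rewriting the Parquet data ("gzip" or "snappy").
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
```

A single timed run like this includes JVM warm-up and file-system caching effects, so repeating the measurement a few times would give a fairer comparison.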