[
https://issues.apache.org/jira/browse/HIVE-18810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379702#comment-16379702
]
Owen O'Malley commented on HIVE-18810:
--------------------------------------
It isn't clear that a JIRA issue is the best way of documenting this. We
should probably add a page to either the Hive wiki or the ORC website.
That said, you can see my presentation on the file format benchmarks:
https://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
There has also been a lot of recent work to improve the performance of
reading ORC from Spark:
https://community.hortonworks.com/articles/148917/orc-improvements-for-apache-spark-22.html
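As a minimal sketch of trying that work out, assuming Apache Spark 2.3+
(where the native ORC reader configuration landed), from the spark-sql
shell:

    SET spark.sql.orc.impl=native;          -- use the new vectorized ORC reader
    SET spark.sql.orc.filterPushdown=true;  -- push filters down into the ORC reader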
Other comparisons:
ORC predicate pushdown happens at three levels: file, stripe, and every
10,000 rows (the default row index stride).
Parquet predicate pushdown happens at two levels: file and row group
(Parquet's rough equivalent of the stripe).
ORC has optional Bloom filters; Parquet doesn't. This is related to the
previous point, because Bloom filters only make sense at levels below the
stripe level (see the example after this list).
ORC's type system is much closer to Hive's than Parquet's is.
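To make the pushdown and Bloom filter points concrete, here is a hedged
Hive example. The table and column names are hypothetical; the settings
shown are the standard Hive/ORC options:

    -- Let Hive push predicates into the ORC reader
    -- (evaluated at the file, stripe, and 10,000-row levels).
    SET hive.optimize.index.filter=true;

    -- Hypothetical table: build Bloom filters on user_id so point lookups
    -- can skip stripes and 10,000-row groups that cannot match.
    CREATE TABLE events (
      event_id BIGINT,
      user_id  STRING,
      ts       TIMESTAMP
    )
    STORED AS ORC
    TBLPROPERTIES (
      'orc.bloom.filter.columns'='user_id',
      'orc.bloom.filter.fpp'='0.05'   -- target false-positive probability
    );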
> Parquet Or ORC
> --------------
>
> Key: HIVE-18810
> URL: https://issues.apache.org/jira/browse/HIVE-18810
> Project: Hive
> Issue Type: Test
> Components: Hive
> Affects Versions: 1.1.0
> Environment: Hadoop 1.2.1
> Hive 1.1
> Reporter: Suddhasatwa Bhaumik
> Priority: Major
>
> Hello Experts,
> I would like to know for which data types (based on the size and complexity
> of the data) one should use Parquet or ORC tables in Hive. For example, on
> Hadoop 0.20.0 with Hive 0.13, the performance of ORC tables in Hive is very
> good even when accessed by third-party BI systems like SAP Business Objects
> or Tableau; the same tests on Hadoop 1.2.1 with Hive 1.1 do not yield such
> reliable queries: although ETL and insert/update operations take nominal
> time, read performance is not within acceptable limits.
> Kindly advise.
> Thanks
> [~suddhasatwa_bhaumik]