[
https://issues.apache.org/jira/browse/HIVE-18810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379702#comment-16379702
]
Owen O'Malley commented on HIVE-18810:
--------------------------------------
It isn't clear that a JIRA issue is the best way of documenting this. We
should probably add a page to either the Hive wiki or the ORC website.
That said, you can see my presentation on the file format benchmarks:
https://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
There has also been a lot of recent work to improve the performance of
reading ORC from Spark:
https://community.hortonworks.com/articles/148917/orc-improvements-for-apache-spark-22.html
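As a minimal sketch of trying that work out, assuming Apache Spark 2.3+
(where the native ORC reader configuration landed), from the spark-sql
shell:

    SET spark.sql.orc.impl=native;          -- use the new vectorized ORC reader
    SET spark.sql.orc.filterPushdown=true;  -- push filters down into the ORC reader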
Other comparisons:
ORC predicate pushdown happens at three levels: file, stripe, and every
10,000 rows (the default row index stride).
Parquet predicate pushdown happens at two levels: file and row group
(Parquet's rough equivalent of the stripe).
ORC has optional Bloom filters; Parquet doesn't. This is related to the
previous point, because Bloom filters only make sense at levels below the
stripe level (see the example after this list).
ORC's type system is much closer to Hive's than Parquet's is.
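To make the pushdown and Bloom filter points concrete, here is a hedged
Hive example. The table and column names are hypothetical; the settings
shown are the standard Hive/ORC options:

    -- Let Hive push predicates into the ORC reader
    -- (evaluated at the file, stripe, and 10,000-row levels).
    SET hive.optimize.index.filter=true;

    -- Hypothetical table: build Bloom filters on user_id so point lookups
    -- can skip stripes and 10,000-row groups that cannot match.
    CREATE TABLE events (
      event_id BIGINT,
      user_id  STRING,
      ts       TIMESTAMP
    )
    STORED AS ORC
    TBLPROPERTIES (
      'orc.bloom.filter.columns'='user_id',
      'orc.bloom.filter.fpp'='0.05'   -- target false-positive probability
    );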
> Parquet Or ORC
> --------------
>
> Key: HIVE-18810
> URL: https://issues.apache.org/jira/browse/HIVE-18810
> Project: Hive
> Issue Type: Test
> Components: Hive
> Affects Versions: 1.1.0
> Environment: Hadoop 1.2.1
> Hive 1.1
> Reporter: Suddhasatwa Bhaumik
> Priority: Major
>
> Hello Experts,
> I would like to know for which data types (based on the size and complexity
> of the data) one should use Parquet or ORC tables in Hive. For example, on
> Hadoop 0.20.0 with Hive 0.13, the performance of ORC tables in Hive is very
> good even when accessed by third-party BI systems like SAP Business Objects
> or Tableau; the same tests on Hadoop 1.2.1 with Hive 1.1 do not yield such
> reliable queries: although ETL and insert/update operations take nominal
> time, read performance is not within acceptable limits.
> Kindly advise.
> Thanks
> [~suddhasatwa_bhaumik]