[ https://issues.apache.org/jira/browse/HIVE-10278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jakub Havlík updated HIVE-10278:
--------------------------------
Priority: Blocker (was: Major)
> Hive does not use Parquet projection to access structures
> ---------------------------------------------------------
>
> Key: HIVE-10278
> URL: https://issues.apache.org/jira/browse/HIVE-10278
> Project: Hive
> Issue Type: Bug
> Components: File Formats, Hive, Physical Optimizer, Query Planning,
> Query Processor, Types
> Affects Versions: 1.0.0
> Environment: CentOS 6.5, Cloudera 2.5.0-cdh5.3.0, 120 nodes in a
> cluster.
> Reporter: Jakub Havlík
> Priority: Blocker
> Labels: performance
>
> Selecting from a table stored in Parquet format with structures does not use
> projections as per the Parquet specification. This means that reading just one
> item from a structure results in reading the whole structure. This was found
> by the following test:
> Two tables (one flat, one with structures) were created as follows:
> drop table if exists test_flat;
> create table test_flat
> (urlurl string,
> urlvalid boolean,
> urlhost string,
> urldomain string,
> urlsubdomain string,
> urlprotocol string,
> urlsuffix string,
> urlmiddomain string,
> refererurl string,
> referervalid boolean,
> refererhost string,
> refererdomain string,
> referersubdomain string,
> refererprotocol string,
> referersuffix string,
> referermiddomain string)
> stored as parquet
> ;
> drop table if exists test_struct;
> create table test_struct
> (url struct<url:string, valid:boolean, host:string, domain:string,
> subdomain:string, protocol:string, suffix:string, middomain:string>,
> referer struct<url:string, valid:boolean, host:string, domain:string,
> subdomain:string, protocol:string, suffix:string, middomain:string>)
> stored as parquet;
> Size of these tables is:
> [havlik@ams07-015 ~]$ hdfs dfs -du -s -h /results/havlik/new_calibration/test_flat/
> 820.4 G 1.6 T /results/havlik/new_calibration/test_flat
> [havlik@ams07-015 ~]$ hdfs dfs -du -s -h /results/havlik/new_calibration/test_struct/
> 822.6 G 1.6 T /results/havlik/new_calibration/test_struct
> Struct SELECT:
> select
> count(*)
> from
> test_struct
> where
> url.valid = true
> and referer.valid = true;
> Flat SELECT:
> select
> count(*)
> from
> test_flat
> where
> urlvalid = true
> and referervalid = true;
> CPU time:
> flat: 11785 seconds
> struct: 38004 seconds
> HDFS bytes read:
> flat: 1 812 148 468
> struct: 883 774 856 844 (which is the total size of the table)
> Using a custom MapReduce job, it is possible to use projections into
> structures and get results similar to the flat table. Hive needs to implement
> this, because the current behavior causes unnecessary disk reads and CPU time
> overhead and cripples performance.
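
To put the reported figures in proportion, the following is a small sketch that computes how much more CPU time and HDFS I/O the struct query used than the flat one. The numbers are copied from the report; the names (cpu_seconds, hdfs_bytes_read, ratio) are illustrative only and do not come from Hive or Parquet.

```python
# Figures copied from the issue description (not re-measured).
cpu_seconds = {"flat": 11785, "struct": 38004}
hdfs_bytes_read = {"flat": 1_812_148_468, "struct": 883_774_856_844}


def ratio(measurements):
    """Return how many times larger the struct figure is than the flat one."""
    return measurements["struct"] / measurements["flat"]


cpu_ratio = ratio(cpu_seconds)
bytes_ratio = ratio(hdfs_bytes_read)

print(f"CPU time: struct is {cpu_ratio:.1f}x flat")
print(f"HDFS bytes read: struct is {bytes_ratio:.0f}x flat")
```

With the reported values, the struct query used roughly 3.2 times the CPU time and read roughly 488 times as many bytes from HDFS as the flat query.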
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)