[jira] [Updated] (HIVE-10278) Hive does not use Parquet projection to access structures

JIRA Wed, 15 Apr 2015 01:04:35 -0700

     [ 
https://issues.apache.org/jira/browse/HIVE-10278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jakub Havlík updated HIVE-10278:
--------------------------------
    Priority: Blocker  (was: Major)

> Hive does not use Parquet projection to access structures
> ---------------------------------------------------------
>
>                 Key: HIVE-10278
>                 URL: https://issues.apache.org/jira/browse/HIVE-10278
>             Project: Hive
>          Issue Type: Bug
>          Components: File Formats, Hive, Physical Optimizer, Query Planning, 
> Query Processor, Types
>    Affects Versions: 1.0.0
>         Environment: CentOS 6.5, Cloudera 2.5.0-cdh5.3.0, 120 nodes in a 
> cluster.
>            Reporter: Jakub Havlík
>            Priority: Blocker
>              Labels: performance
>
> Selecting from a table stored in Parquet format with struct columns does not 
> use projections as described in the Parquet specification. This means that 
> reading just one field from a structure results in reading the whole 
> structure. This was found with the following test.
> Two tables (one flat, one with structures) were created as follows:
> drop table if exists test_flat;
> create table test_flat
>   (urlurl string,
>    urlvalid boolean,
>    urlhost string,
>    urldomain string,
>    urlsubdomain string,
>    urlprotocol string,
>    urlsuffix string,
>    urlmiddomain string,   
>    refererurl string,
>    referervalid boolean,
>    refererhost string,
>    refererdomain string,
>    referersubdomain string,
>    refererprotocol string,
>    referersuffix string,
>    referermiddomain string)
> stored as parquet
> ; 
> drop table if exists test_struct;
> create table test_struct
>   (url struct<url:string, valid:boolean, host:string, domain:string, 
> subdomain:string, protocol:string, suffix:string, middomain:string>,
>    referer struct<url:string, valid:boolean, host:string, domain:string, 
> subdomain:string, protocol:string, suffix:string, middomain:string>)
> stored as parquet; 
> The sizes of these tables are:
> [havlik@ams07-015 ~]$ hdfs dfs -du -s -h /results/havlik/new_calibration/test_flat/
> 820.4 G  1.6 T  /results/havlik/new_calibration/test_flat
> [havlik@ams07-015 ~]$ hdfs dfs -du -s -h /results/havlik/new_calibration/test_struct/
> 822.6 G  1.6 T  /results/havlik/new_calibration/test_struct
> Struct SELECT:
> select 
>     count(*)
> from 
>     test_struct
> where
>     url.valid = true
>     and referer.valid = true;
> Flat SELECT:
> select 
>     count(*)
> from 
>     test_flat
> where
>     urlvalid = true
>     and referervalid = true;
> CPU time:
> flat: 11785 seconds
> struct: 38004 seconds
> HDFS bytes read:
> flat: 1 812 148 468
> struct: 883 774 856 844 (which is the total size of the table)
> With a custom MapReduce job it is possible to use projections into structures 
> and get results similar to the flat table. Hive needs to implement this, 
> because the current behavior causes unnecessary disk reads and CPU time 
> overhead and cripples performance.
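
To make the expected behavior concrete, here is a minimal sketch of what a projection into the structures of test_struct would select. It is an illustration only, not Hive or Parquet library code: the helper names (prune, leaves, message_type), the message name hive_schema, and the choice of optional fields with a UTF8 annotation for strings are all assumptions. It prunes the nested schema down to the two fields used in the WHERE clause and renders the result in Parquet's message-type notation.

```python
# Hypothetical illustration; none of these helpers come from Hive or Parquet.
STRING = ("binary", "UTF8")   # assumed Parquet encoding of a Hive string
BOOLEAN = ("boolean", None)


def struct():
    """Fields of the url/referer structs as declared for test_struct."""
    return {
        "url": STRING, "valid": BOOLEAN, "host": STRING, "domain": STRING,
        "subdomain": STRING, "protocol": STRING, "suffix": STRING,
        "middomain": STRING,
    }


SCHEMA = {"url": struct(), "referer": struct()}


def prune(schema, paths):
    """Keep only the leaves named by dotted paths such as 'url.valid'."""
    pruned = {}
    for path in paths:
        *groups, leaf = path.split(".")
        src, dst = schema, pruned
        for name in groups:
            src = src[name]                  # KeyError if the path is unknown
            dst = dst.setdefault(name, {})
        dst[leaf] = src[leaf]
    return pruned


def leaves(schema, prefix=""):
    """Yield the dotted path of every leaf column in the schema."""
    for name, node in schema.items():
        if isinstance(node, dict):
            yield from leaves(node, prefix + name + ".")
        else:
            yield prefix + name


def render(schema, depth=1):
    """Render groups and leaves as indented message-type lines."""
    pad = "  " * depth
    lines = []
    for name, node in schema.items():
        if isinstance(node, dict):
            lines.append(f"{pad}optional group {name} {{")
            lines.extend(render(node, depth + 1))
            lines.append(f"{pad}}}")
        else:
            ptype, annotation = node
            suffix = f" ({annotation})" if annotation else ""
            lines.append(f"{pad}optional {ptype} {name}{suffix};")
    return lines


def message_type(schema, name="hive_schema"):
    return "\n".join([f"message {name} {{", *render(schema), "}"])


projection = prune(SCHEMA, ["url.valid", "referer.valid"])
print(message_type(projection))
print(len(list(leaves(projection))), "of", len(list(leaves(SCHEMA))), "leaf columns")
```

Under these assumptions the projection keeps 2 of the 16 leaf columns, which is the same pair of columns the flat query reads as urlvalid and referervalid.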



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
