[
https://issues.apache.org/jira/browse/HIVE-13873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chao Sun updated HIVE-13873:
----------------------------
Description:
This is the groundwork for nested column pruning in Hive for the Parquet
format. In this patch, we address the case of struct types in select
statements. In particular, for queries such as:
{code}
select s.a from tbl
{code}
where {{tbl}} has schema:
{code}
s:struct<a:int, b:boolean, c:array<int>>
{code}
only the field {{a}} should be scanned by the Parquet reader, while
fields {{b}} and {{c}} can be skipped.
Future work includes supporting other types of statements, as well as more
combinations of types (e.g., selecting fields of array type inside a struct
type).
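The intended effect on the read schema can be illustrated with a minimal Python sketch. It is not taken from the patch: the dict-based schema model and the {{prune}} helper are hypothetical and only show which fields would remain for a given set of access paths.

```python
from typing import Dict, List, Union

# Hypothetical schema model (an assumption for illustration): a struct is a
# dict of field name -> type; any other type is a plain string such as "int".
Schema = Union[str, Dict[str, "Schema"]]


def prune(schema: Schema, paths: List[List[str]]) -> Schema:
    """Keep only the fields reachable through the given access paths."""
    # A non-struct type, or an empty path, means the whole value is needed.
    if isinstance(schema, str) or any(len(p) == 0 for p in paths):
        return schema
    pruned: Dict[str, Schema] = {}
    for name, field_type in schema.items():
        # Collect the remainders of the paths that go through this field.
        rest = [p[1:] for p in paths if p[0] == name]
        if rest:
            pruned[name] = prune(field_type, rest)
    return pruned


table = {"s": {"a": "int", "b": "boolean", "c": "array<int>"}}

# "select s.a from tbl" touches a single access path: s, then a.
print(prune(table, [["s", "a"]]))
```

Selecting {{s}} as a whole corresponds to the path {{["s"]}}, in which case the sketch keeps the entire struct.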
was: Some columnar file formats such as Parquet also store the fields of a
struct type column by column, using the encoding described in the Google
Dremel paper. It is very common in big data for data to be stored in structs
while queries need only a subset of the fields in those structs. However, Hive
currently still reads the whole struct regardless of whether all fields are
selected. Therefore, pruning unwanted sub-fields in struct or nested fields at
file reading time would be a big performance boost for such scenarios.
> Support column pruning for struct fields in select statement
> ------------------------------------------------------------
>
> Key: HIVE-13873
> URL: https://issues.apache.org/jira/browse/HIVE-13873
> Project: Hive
> Issue Type: New Feature
> Components: Logical Optimizer
> Reporter: Xuefu Zhang
> Assignee: Ferdinand Xu
> Attachments: HIVE-13873.1.patch, HIVE-13873.2.patch,
> HIVE-13873.3.patch, HIVE-13873.4.patch, HIVE-13873.5.patch,
> HIVE-13873.6.patch, HIVE-13873.patch, HIVE-13873.wip.patch
>
>
> This is the groundwork for nested column pruning in Hive for the Parquet
> format. In this patch, we address the case of struct types in select
> statements. In particular, for queries such as:
> {code}
> select s.a from tbl
> {code}
> where {{tbl}} has schema:
> {code}
> s:struct<a:int, b:boolean, c:array<int>>
> {code}
> only the field {{a}} should be scanned by the Parquet reader, while
> fields {{b}} and {{c}} can be skipped.
> Future work includes supporting other types of statements, as well as more
> combinations of types (e.g., selecting fields of array type inside a struct
> type).