[ 
https://issues.apache.org/jira/browse/IMPALA-3841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17899138#comment-17899138
 ] 

Quanlong Huang commented on IMPALA-3841:
----------------------------------------

Prioritize this since late materialization is not supported yet when there are 
collection types:
{code:cpp}
// Currently, Collection Readers and scalar readers upon collection values
// are not supported for late materialization.
static bool DoesNotSupportLateMaterialization(ParquetColumnReader* 
column_reader) {
  return column_reader->IsCollectionReader() || column_reader->max_rep_level() 
> 0;
}{code}
https://github.com/apache/impala/blob/de6b902581c2f1a9f63ac93de90257e4a4e2f5d3/be/src/exec/parquet/hdfs-parquet-scanner.cc#L240-L244

> Avoid materializing nested collections if top-level predicates already 
> disqualify the row.
> ------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-3841
>                 URL: https://issues.apache.org/jira/browse/IMPALA-3841
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 2.5.0, Impala 2.6.0
>            Reporter: Alexander Behm
>            Priority: Critical
>              Labels: complextype, nested_types, parquet, performance
>
> Today, we fully materialize a row before evaluating the top-level conjuncts 
> when scanning Parquet. This includes materializing nested collections. We 
> should avoid materializing nested collections if top-level conjuncts already 
> discard the row. Our recent move to column-wise materialization makes this 
> improvement feasible (IMPALA-2736).
> To illustrate the problem, consider this query:
> {code}
> select * from customer c, c.orders o where c.id = 10
> {code}
> Even though we have a very selective predicate on the top-level customer, our 
> scanner will still fully materialize all orders of all customers. The 
> non-matches will be filtered, but we still pay the cost of materializing the 
> orders.
> The proposed improvement is to avoid materializing the orders of 
> non-qualifying customers.
> The improvement will several things:
> * Analyze and separate the top-level conjuncts into those that can be 
> evaluated before materializing the nested collections and those that require 
> nested collections to be materialized. In particular, we need to be careful 
> with our auto-generated !empty() predicates on nested collections.
> * Add a new SkipValues() or similar interface to the Parquet column readers 
> to advances the scanner without actually materializing values. If possible, 
> we should skip entire blocks.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to