guojingfeng has uploaded this change for review. (
http://gerrit.cloudera.org:8080/16697 )


Change subject: IMPALA-10310 Fix couldn't skip rows in parquet file on 
NextRowGroup
......................................................................

IMPALA-10310 Fix couldn't skip rows in parquet file on NextRowGroup

In practice we recommend that the HDFS block size align with the Parquet
row group size. However, some compute engines such as Spark use a default
Parquet row group size of 128MB, and if the ETL user doesn't change that
default, Spark will generate row groups that are smaller than the HDFS
block size. As a result, a single HDFS block may contain multiple Parquet
row groups.

In the planner stage, the length of an Impala-generated scan range may be
larger than the row group size, so a single scan range can contain multiple
row groups. In the current Parquet scanner, some internal state in the
Parquet column readers needs to be reset when moving to the next row group,
e.g. num_buffered_values_, the column chunk metadata, and the internal
state of the column chunk readers. However, the current_row_range_ offset
is not reset, which causes the error "Couldn't skip rows in file
hdfs://xxx" reported in IMPALA-10310.

This patch simply resets current_row_range_ to 0 when moving to the next
row group in the Parquet column readers, which fixes IMPALA-10310.
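
The effect of a stale current_row_range_ can be illustrated with a minimal,
self-contained sketch. This is not the Impala code: ToyColumnReader,
RowRange, StartRowGroup, ReadAllCandidateRows and
RowsReadFromSecondRowGroup are hypothetical names invented for the example,
and only the member name current_row_range_ is taken from the description
above. The sketch assumes current_row_range_ is an index into a per-row-group
list of candidate row ranges.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical inclusive range of row indexes inside one row group.
struct RowRange {
  int64_t first;
  int64_t last;
};

// Toy stand-in for a column reader; not Impala's actual class.
class ToyColumnReader {
 public:
  // reset_on_new_row_group models whether the one-line fix is applied.
  explicit ToyColumnReader(bool reset_on_new_row_group)
      : reset_on_new_row_group_(reset_on_new_row_group) {}

  // Called when the scanner moves to a new row group.
  void StartRowGroup(std::vector<RowRange> ranges) {
    ranges_ = std::move(ranges);
    // The fix: start again from the first range of the new row group.
    if (reset_on_new_row_group_) current_row_range_ = 0;
  }

  // Consumes every remaining candidate range and returns the rows covered.
  int64_t ReadAllCandidateRows() {
    int64_t rows = 0;
    while (current_row_range_ < ranges_.size()) {
      const RowRange& r = ranges_[current_row_range_];
      assert(r.first <= r.last);
      rows += r.last - r.first + 1;
      ++current_row_range_;
    }
    return rows;
  }

 private:
  bool reset_on_new_row_group_;
  std::vector<RowRange> ranges_;
  std::size_t current_row_range_ = 0;
};

// Reads two row groups with one reader, as happens when a single scan range
// covers several row groups, and returns the rows read from the second one.
int64_t RowsReadFromSecondRowGroup(bool reset_on_new_row_group) {
  ToyColumnReader reader(reset_on_new_row_group);
  reader.StartRowGroup(
      std::vector<RowRange>{RowRange{0, 9}, RowRange{20, 29}});
  reader.ReadAllCandidateRows();  // Leaves the index past the last range.
  reader.StartRowGroup(std::vector<RowRange>{RowRange{0, 4}});
  return reader.ReadAllCandidateRows();
}
```

Without the reset, the index left over from the first row group already
points past the end of the second row group's ranges, so the toy reader
finds no rows to read there. With the reset, it reads the second row
group's single range as expected.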

Testing:
- Ran all core tests offline.
- Manually tested all cases encountered in my production environment.

Change-Id: I964695cd53f5d5fdb6485a85cd82e7a72ca6092c
---
M be/src/exec/parquet/parquet-column-readers.cc
1 file changed, 1 insertion(+), 0 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/97/16697/1
--
To view, visit http://gerrit.cloudera.org:8080/16697
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I964695cd53f5d5fdb6485a85cd82e7a72ca6092c
Gerrit-Change-Number: 16697
Gerrit-PatchSet: 1
Gerrit-Owner: guojingfeng <[email protected]>
