Skye Wanderman-Milne has posted comments on this change.

Change subject: PREVIEW: Basic column-wise slot materialization in Parquet 
scanner.
......................................................................


Patch Set 2:

(6 comments)

A few initial comments; I'll look at more later. Feel free to just answer my 
questions if you don't want to post a new patch yet.

http://gerrit.cloudera.org:8080/#/c/2779/2/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 1155:   pool->AcquireData(decompressed_data_pool_.get(), false);
It looks like you added a lot of extra plumbing in order to change this to the 
scratch batch's pool instead of the main row batch's pool. Is there a reason you 
can't leave it as-is? I think keeping it as-is would make the memory management 
easier to understand, since it's more direct.


http://gerrit.cloudera.org:8080/#/c/2779/2/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

Line 444:   /// Reads data using 'column_readers' to materialize instances of 
the table-level
table-level or top-level? I think top-level is right, since you'll use this even 
for queries like "select item from tbl.array_col". I try to use "table-level" 
for table values and "top-level" for row-batch values (which might come from 
table or collection columns).


Line 452:   /// - the row group can be skipped based on runtime filters
I think this is more accurately "the file can be skipped based on runtime 
filters". I found "row group" confusing here since we stop evaluating any 
subsequent row groups in this case.


http://gerrit.cloudera.org:8080/#/c/2779/2/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

Line 248:     if (literal_count_ == 0) {
Did you mean to change this?


http://gerrit.cloudera.org:8080/#/c/2779/2/testdata/workloads/functional-query/queries/QueryTest/parquet-continue-on-error.test
File 
testdata/workloads/functional-query/queries/QueryTest/parquet-continue-on-error.test:

Line 21: 10,10
FWIW, I think technically we should only read the first 50 values of 'int_array', 
since that's what the column metadata says. Given that this behavior is already 
wrong and it's a weird error case, though, this seems OK to me. If anything this 
is an improvement, since we get the last value out, which is correct in the 
case below where the column metadata states there's an extra column.


Line 42: ====
Not your change, but can you add an ERRORS section with the "Column metadata 
states there are 50 values" error? (And the other errors if they show up; I'm not 
sure whether they're deduped in this case.)


-- 
To view, visit http://gerrit.cloudera.org:8080/2779
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I72a613fa805c542e39df20588fb25c57b5f139aa
Gerrit-PatchSet: 2
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Alex Behm <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Mostafa Mokhtar <[email protected]>
Gerrit-Reviewer: Skye Wanderman-Milne <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes
