Skye Wanderman-Milne has posted comments on this change. Change subject: PREVIEW: Basic column-wise slot materialization in Parquet scanner. ......................................................................
Patch Set 2: (6 comments) A few initial comments, will look more later. Feel free to just answer my questions if you don't want to post a new patch yet. http://gerrit.cloudera.org:8080/#/c/2779/2/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 1155: pool->AcquireData(decompressed_data_pool_.get(), false); It looks like you added a lot of extra plumbing in order to change this to the scratch batch's pool instead of the main rowbatch's pool. Is there a reason you can't leave it as-is? I think this will make it easier to understand the mem mgmt since it's more direct. http://gerrit.cloudera.org:8080/#/c/2779/2/be/src/exec/hdfs-parquet-scanner.h File be/src/exec/hdfs-parquet-scanner.h: Line 444: /// Reads data using 'column_readers' to materialize instances of the table-level table-level or top-level? I think top-level right, you'll use this even for queries like "select item from tbl.array_col"? I try to use table-level = table-level values and top-level = rowbatch values (which might be from table or collection columns) Line 452: /// - the row group can be skipped based on runtime filters I think this is more accurately "the file can be skipped based on runtime filters". I found "row group" confusing here since we stop evaluating any subsequent row groups in this case. http://gerrit.cloudera.org:8080/#/c/2779/2/be/src/util/rle-encoding.h File be/src/util/rle-encoding.h: Line 248: if (literal_count_ == 0) { Did you mean to change this? http://gerrit.cloudera.org:8080/#/c/2779/2/testdata/workloads/functional-query/queries/QueryTest/parquet-continue-on-error.test File testdata/workloads/functional-query/queries/QueryTest/parquet-continue-on-error.test: Line 21: 10,10 FWIW I think technically we should only read the first 50 values of 'int_array' since that's what the column metadata says. Given that this behavior is already wrong and it's a weird error case though, this seems OK to me. If anything this is an improvement since we get the last value out, which is correct in the below case where the column metadata states there's an extra column. Line 42: ==== Not your change, but can you add an ERRORS section with the "Column metadata states there are 50 values" error? (And the other errors if they show up, not sure if they're deduped in this case.) -- To view, visit http://gerrit.cloudera.org:8080/2779 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I72a613fa805c542e39df20588fb25c57b5f139aa Gerrit-PatchSet: 2 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Alex Behm <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Mostafa Mokhtar <[email protected]> Gerrit-Reviewer: Skye Wanderman-Milne <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-HasComments: Yes
