Tim Armstrong has posted comments on this change.

Change subject: PREVIEW: Basic column-wise slot materialization in Parquet scanner.
......................................................................
Patch Set 2:

(10 comments)

Some initial comments. I'm not sure that I fully understand some aspects of the changes, so I'll have to do another pass.

http://gerrit.cloudera.org:8080/#/c/2779/2/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 273: /// once. Returns the number of values actually materialized in *num_values.
Can you document how errors are handled in these functions? E.g., what happens if an error is hit partway through the batch? The documentation was missing before, but I think it's even more important now.

Line 706: RowGroupAtEnd
Is this somewhat redundant? We should know the number of rows, so we can compute max_values as min(max_values, rows left in the group). And if there's an error, continue_execution should take care of that. I figure you'll probably get to this when doing the perf optimisations, but it would be good to remove any redundancy in the logic now so it's easier to understand.

Line 718: val_count
Maybe convert the loop into for (int val_count = 0; val_count < max_values; ++val_count) and just break out on exceptional conditions. I think it will make it easier to streamline the loop down the track. The LIKELY annotation should be redundant (if it isn't already) once the loop is in a reasonably canonical form.

Line 953: HdfsParquetScanner
Does this need to be nested inside the scanner? It seems unnecessary if it's local to the .cc file.

Line 979: state->batch_size() * tuple_byte_size
The cast is in the wrong place to prevent overflow. I think you want a calculation similar to the one I added in a recent patch to compute the batch size and allocate the memory. It probably makes sense to change the RowBatch interface so the logic to size the batch is the same across all the scanners: IMPALA-3105: avoid overrunning allocated tuple buffer.

Line 1752: if (UNLIKELY(
It seems easy to hoist these checks out of the hot loop, and doing so should make the code more readable, so maybe consider that in this patch set.
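To clarify the Line 718 suggestion above, here is a rough sketch of the canonical loop shape I mean; the names (ReadValue, MaterializeBatch) are invented stand-ins, not the scanner's actual code:

```cpp
// Sketch only: a plain counted for-loop over val_count, breaking out
// on exceptional conditions, so the compiler sees the trip count
// directly and branch-prediction hints become unnecessary.
static bool ReadValue(int i) { return i < 7; }  // stand-in for the column reader

int MaterializeBatch(int max_values) {
  int val_count = 0;  // declared outside the loop so the count survives the break
  for (; val_count < max_values; ++val_count) {
    if (!ReadValue(val_count)) break;  // error / end-of-data: bail out
  }
  return val_count;  // the scanner would report this via *num_values
}
```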
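To spell out the Line 979 overflow issue: the widening cast has to sit on an operand, not on the product. A minimal illustration (the function name is hypothetical, not Impala's actual code):

```cpp
#include <cstdint>

// Cast placement matters: casting the product,
//   static_cast<int64_t>(batch_size * tuple_byte_size)  // too late
// widens only after the int-by-int multiply has already overflowed.
// Widening an operand first makes the multiply happen in 64 bits:
int64_t TupleBufferSize(int batch_size, int tuple_byte_size) {
  return static_cast<int64_t>(batch_size) * tuple_byte_size;
}
```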
Line 1912: ReadCollectionItem(column_readers, materialize_tuple, pool, tuple);
Would it make sense to switch this over to the column-wise version now? I know you're probably going to do it eventually, but the intermediate state of the code might be easier to understand if it isn't a mix of column-wise and row-wise materialisation.

Line 1917: if (ExecNode::EvalConjuncts(&conjunct_ctxs[0], conjunct_ctxs.size(), row)) {
Can we have conjuncts on nested collection items?

Line 2564: return Status::OK();
?

Line 2893: return Status::OK();
?

--
To view, visit http://gerrit.cloudera.org:8080/2779
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I72a613fa805c542e39df20588fb25c57b5f139aa
Gerrit-PatchSet: 2
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Alex Behm <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Mostafa Mokhtar <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes
