Alex Behm has posted comments on this change.

Change subject: PREVIEW: Basic column-wise slot materialization in Parquet scanner.
......................................................................
Patch Set 2:

(17 comments)

http://gerrit.cloudera.org:8080/#/c/2779/2/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 273: /// once. Returns the number of values actually materialized in *num_values.
> Can you document how errors are handled in these functions. E.g. what happe
Done. Let me know if it needs further clarification.

Line 706: while (LIKELY(val_count < max_values && !RowGroupAtEnd() && continue_execution)) {
> Is this somewhat redundant? We should know the # of rows so can compute max
A row group consists of multiple data pages, and each data page contains metadata about how many values it has. We read and process one data page at a time, so we only know how many values are left in the current data page; the number of rows left in the row group is unknown.

It would be possible to cap the number of returned values at the number of values remaining in the current data page, but that would complicate the logic in the caller, because the ReadValueBatch() calls on the different column readers could each return a different num_values that would need to be reconciled somehow. I think the logic is simpler if we guarantee that this function materializes max_values values unless the end of the row group has been reached.

We could also do fancier things like reading the metadata of all data pages to get the total number of values in the row group, but that seems like taking it a little too far for this patch.

Line 718: ++val_count;
> Maybe convert the loop into for (int val_count = 0; val_count < max_values;
Agreed, but see the explanation/caveats above for why the simple loop is actually hard.

Line 953: struct HdfsParquetScanner::ScratchTupleBatch {
> Does this need to be nested inside the scanner? It seems unnecessary if it'
Agreed. Done.

Line 979: int64_t size = static_cast<int64_t>(state->batch_size() * tuple_byte_size);
> The cast is in the wrong place to prevent overflow.
Thanks for pointing that out. Done.

Line 1155: pool->AcquireData(decompressed_data_pool_.get(), false);
> It looks like you added a lot of extra plumbing in order to change this to
That's correct and indeed unfortunate. However, just attaching the memory to the output batch here is not correct. Consider the following sequence of events:

1. Populate the scratch batch with 1024 rows.
2. Eval conjuncts and transfer 512 rows to the output batch.
3. Populate the scratch batch with 1024 values.
4. Eval conjuncts and transfer 512 rows to the output batch.
5. The output batch is full, so return it.

After 4, there could still be up to 512 valid tuples in the scratch batch, but we just cleared their memory because we attached it to the output batch. Imo, the lifetime of tuple memory needs to be tied to the lifetime of the scratch batch (and not the output batch), because that's where the memory gets allocated.

Line 1728: if (EvalRuntimeFilters(output_row) && ExecNode::EvalConjuncts(
> Consider creating a short circuit if there are not RuntimeFilters of Conjuc
Added the short circuit and slightly restructured this loop. Let me know if the code is worse than before.

Line 1752: if (UNLIKELY(
> It seems easy to hoist these checks out of the hot loop and should make the
Sorry, I'm not following your suggestion. Can you clarify? There's a modulo in here that relies on a value being updated in every iteration of this loop. Are you suggesting we maintain the runtime filter stats in a separate loop after 1733?
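For illustration, here is a minimal, self-contained sketch of the short circuit and per-row evaluation discussed in the two comments above. All names (TupleRowStub, RowPredicate, kStatsCheckInterval) are hypothetical stand-ins, not the actual classes or loop in this patch:

// Sketch only: shows the short-circuit idea, not Impala's real code.
#include <cstdint>
#include <functional>
#include <vector>

struct TupleRowStub { int64_t id; };  // stand-in for a materialized row

using RowPredicate = std::function<bool(const TupleRowStub&)>;

// Copies surviving rows from 'scratch' into 'output'; returns the number kept.
int TransferScratchRows(const std::vector<TupleRowStub>& scratch,
                        const std::vector<RowPredicate>& runtime_filters,
                        const std::vector<RowPredicate>& conjuncts,
                        std::vector<TupleRowStub>* output) {
  // Short circuit: with no runtime filters and no conjuncts every row
  // survives, so the per-row evaluation (and the periodic stats check that
  // relies on a counter updated in every iteration) is skipped entirely.
  if (runtime_filters.empty() && conjuncts.empty()) {
    output->insert(output->end(), scratch.begin(), scratch.end());
    return static_cast<int>(scratch.size());
  }

  constexpr int kStatsCheckInterval = 16;  // hypothetical stand-in for the modulo check
  int rows_seen = 0;
  int rows_kept = 0;
  for (const TupleRowStub& row : scratch) {
    ++rows_seen;
    bool keep = true;
    for (const RowPredicate& filter : runtime_filters) {
      if (!filter(row)) { keep = false; break; }
    }
    if (keep) {
      for (const RowPredicate& conjunct : conjuncts) {
        if (!conjunct(row)) { keep = false; break; }
      }
    }
    if (keep) {
      output->push_back(row);
      ++rows_kept;
    }
    if (rows_seen % kStatsCheckInterval == 0) {
      // Placeholder for the periodic runtime-filter bookkeeping that the
      // modulo in the real loop performs.
    }
  }
  return rows_kept;
}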
Line 1912: ReadCollectionItem(column_readers, materialize_tuple, pool, tuple);
> Would it make sense to switch this over to the column-wise version now? I k
That's a fair point, so let's discuss it. I had decided against applying the same pattern to nested collections for the following reasons:

* Perf improvements to scanning nested collections will affect fewer users at this point, so they are less important.
* Applying the same pattern is not straightforward, especially with respect to memory management. Following the same pattern as with the top-level tuple means allocating full scratch batches for every collection slot and then doing a similar TransferToOutputCollection() type of thing. However, nested collections do not have the indirection of a row as with row batches, so we'd need to copy the surviving tuples into the final output collection, or come up with a different mechanism (a rough sketch of the copy-based variant follows at the end of these comments). In some sense the copy can help us, because we can then throw away the scratch memory, but it's not clear whether we win or lose on perf.

The collection case seems different enough, and less important, that I'd defer it until later. What do you think?

Line 1917: if (ExecNode::EvalConjuncts(&conjunct_ctxs[0], conjunct_ctxs.size(), row)) {
> Can we have conjuncts on nested collection items?
Yes.

Line 2564: return Status::OK();
> ?
Cruft from experimentation, sorry. Reverted.

Line 2893: return Status::OK();
> ?
Cruft from experimentation, sorry. Reverted.

http://gerrit.cloudera.org:8080/#/c/2779/2/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

Line 444: /// Reads data using 'column_readers' to materialize instances of the table-level
> table-level or top-level? I think top-level right, you'll use this even for
Done.

Line 452: /// - the row group can be skipped based on runtime filters
> I think this is more accurately "the file can be skipped based on runtime f
Good point. Done.

http://gerrit.cloudera.org:8080/#/c/2779/2/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

Line 248: if (literal_count_ == 0) {
> Did you mean to change this?
Yes. Added a comment.

http://gerrit.cloudera.org:8080/#/c/2779/2/testdata/workloads/functional-query/queries/QueryTest/parquet-continue-on-error.test
File testdata/workloads/functional-query/queries/QueryTest/parquet-continue-on-error.test:

Line 21: 10,10
> FWIW I think technically we should only read the first 50 values of 'int_ar
Should be easy to implement that behavior if so desired. Does the spec say anything about what we should do in this case? I think there's an argument to be made for either behavior, but I feel like users would generally prefer to get all their data out of corrupt/malformed files. What do you think?

Line 42: ====
> Not your change, but can you add an ERRORS section with the "Column metadat
Done.
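A rough sketch of the copy-based approach mentioned in the reply to Line 1912: tuples for a nested collection are materialized into a reusable scratch buffer, conjuncts are evaluated, and only the survivors are copied into the collection's output buffer so the scratch memory can be recycled. All names and types here are hypothetical and greatly simplified; this is not Impala's actual collection machinery.

// Sketch only, under the assumptions stated above.
#include <cstdint>
#include <cstring>
#include <functional>
#include <vector>

using TuplePredicate = std::function<bool(const uint8_t*)>;

// Appends surviving fixed-size tuples from 'scratch' to 'output_collection'
// and returns the number of tuples copied.
int CopySurvivorsToCollection(const std::vector<uint8_t>& scratch,
                              int tuple_byte_size, int num_tuples,
                              const std::vector<TuplePredicate>& conjuncts,
                              std::vector<uint8_t>* output_collection) {
  int num_copied = 0;
  for (int i = 0; i < num_tuples; ++i) {
    const uint8_t* tuple =
        scratch.data() + static_cast<size_t>(i) * tuple_byte_size;
    bool keep = true;
    for (const TuplePredicate& conjunct : conjuncts) {
      if (!conjunct(tuple)) { keep = false; break; }
    }
    if (!keep) continue;
    // There is no row indirection as with the top-level row batch, so the
    // surviving tuple bytes are copied into the collection's backing buffer.
    size_t old_size = output_collection->size();
    output_collection->resize(old_size + tuple_byte_size);
    std::memcpy(output_collection->data() + old_size, tuple, tuple_byte_size);
    ++num_copied;
  }
  return num_copied;
}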
--
To view, visit http://gerrit.cloudera.org:8080/2779
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I72a613fa805c542e39df20588fb25c57b5f139aa
Gerrit-PatchSet: 2
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Alex Behm <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Mostafa Mokhtar <[email protected]>
Gerrit-Reviewer: Skye Wanderman-Milne <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes