Alex Behm has posted comments on this change.

Change subject: PREVIEW: Basic column-wise slot materialization in Parquet 
scanner.
......................................................................


Patch Set 2:

(17 comments)

http://gerrit.cloudera.org:8080/#/c/2779/2/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 273:   /// once. Returns the number of values actually materialized in 
*num_values.
> Can you document how errors are handled in these functions. E.g. what happe
Done. Let me know if it needs further clarification.


Line 706:     while (LIKELY(val_count < max_values && !RowGroupAtEnd() && 
continue_execution)) {
> Is this somewhat redundant? We should know the # of rows so can compute max
A row group consists of multiple data pages. Each data page contains metadata 
about how many values it holds. We read and process one data page at a time, so 
we only know how many values are left in the current data page; the "rows left 
in the row group" quantity is unknown.

It would be possible to cap the number of returned values at the number of 
values remaining in the current data page, but that would complicate the logic 
in the caller: the ReadValueBatch() calls on the different column readers could 
each return a different *num_values that would then need to be reconciled 
somehow.

I think the logic is simpler if we guarantee that this function materializes 
max_values values unless the end of the row group has been reached.

We could also do some fancier stuff like reading the metadata of all data pages 
to get the total number of values in the row group, but that seems like taking 
it a little too far for this patch.
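
To make the shape of this concrete, here is roughly what the loop looks like. 
This is only a sketch, not the literal patch; names like num_buffered_values_, 
NextPage() and ReadValue() are approximations of the surrounding code, and 
tuple_mem/tuple_byte_size stand in for the scratch batch memory and tuple size:

  // Materialize values until 'max_values' is reached or the row group is
  // exhausted, crossing data page boundaries as needed.
  uint8_t* next_tuple = tuple_mem;
  int val_count = 0;
  while (val_count < max_values && !RowGroupAtEnd() && continue_execution) {
    if (num_buffered_values_ == 0) {
      // The current data page is exhausted. The next page's header tells us
      // how many values it holds, but we never learn up front how many rows
      // remain in the whole row group.
      continue_execution = NextPage();
      continue;
    }
    Tuple* tuple = reinterpret_cast<Tuple*>(next_tuple);
    continue_execution = ReadValue(pool, tuple);
    next_tuple += tuple_byte_size;
    ++val_count;
  }
  *num_values = val_count;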


Line 718:       ++val_count;
> Maybe convert the loop into for (int val_count = 0; val_count < max_values;
Agreed, but see the explanation/caveats above for why the simple loop is 
actually hard.


Line 953: struct HdfsParquetScanner::ScratchTupleBatch {
> Does this need to be nested inside the scanner? It seems unnecessary if it'
Agreed. Done.


Line 979:     int64_t size = static_cast<int64_t>(state->batch_size() * 
tuple_byte_size);
> The cast is in the wrong place to prevent overflow.
Thanks for pointing that out. Done.
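
For reference, moving the cast onto one operand before the multiply (so the 
product is computed in 64 bits) looks roughly like:

  // Cast before multiplying so the product cannot overflow a 32-bit int.
  int64_t size = static_cast<int64_t>(state->batch_size()) * tuple_byte_size;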


Line 1155:   pool->AcquireData(decompressed_data_pool_.get(), false);
> It looks like you added a lot of extra plumbing in order to change this to 
That's correct and indeed unfortunate. However, just attaching the memory to 
the output batch here is not correct. Consider the following sequence of events:

1. Populate the scratch batch with 1024 rows.
2. Eval conjuncts and transfer 512 rows to the output batch.
3. Populate the scratch batch with another 1024 rows.
4. Eval conjuncts and transfer 512 rows to the output batch.
5. The output batch is full, so return it.

After step 4 there can still be up to 512 valid tuples left in the scratch 
batch, but the memory backing them was just handed off to the output batch, so 
it goes away with the output batch while the scratch batch still needs it.

IMO the lifetime of tuple memory needs to be tied to the lifetime of the scratch 
batch (and not the output batch), because that's where the memory gets allocated.
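
Roughly the ownership flow I'm going for (a sketch only; names like 
scratch_batch_->mem_pool() and AtEnd() are approximate, not the literal patch):

  // Decompressed page buffers follow the scratch batch, not the output batch:
  // when a data page is done, hand its buffers to the scratch batch's pool.
  scratch_batch_->mem_pool()->AcquireData(decompressed_data_pool_.get(), false);

  // Only once every tuple in the scratch batch has been evaluated (and the
  // survivors linked into output rows) is that memory passed along, so
  // leftover scratch tuples never point at freed memory.
  if (scratch_batch_->AtEnd()) {
    output_batch->tuple_data_pool()->AcquireData(scratch_batch_->mem_pool(), false);
  }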


Line 1728:     if (EvalRuntimeFilters(output_row) && ExecNode::EvalConjuncts(
> Consider creating a short circuit if there are not RuntimeFilters of Conjuc
Added the short circuit and slightly restructured this loop. Let me know if the 
code is worse than before.
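
For context, the short circuit is roughly of this shape (only a sketch of the 
idea, not the actual change; scratch_batch, num_tuples and GetTuple() are 
approximate names):

  // No runtime filters and no conjuncts: every scratch tuple survives, so
  // link them all into the output batch without per-row evaluation.
  if (filter_ctxs.empty() && conjunct_ctxs.empty()) {
    for (int i = 0; i < scratch_batch->num_tuples; ++i) {
      TupleRow* output_row = output_batch->GetRow(output_batch->AddRow());
      output_row->SetTuple(0, scratch_batch->GetTuple(i));
      output_batch->CommitLastRow();
    }
    return scratch_batch->num_tuples;
  }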


Line 1752:     if (UNLIKELY(
> It seems easy to hoist these checks out of the hot loop and should make the
Sorry, I'm not following your suggestion. Can you clarify? There's a modulo in 
here that relies on a value being updated in every iteration of this loop.

Are you suggesting we maintain the runtime filter stats in a separate loop 
after line 1733?


Line 1912:           ReadCollectionItem(column_readers, materialize_tuple, 
pool, tuple);
> Would it make sense to switch this over to the column-wise version now? I k
That's a fair point, so let's discuss it. I had decided against applying the 
same pattern to nested collections for the following reasons:
* Perf improvements to scanning nested collections affect fewer users at this 
point, so they are less important right now.
* Applying the same pattern is not straightforward, especially with respect to 
memory management. Following the same pattern as for the top-level tuple means 
allocating full scratch batches for every collection slot and then doing a 
similar TransferToOutputCollection() type of thing. However, nested collections 
do not have the row indirection that row batches have, so we'd need to copy the 
surviving tuples into the final output collection, or come up with a different 
mechanism. In some sense the copy can help us, because we can then throw away 
the scratch memory, but it's not clear whether we'd win or lose on perf.

The collection case seems different enough, and less important, so I'd prefer 
to defer it until later.

What do you think?


Line 1917:         if (ExecNode::EvalConjuncts(&conjunct_ctxs[0], 
conjunct_ctxs.size(), row)) {
> Can we have conjuncts on nested collection items?
Yes.


Line 2564:       return Status::OK();
> ?
Cruft from experimentation, sorry. Reverted.


Line 2893:       return Status::OK();
> ?
Cruft from experimentation, sorry. Reverted.


http://gerrit.cloudera.org:8080/#/c/2779/2/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

Line 444:   /// Reads data using 'column_readers' to materialize instances of 
the table-level
> table-level or top-level? I think top-level right, you'll use this even for
Done.


Line 452:   /// - the row group can be skipped based on runtime filters
> I think this is more accurately "the file can be skipped based on runtime f
Good point. Done.


http://gerrit.cloudera.org:8080/#/c/2779/2/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

Line 248:     if (literal_count_ == 0) {
> Did you mean to change this?
Yes. Added comment.


http://gerrit.cloudera.org:8080/#/c/2779/2/testdata/workloads/functional-query/queries/QueryTest/parquet-continue-on-error.test
File 
testdata/workloads/functional-query/queries/QueryTest/parquet-continue-on-error.test:

Line 21: 10,10
> FWIW I think technically we should only read the first 50 values of 'int_ar
Should be easy to implement that behavior if so desired. Does the spec say 
anything about what we should do in this case? I think there's an argument to 
be made for either behavior, but I feel like users would generally prefer to 
get all their data out of corrupt/malformed files. What do you think?


Line 42: ====
> Not your change, but can you add an ERRORS section with the "Column metadat
Done.


-- 
To view, visit http://gerrit.cloudera.org:8080/2779
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I72a613fa805c542e39df20588fb25c57b5f139aa
Gerrit-PatchSet: 2
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Alex Behm <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Mostafa Mokhtar <[email protected]>
Gerrit-Reviewer: Skye Wanderman-Milne <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes
