[Impala-CR](cdh5-trunk) IMPALA-3286: Prefetching for PHJ probing.

Dan Hecht (Code Review) Wed, 11 May 2016 21:11:25 -0700

Dan Hecht has posted comments on this change.

Change subject: IMPALA-3286: Prefetching for PHJ probing.
......................................................................



Patch Set 3:

(41 comments)

Didn't quite get through it all yet, but here's the first set of comments. 
mostly renaming to help make it easy to follow the code, nothing major in terms 
of structure.

However, later on I do think we should try reorganizing things so that we have 
a:

struct HashTableParams {
   uint32_t hash;
   bool is_null;
   uint8_t expr_values_[];
}

And then make what you call the ValuesBuffer be an array of those things that 
only the exec nodes know about, and then the hash table interfaces just takes a 
HashTableParams* as input.  Basically, what we had talked about as the follow 
on cleanup. But like I said, let's try that after we get this approach checked 
in.

http://gerrit.cloudera.org:8080/#/c/2959/3/be/src/exec/hash-table.cc
File be/src/exec/hash-table.cc:

Line 172: non_var_len = values_buffer_.var_result_begin();
it's confusing to switch polarities like this. either call the local variable 
var_result_begin or change the field to be fixed_len. and actually, calling it 
non_var_len or fixed_len doesn't really make sense since it's -1 when there is 
no var stuff but there is still fixed stuff.


Line 178: bytes
don't use bytes here, just like mentioned in hash-table.h


http://gerrit.cloudera.org:8080/#/c/2959/3/be/src/exec/hash-table.h
File be/src/exec/hash-table.h:

Line 169: Last
last no longer makes sense because it's not the last row that was processed.  
How about just ExprValue(int expr_idx)? or Just make the code get the value 
from the cache directly.


Line 174: Last
same comment


Line 211: ValuesBuffer
How about ExprValuesCache?  In the comments, you already refer to it as a 
cache, and "values" seems a little generic.


Line 228: Move the pointers to point the next entries in the arrays of cached 
values
Advance the pointers to the next row's cached state.


Line 258: IsRowNull
IsRowIgnored()?


Line 269: ExprHashValue
ExprValuesHash (i.e. it's the hash over the set of values for the row).


Line 272: SetExprHashValue
SetExprValuesHash


Line 275: ExprValuesBufferLoc
ExprValuePtr(int expr_idx). not plural since it gives back one expr value, not 
the whole row.


Line 278: ExprValuesNullByteLoc
ExprValueNullPtr


Line 290: results_buffer_size_
how about calling this: expr_values_bytes_per_row_?


Line 293: build_expr_size_
num_build_exprs_


Line 300:     bool is_read_;
Rather than having different modes (which then requires a branch at every 
NextRow()) you could just have PrepareForRead() remember the ptr 
expr_values_end_ = cur_expr_values_ + expr_values_bytes_per_row_
and then make AtEnd() { return cur_expr_values_ == expr_values_end_; }
Then you wouldn't even need num_rows_ either, saving another increment. (and 
num_rows_ is kinda hard to explain).


Line 304: expr_values_buffer_
how about cur_expr_values_, to signal that this is the current values (i.e. 
this field changes as we iterate).


Line 308: expr_values_null_byte_
cur_expr_values_null_


Line 312: expr_hash_value_
cur_expr_values_hash_ (i.e. it's the hash for the current row's expr_values).


Line 316: expr_values_buffer_array_
expr_values_array_


Line 321: uint8_t
why not make this an array of bool to be clear that it's one value per, rather 
than a bitset or something.


Line 321: expr_values_null_byte_array_
expr_values_null_bool_array_


Line 324: expr_hash_value_array_
expr_values_hash_array_ for consistency


Line 328: null_bitmap_
this is kinda misleading since it doesn't really tell you if the row was null.  
How about: ignored_rows_?


Line 331:     /// be a build row (during Insert()) or probe row (during 
FindProbeRow()).
this comment is misplaced


Line 332: expr_values_buffer_offsets_
expr_values_offsets_


Line 337: var_result_begin_
var_result_offset_ (to be consistent with expr_values_offsets_ since this is an 
offset into the current expr values right?)


Line 389:   void IR_ALWAYS_INLINE ResetPointers();
seems like this should be deleted


Line 392: MAX_EXPR_VALUES_ARRAY_SIZE
move to ExprValuesCache?


Line 487:   };
if these are the values for the query option, you should define this in thrift 
so it can be used by the query option parsing code.


http://gerrit.cloudera.org:8080/#/c/2959/3/be/src/exec/hash-table.inline.h
File be/src/exec/hash-table.inline.h:

Line 41: results_buffer_size_
how about calling this: expr_values_bytes_per_row_


Line 98:   uint32_t hash = ht_ctx->values_buffer()->ExprHashValue();
might as well move this into Probe() and remove from all the callers.


Line 326:     DCHECK_EQ((bucket_idx_ & ~(table_->num_buckets_ - 1)), 0);
why not move this DCHECK into PrefetchBucket()?


http://gerrit.cloudera.org:8080/#/c/2959/3/be/src/exec/partitioned-hash-join-node-ir.cc
File be/src/exec/partitioned-hash-join-node-ir.cc:

Line 347:     ++i;
in the other loop you used ->CurIdx() instead.


http://gerrit.cloudera.org:8080/#/c/2959/3/be/src/exec/partitioned-hash-join-node.cc
File be/src/exec/partitioned-hash-join-node.cc:

Line 653:   BufferedTupleStream* build_rows[PARTITION_FANOUT];
why do we do this?


Line 986:     RETURN_IF_ERROR(QueryMaintenance(state));
don't we still need to free build side local allocations more often because of 
the problem we discussed yesterday?


http://gerrit.cloudera.org:8080/#/c/2959/3/be/src/exec/partitioned-hash-join-node.h
File be/src/exec/partitioned-hash-join-node.h:

Line 166: BufferedTupleStream** build_rows
what is this?


Line 260: HashTableCtx* ctx,
        :       HashTableCtx::ValuesBuffer* vals_buf
do we have to pass both?


Line 273:   /// Prefetches the hash table buckets of the partition which 'hash' 
maps to.
is that true? doesn't it just prefetch one bucket? i.e.:

Prefetches the bucket in the partition's hash table for the given 'hash'.


Line 285: int
can't we use the enum type rather than int?


Line 285:       int prefetch_mode);
prefetch_mode first since it's an in parameter.


Line 289:       HashTableCtx* ht_ctx, Status* status, int prefetch_mode);
same, keep in params and out params together.


http://gerrit.cloudera.org:8080/#/c/2959/3/be/src/exec/partitioned-hash-join-node.inline.h
File be/src/exec/partitioned-hash-join-node.inline.h:

Line 29: PrepareForWrite
SetAtEnd()


-- 
To view, visit http://gerrit.cloudera.org:8080/2959
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ib42b93d99d09c833571e39d20d58c11ef73f3cc0
Gerrit-PatchSet: 3
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Michael Ho <[email protected]>
Gerrit-Reviewer: Dan Hecht <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Mostafa Mokhtar <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) IMPALA-3286: Prefetching for PHJ probing.

Reply via email to