Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
......................................................................
Patch Set 13:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/4863/13/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

Line 66: tsink.table_sink.hdfs_table_sink.skip_header_line_count :
if you break this up, then ...\n ? ...\n : 0 is better

Line 212: GetHashTblKey(NULL, dynamic_partition_key_value_ctxs, &key);
we're standardizing on nullptr now

Line 266: // The rows of this batch may span across multiple files. We repeatedly pass the row
remove "across"

Line 273: RETURN_IF_ERROR(output_partition->writer->AppendRowBatch(
rename this function, it's not appending the entire row batch (AppendRows?)

Line 283: Status HdfsTableSink::WriteClusteredRowBatch(RuntimeState* state, RowBatch* batch) {
to speed this up, you should:
- keep a current_clustered_partition_
- also a current_clustered_partition_key_
- check whether the partition key of the last row in the batch matches current_partition_key_
- if so, skip the for loop and write the entire batch out

also: change WriteRowsToPartition to take the index vector separately; pass a nullptr in the case where the entire batch gets written to the same partition.

apply that same optimization to writer::AppendRowBatch (or however you rename that). no need to spend time checking some vector in that case.

http://gerrit.cloudera.org:8080/#/c/4863/13/be/src/exec/hdfs-table-sink.h
File be/src/exec/hdfs-table-sink.h:

Line 170: const HdfsPartitionDescriptor& partition_descriptor, const TupleRow* current_row,
mention that current_row provides the partition key values.

Line 181: /// GetHashTblKey(). Maps to an OutputPartition, which are owned by the object pool and
"pool,"

Line 189: void GetHashTblKey(const TupleRow* current_row, const std::vector<ExprContext*>& ctxs,
current_row -> row (it doesn't matter to this function that the row is 'current'). same for InitOutputPartition.

Line 213: /// Appends all rows in batch to the temporary Hdfs files corresponding to partitions.
what "corresponding to partitions" refers to is unclear.

Line 214: /// The input must be ordered by the partition key expressions.
the rows are expected to be ordered...?

--
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 13
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Marcel Kornacker <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Zoltan Ivanfi <[email protected]>
Gerrit-HasComments: Yes
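For reference, the fast path suggested for the Line 283 comment (check the last row's partition key; if it matches, skip the per-row loop and pass a nullptr index vector) might be sketched roughly as follows. This is a simplified standalone sketch, not Impala code: ClusteredWriter, Row, and PartitionKey are hypothetical stand-ins for RowBatch/OutputPartition, and only the control flow mirrors the review suggestion.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins for Impala's RowBatch / OutputPartition machinery.
using PartitionKey = std::string;
struct Row { PartitionKey key; int value; };

class ClusteredWriter {
 public:
  // Takes the index vector separately, per the review suggestion: a nullptr
  // means the entire batch goes to one partition, so no per-row indexing.
  void WriteRowsToPartition(const PartitionKey& key,
                            const std::vector<Row>& batch,
                            const std::vector<int>* row_indexes) {
    std::vector<Row>& part = partitions_[key];
    if (row_indexes == nullptr) {
      part.insert(part.end(), batch.begin(), batch.end());
    } else {
      for (int i : *row_indexes) part.push_back(batch[i]);
    }
  }

  // Input is ordered by partition key. If the last row's key matches the
  // first row's key, the whole batch belongs to one partition and the
  // per-row loop is skipped entirely (the suggested fast path).
  void WriteClusteredRowBatch(const std::vector<Row>& batch) {
    if (batch.empty()) return;
    if (batch.back().key == batch.front().key) {
      WriteRowsToPartition(batch.front().key, batch, /*row_indexes=*/nullptr);
      return;
    }
    // Slow path: group consecutive runs of identical keys.
    std::vector<int> indexes;
    PartitionKey current = batch.front().key;
    for (int i = 0; i < static_cast<int>(batch.size()); ++i) {
      if (batch[i].key != current) {
        WriteRowsToPartition(current, batch, &indexes);
        indexes.clear();
        current = batch[i].key;
      }
      indexes.push_back(i);
    }
    WriteRowsToPartition(current, batch, &indexes);
  }

  std::map<PartitionKey, std::vector<Row>> partitions_;
};
```

Because clustered input tends to produce long runs of the same key, most batches take the nullptr branch and avoid both the per-row key comparison and the index-vector bookkeeping.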
