Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
......................................................................
Patch Set 13:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/4863/13/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

Line 66: tsink.table_sink.hdfs_table_sink.skip_header_line_count :
if you break this up, then ...\n ? ...\n : 0 is better

Line 212: GetHashTblKey(NULL, dynamic_partition_key_value_ctxs, &key);
we're standardizing on nullptr now

Line 266: // The rows of this batch may span across multiple files. We repeatedly pass the row
remove "across"

Line 273: RETURN_IF_ERROR(output_partition->writer->AppendRowBatch(
rename this function, it's not appending the entire row batch (AppendRows?)

Line 283: Status HdfsTableSink::WriteClusteredRowBatch(RuntimeState* state, RowBatch* batch) {
to speed this up, you should:
- keep a current_clustered_partition_
- also a current_clustered_partition_key_
- check whether the partition key of the last row in the batch matches current_partition_key_
- if so, skip the for loop and write the entire batch out

also: change WriteRowsToPartition to take the index vector separately; pass a nullptr in the case where the entire batch gets written to the same partition.

apply that same optimization to writer::AppendRowBatch (or however you rename that). no need to spend time checking some vector in that case.

http://gerrit.cloudera.org:8080/#/c/4863/13/be/src/exec/hdfs-table-sink.h
File be/src/exec/hdfs-table-sink.h:

Line 170: const HdfsPartitionDescriptor& partition_descriptor, const TupleRow* current_row,
mention that current_row provides the partition key values.

Line 181: /// GetHashTblKey(). Maps to an OutputPartition, which are owned by the object pool and
"pool,"

Line 189: void GetHashTblKey(const TupleRow* current_row, const std::vector<ExprContext*>& ctxs,
current_row -> row (it doesn't matter to this function that the row is 'current'). same for InitOutputPartition.

Line 213: /// Appends all rows in batch to the temporary Hdfs files corresponding to partitions.
what "corresponding to partitions" refers to is unclear.

Line 214: /// The input must be ordered by the partition key expressions.
the rows are expected to be ordered...?

--
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 13
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Marcel Kornacker <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Zoltan Ivanfi <[email protected]>
Gerrit-HasComments: Yes
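For reference, the fast path suggested for the Line 283 comment (check the last row's partition key; if it matches, skip the per-row loop and pass a nullptr index vector) might be sketched roughly as follows. This is a simplified standalone sketch, not Impala code: ClusteredWriter, Row, and PartitionKey are hypothetical stand-ins for RowBatch/OutputPartition, and only the control flow mirrors the review suggestion.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins for Impala's RowBatch / OutputPartition machinery.
using PartitionKey = std::string;
struct Row { PartitionKey key; int value; };

class ClusteredWriter {
 public:
  // Takes the index vector separately, per the review suggestion: a nullptr
  // means the entire batch goes to one partition, so no per-row indexing.
  void WriteRowsToPartition(const PartitionKey& key,
                            const std::vector<Row>& batch,
                            const std::vector<int>* row_indexes) {
    std::vector<Row>& part = partitions_[key];
    if (row_indexes == nullptr) {
      part.insert(part.end(), batch.begin(), batch.end());
    } else {
      for (int i : *row_indexes) part.push_back(batch[i]);
    }
  }

  // Input is ordered by partition key. If the last row's key matches the
  // first row's key, the whole batch belongs to one partition and the
  // per-row loop is skipped entirely (the suggested fast path).
  void WriteClusteredRowBatch(const std::vector<Row>& batch) {
    if (batch.empty()) return;
    if (batch.back().key == batch.front().key) {
      WriteRowsToPartition(batch.front().key, batch, /*row_indexes=*/nullptr);
      return;
    }
    // Slow path: group consecutive runs of identical keys.
    std::vector<int> indexes;
    PartitionKey current = batch.front().key;
    for (int i = 0; i < static_cast<int>(batch.size()); ++i) {
      if (batch[i].key != current) {
        WriteRowsToPartition(current, batch, &indexes);
        indexes.clear();
        current = batch[i].key;
      }
      indexes.push_back(i);
    }
    WriteRowsToPartition(current, batch, &indexes);
  }

  std::map<PartitionKey, std::vector<Row>> partitions_;
};
```

Because clustered input tends to produce long runs of the same key, most batches take the nullptr branch and avoid both the per-row key comparison and the index-vector bookkeeping.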
