Thomas Tauber-Marshall has uploaded a new patch set (#2).

Change subject: PREVIEW: IMPALA-3742: partitions INSERTs into Kudu tables
PREVIEW: IMPALA-3742: partitions INSERTs into Kudu tables

Bulk inserts into Kudu are currently painful because we send rows in no particular order. This creates a lot of work for Kudu, which has to partition and sort the data before writing, and makes writes slow. We can alleviate this by sending the rows to Kudu already partitioned and sorted.

This patch partitions the rows to insert according to Kudu's partitioning scheme. A followup patch will deal with sorting.

It accomplishes this by inserting an exchange node into the plan before the insert, and then passing down an Expr to the DataStreamSender that calls into the Kudu client to determine the partition for each row. This has the added benefit of creating a general interface for passing arbitrary partitioning functions to DataStreamSender as Exprs.

This patch is a PREVIEW so we can decide whether we're happy with the partitioning API Kudu has proposed and get that in on the Kudu side. It does not have any tests and has not been tested for performance.

It also currently only works for tables with a single partition column, due to difficulties with passing arguments into the partitioning Expr. Some potential solutions:

1) Stamp out versions of the KuduPartitioning functions for different numbers of partitioning columns, up to a limit. This would be simple, but would place a hard limit on the number of partitioning columns in tables we can apply this optimization to.

2) Use the UDF varargs support. This would require casting all of the partitioning columns up to a common type.

3) Add a significant new feature to our UDF API that could be used here, e.g. add support for complex types such as Arrays, make it possible to do varargs with a type of AnyVal, introduce a BINARY type, etc. These would all be a significant amount of work, but potentially useful outside of this project.

4) Abandon the idea of passing a partitioning Expr and instead do something like the first version of this review, but cleaned up.

5) Something else entirely.
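To make the proposed interface easier to discuss, here is a minimal sketch of a sender that takes a pluggable partitioning function and uses it to route each row to a channel. It is written in Python for brevity and is not the code in this patch. `SketchSender`, `make_range_partitioner`, and the split points are hypothetical names and values chosen for illustration. The range lookup only stands in for the call into the Kudu client, whose real API is not shown here. Like the preview, it handles a single partition column.

```python
from bisect import bisect_right
from typing import Any, Callable, Iterable, List, Sequence, Tuple

Row = Tuple[Any, ...]
# A partitioning function maps a row to a partition index.
PartitionFn = Callable[[Row], int]


def make_range_partitioner(col_idx: int, split_points: Sequence[int]) -> PartitionFn:
    """Hypothetical stand-in for asking the Kudu client which partition a row belongs to.

    Only one partition column is supported, mirroring the limitation of the preview.
    """
    bounds = sorted(split_points)

    def partition(row: Row) -> int:
        # Values below bounds[0] map to partition 0, values in
        # [bounds[0], bounds[1]) map to partition 1, and so on.
        return bisect_right(bounds, row[col_idx])

    return partition


class SketchSender:
    """Toy sender: routes each row to a channel chosen by the supplied function."""

    def __init__(self, num_channels: int, partition_fn: PartitionFn) -> None:
        self.partition_fn = partition_fn
        self.channels: List[List[Row]] = [[] for _ in range(num_channels)]

    def send(self, rows: Iterable[Row]) -> None:
        for row in rows:
            # Several partitions may share a channel when there are
            # more partitions than channels.
            channel = self.partition_fn(row) % len(self.channels)
            self.channels[channel].append(row)


if __name__ == "__main__":
    sender = SketchSender(3, make_range_partitioner(0, [10, 20]))
    sender.send([(5, "a"), (25, "b"), (10, "c"), (15, "d")])
    for i, channel in enumerate(sender.channels):
        print(i, channel)
```

Because the sender only sees a function from row to partition index, any partitioning scheme can be swapped in without changing the sender. The open question in the list above is how such a function would receive more than one partition column.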
Change-Id: Ic10b3295159354888efcde3df76b0edb24161515
---
M be/src/exprs/CMakeLists.txt
M be/src/exprs/expr.cc
A be/src/exprs/partitioning-functions.cc
A be/src/exprs/partitioning-functions.h
M be/src/runtime/coordinator.cc
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/data-stream-sender.h
M bin/impala-config.sh
M common/thrift/Partitions.thrift
M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/KuduTable.java
M fe/src/main/java/org/apache/impala/planner/DataPartition.java
M fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
M fe/src/main/java/org/apache/impala/planner/TableSink.java
15 files changed, 281 insertions(+), 8 deletions(-)

git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/37/6037/2
--
To view, visit http://gerrit.cloudera.org:8080/6037
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic10b3295159354888efcde3df76b0edb24161515
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <[email protected]>
Gerrit-Reviewer: Matthew Jacobs <[email protected]>
