[kudu-CR] [tools] Add tool to copy table data to another table

Yingchun Lai (Code Review) Wed, 20 Mar 2019 22:32:24 -0700

Yingchun Lai has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12563 )


Change subject: [tools] Add tool to copy table data to another table
......................................................................


Patch Set 6:

(38 comments)

http://gerrit.cloudera.org:8080/#/c/12563/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12563/5//COMMIT_MSG@9
PS5, Line 9: ;
> Nit: use a semicolon (';') here, since the two parts of the sentence are di
Done


http://gerrit.cloudera.org:8080/#/c/12563/5//COMMIT_MSG@11
PS5, Line 11: Alternatively, the tool can
            : create the new table using the same table and partition schema as 
the
            : source table.
> Reword: "Alternatively, the tool can create the new table using the same ta
Done


http://gerrit.cloudera.org:8080/#/c/12563/1/src/kudu/common/partition.h
File src/kudu/common/partition.h:

http://gerrit.cloudera.org:8080/#/c/12563/1/src/kudu/common/partition.h@139
PS1, Line 139:   struct RangeSchema {
             :     std::vector<ColumnId> column_ids;
             :   };
> Does this need to be public?
The new added range_partition_ids() return RangeSchema now, to keep it the same 
style with hash_partition_schemas() which return std::vector<HashBucketSchema>, 
so it's needed to be public now.


http://gerrit.cloudera.org:8080/#/c/12563/1/src/kudu/common/schema.h
File src/kudu/common/schema.h:

http://gerrit.cloudera.org:8080/#/c/12563/1/src/kudu/common/schema.h@899
PS1, Line 899:   int find_first_is_deleted_virtual_column() const {
> Is this necessary? Couldn't you get by with CopyWithoutColumnIds(), which c
Done


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/common/schema.cc
File src/kudu/common/schema.cc:

http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/common/schema.cc@a348
PS5, Line 348:
> Why do we need to lift this restriction?
Since the function name is 'CopyWithoutColumnIds', I think it's not needed to 
require column ids.
On the other hand, when the source schema is a client side schema which has no 
column ids, the CHECK will fail.


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.h
File src/kudu/tools/table_scanner.h:

http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.h@55
PS5, Line 55:     total_count_(0),
> You can avoid the second constructor by adding the two new params into the
Done


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.h@64
PS5, Line 64:
            : private:
            :   enum class WorkType {
            :     kScan,
            :     kCopy
> I'd document the entire class as not thread-safe, though I don't think anyo
Done


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.h@78
PS5, Line 78:   Status AddRow(const 
client::sp::shared_ptr<kudu::client::KuduTable>& table,
            :                 const kud
> Should wrap these in boost::optional to emphasize their optionality.
Done


http://gerrit.cloudera.org:8080/#/c/12563/1/src/kudu/tools/table_scanner.cc
File src/kudu/tools/table_scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12563/1/src/kudu/tools/table_scanner.cc@335
PS1, Line 335:   return Status::OK();
> Curious as to whether we could instead remove column IDs from client schema
I searched the whole kudu project, column IDs used by client side KuduSchema 
seemed only in cases like iterate all columns or key columns, I agree it's 
reasonable to remove it. Instead, we can add some other functions to achieve 
this goal.
However, we must modify some client APIs(i.e. remove KuduSchema::Column(idx)), 
this may introduce some compatibility issues.


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc
File src/kudu/tools/table_scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@110
PS5, Line 110:               "For example,\n"
             :               R"*(   ["AND", [
> Nit: reword as "How data should be copied to the destination table. Valid v
Like 'write_type=' can specify an empty value for the flag.
And there is an unit test for it, 'TEST_F(ToolTest, TestCopyTableSchemaOnly)'
gflag validator: done


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@318
PS5, Line 318: tArray(pr
> Nit: exists
Done


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@321
PS5, Line 321:
> Nit: "Destination table is as the same as the source table"
Done


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@327
PS5, Line 327:
> Nit: does NOT exist.
Done


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@329
PS5, Line 329:
> Nit: does not exist
Done


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@339
PS5, Line 339:
> Nit: rename to "convert_column_ids_to_names".
Done


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@343
PS5, Line 343: () && !s.
> Nit: emplace_back() for new code.
Done


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@384
PS5, Line 384:   for (const auto& hash_partition_schema : 
partition_schema.hash_partition_schemas()) {
             :     auto hash_columns = convert_column_ids_to_names(ha
> Store these in unique_ptrs and then release() them into add_range_partition
Done


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@396
PS5, Line 396:
> Nit: "created successfully"
Done


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@401
PS5, Line 401:   for (const auto& partition : partitions) {
> Can we do something simpler here where we copy the raw data from 'row' into
Done


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@403
PS5, Line 403:     const auto& hash_buckets = partition.hash_buckets();
> Nit: rename to src_row and rename write_row to dst_row.
Done


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@411
PS5, Line 411:     ScopedDisableRedaction no_redaction;
> This can LOG(FATAL) I think; it'd be a programming error to wind up here.
Done


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@546
PS5, Line 546:
> LOG() will automatically terminate messages with endl, so you can remove th
Done


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@552
PS5, Line 552:
> Where is this used?
outdated code, I'll remove it.


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@614
PS5, Line 614:
> I think you can reuse much of the code here. Decompose the bulk of the scan
Done


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@640
PS5, Line 640:
> Can do a simple for loop:
Done


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@644
PS5, Line 644:
> With AUTO_FLUSH_BACKGROUND why bother with the explicit Flush()? If you wan
The explicit Flush() is aid to make sure all write operations have been sent.
Close(bool force) will not work correctly here, with 'true' it will drop 
pending operations, while with 'false' it will return an 'IllegalState' status 
simply, I have to call it several times to check the result util it returns 
'OK'.


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@668
PS5, Line 668:
> Seems like the bulk of this code be reused via decomposition too.
Done


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@726
PS5, Line 726:
> Nit: "Create table only."
Done


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@749
PS5, Line 749:
> Nit: emplace_back for new code.
Done


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc
File src/kudu/tools/tool_action_table.cc:

http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc@76
PS5, Line 76: DEFINE_string(dst_table, "",
> Doc what it means if this is the empty string (i.e. if the option is not se
Done


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc@139
PS5, Line 139: const char* const kNewColumnNameArg = "new_column_name";
> Nit: reword as "dest_master_addresses"; the 'e' is only one extra char but
Done


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc@152
PS5, Line 152: }
             :
             : Status CreateKuduClient(const RunnerContext& context,
             :                         client::sp::shared_ptr<KuduClient>* 
client) {
             :   return CreateKuduClient(context, kMasterAddressesArg, client);
             : }
             :
             : Status DeleteTable(const RunnerContext& context) {
             :
> Doesn't seem like useful decomposotion; just call CreateKuduClient with dif
It's OK for CreateDstClusterKuduClient, however, for CreateKuduClient, I just 
want to keep compatible with old function parameters which is everywhere in 
tools code.


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc@423
PS5, Line 423:      == FindOrDie(context.required_args, 
kDestMasterAddressesArg)) {
             :     dst_client = src_client;
             :   } else {
> Would be nice to avoid this if copying into the same cluster.
Done


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc@512
PS5, Line 512:   unique_ptr<Action> copy_table =
> Please update this using the suggestions I left in the commit message.
Done


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc@517
PS5, Line 517:                         "could have different partition schemas. 
Alternatively, the tool can "
> Since this tool allows for both a source and a destination cluster, I'd ove
Done


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc@518
PS5, Line 518:                         "create the new table using the same 
table and partition schema as the "
> Nit: reword as "Name of the source table"
Done


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc@519
PS5, Line 519:                         "source table.")
> Shouldn't this parameter be optional? If not provided, we could assume an i
I think it should be required. Since 'dst_table' is an optional flag, so when 
both 'dst_table' and kDstMasterAddressesArg are not specified, the action of 
this tool will copy itself to itself. And the most use case of this tool is to 
copy data to another cluster in my opinion.


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc@520
PS5, Line 520:       .AddRequiredParameter({ kMasterAddressesArg,
> Nit: reword as: "Comma-separated list of Kudu Master addresses (destination
Done



--
To view, visit http://gerrit.cloudera.org:8080/12563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifdec51701ac9ec57739b1a6f7c18786294642a54
Gerrit-Change-Number: 12563
Gerrit-PatchSet: 6
Gerrit-Owner: Yingchun Lai <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-Reviewer: Yingchun Lai <[email protected]>
Gerrit-Comment-Date: Thu, 21 Mar 2019 05:32:05 +0000
Gerrit-HasComments: Yes

[kudu-CR] [tools] Add tool to copy table data to another table

Reply via email to