Adar Dembo has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12563 )

Change subject: [tools] Add tool to copy table data to another table
......................................................................


Patch Set 5:

(36 comments)

Didn't review the new tests yet, had a bunch of comments on the code.

http://gerrit.cloudera.org:8080/#/c/12563/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12563/5//COMMIT_MSG@9
PS5, Line 9: ,
Nit: use a semicolon (';') here, since the two parts of the sentence are 
distinct and separate thoughts.


http://gerrit.cloudera.org:8080/#/c/12563/5//COMMIT_MSG@11
PS5, Line 11: And also it's able to create
            : new table with the same table schema and partition schema with the
            : source table by this tool.
Reword: "Alternatively, the tool can create the new table using the same table 
and partition schema as the source table."


http://gerrit.cloudera.org:8080/#/c/12563/1/src/kudu/common/partition.h
File src/kudu/common/partition.h:

http://gerrit.cloudera.org:8080/#/c/12563/1/src/kudu/common/partition.h@139
PS1, Line 139:   struct RangeSchema {
             :     std::vector<ColumnId> column_ids;
             :   };
> Does this need to be public?
Could you answer this question?


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/common/schema.cc
File src/kudu/common/schema.cc:

http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/common/schema.cc@a348
PS5, Line 348:
Why do we need to lift this restriction?


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.h
File src/kudu/tools/table_scanner.h:

http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.h@55
PS5, Line 55:   TableScanner(client::sp::shared_ptr<KuduClient> client, string 
table_name,
You can avoid the second constructor by adding the two new params into the 
existing constructor with default values of boost::none (if you wrap them with 
boost::optional).


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.h@64
PS5, Line 64:   // Not thread-safe.
            :   Status StartScan();
            :
            :   // Not thread-safe.
            :   Status StartCopy();
I'd document the entire class as not thread-safe, though I don't think anyone 
would expect it to be.


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.h@78
PS5, Line 78:   client::sp::shared_ptr<KuduClient> dst_client_;
            :   string dst_table_name_;
Should wrap these in boost::optional to emphasize their optionality.


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc
File src/kudu/tools/table_scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@110
PS5, Line 110:               "Table data write type, 'insert', 'upsert' or 
empty string, while empty string means "
             :               "not copy data."
Nit: reword as "How data should be copied to the destination table. Valid 
values are 'insert', 'upsert' or the empty string. If the empty string, data 
will not be copied (useful when create_table is 'true')."

How do you actually specify the empty string though? Is there a test for this?

Also please add a gflag validator to enforce that the flag's value is one of 
those accepted values.


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@318
PS5, Line 318: is exist.
Nit: exists


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@321
PS5, Line 321: Destination table is the same with the source.
Nit: "Destination table is as the same as the source table"


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@327
PS5, Line 327: is NOT exist.
Nit: does NOT exist.


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@329
PS5, Line 329: is not
Nit: does not exist


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@339
PS5, Line 339: ColumnNames
Nit: rename to "convert_column_ids_to_names".


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@343
PS5, Line 343: push_back
Nit: emplace_back() for new code.


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@384
PS5, Line 384:     auto lower = new KuduPartialRow(&schema_internal);
             :     auto upper = new KuduPartialRow(&schema_internal);
Store these in unique_ptrs and then release() them into add_range_partition. 
That'll prevent a memory leak if a DecodeRangeKey call fails.


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@396
PS5, Line 396: create success
Nit: "created successfully"


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@401
PS5, Line 401: Status AddRow(const client::sp::shared_ptr<KuduTable>& table,
Can we do something simpler here where we copy the raw data from 'row' into 
'write_row'? Shouldn't need to expand this for every type since this is an 
internal tool; we can poke holes with friendship in KuduPartialRow and/or 
KuduScanBatch::RowPtr if need be.


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@403
PS5, Line 403:               const KuduScanBatch::RowPtr& row,
Nit: rename to src_row and rename write_row to dst_row.


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@411
PS5, Line 411:     return Status::InvalidArgument(Substitute("invalid 
write_type: $0", FLAGS_write_type));
This can LOG(FATAL) I think; it'd be a programming error to wind up here.


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@546
PS5, Line 546:     LOG(ERROR) << error->status().ToString() << endl;
LOG() will automatically terminate messages with endl, so you can remove this.

Elsewhere too.


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@552
PS5, Line 552: KuduValue* ParseValue(KuduColumnSchema::DataType type,
Where is this used?


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@614
PS5, Line 614: void TableScanner::CopyTask(const vector<KuduScanToken*>& 
tokens) {
I think you can reuse much of the code here. Decompose the bulk of the scan 
loop into a separate method that takes a functor as an argument. The idea is to 
call the functor for each row. You could parameterize the "scanned" vs. 
"copied" string too, if that's important to you.


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@640
PS5, Line 640:       for (auto it = batch.begin(); it != batch.end(); ++it) {
Can do a simple for loop:

  for (const auto& row : batch)


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@644
PS5, Line 644:       Status s = session->Flush();
With AUTO_FLUSH_BACKGROUND why bother with the explicit Flush()? If you want to 
check for errors, periodically poll the error collector.

You probably want to close the session at the end though.


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@668
PS5, Line 668: Status TableScanner::StartScan() {
Seems like the bulk of this code be reused via decomposition too.


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@726
PS5, Line 726:     // Copy schema only.
Nit: "Create table only."


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@749
PS5, Line 749: push_back
Nit: emplace_back for new code.


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc
File src/kudu/tools/tool_action_table.cc:

http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc@76
PS5, Line 76:               "The name of the destination table the data will be 
copied to.");
Doc what it means if this is the empty string (i.e. if the option is not set).


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc@139
PS5, Line 139: const char* const kDstMasterAddressesArg = 
"dst_master_addresses";
Nit: reword as "dest_master_addresses"; the 'e' is only one extra char but 
helps emphasize that the abbreviation is short for "destination".


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc@152
PS5, Line 152: Status CreateKuduClient(const RunnerContext& context,
             :                         client::sp::shared_ptr<KuduClient>* 
client) {
             :   return CreateKuduClient(context, kMasterAddressesArg, client);
             : }
             :
             : Status CreateDstClusterKuduClient(const RunnerContext& context,
             :                                   
client::sp::shared_ptr<KuduClient>* client) {
             :   return CreateKuduClient(context, kDstMasterAddressesArg, 
client);
             : }
Doesn't seem like useful decomposotion; just call CreateKuduClient with 
different args directly.


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc@423
PS5, Line 423:
             :   client::sp::shared_ptr<KuduClient> dst_client;
             :   RETURN_NOT_OK(CreateDstClusterKuduClient(context, 
&dst_client));
Would be nice to avoid this if copying into the same cluster.


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc@512
PS5, Line 512:       .ExtraDescription("Copy table data to another table, the 
two tables could be in the same "
Please update this using the suggestions I left in the commit message.


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc@517
PS5, Line 517:       .AddRequiredParameter({ kMasterAddressesArg, 
kMasterAddressesArgDesc })
Since this tool allows for both a source and a destination cluster, I'd 
override the description of this parameter with something like: 
"Comma-separated list of Kudu Master addresses (source) where each address is 
of form 'hostname:port'".


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc@518
PS5, Line 518:       .AddRequiredParameter({ kTableNameArg, "Name of the table 
copied from." })
Nit: reword as "Name of the source table"


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc@519
PS5, Line 519:       .AddRequiredParameter({ kDstMasterAddressesArg,
Shouldn't this parameter be optional? If not provided, we could assume an 
intra-cluster table copy.


http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc@520
PS5, Line 520:                               "The master addresses of the 
destination cluster "
Nit: reword as: "Comma-separated list of Kudu Master addresses (destination) 
where each address is of form 'hostname:port'".



--
To view, visit http://gerrit.cloudera.org:8080/12563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifdec51701ac9ec57739b1a6f7c18786294642a54
Gerrit-Change-Number: 12563
Gerrit-PatchSet: 5
Gerrit-Owner: Yingchun Lai <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-Reviewer: Yingchun Lai <[email protected]>
Gerrit-Comment-Date: Fri, 15 Mar 2019 19:22:05 +0000
Gerrit-HasComments: Yes

Reply via email to