Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/12563 )
Change subject: [tools] Add tool to copy table data to another table ...................................................................... Patch Set 5: (36 comments) Didn't review the new tests yet, had a bunch of comments on the code. http://gerrit.cloudera.org:8080/#/c/12563/5//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/12563/5//COMMIT_MSG@9 PS5, Line 9: , Nit: use a semicolon (';') here, since the two parts of the sentence are distinct and separate thoughts. http://gerrit.cloudera.org:8080/#/c/12563/5//COMMIT_MSG@11 PS5, Line 11: And also it's able to create : new table with the same table schema and partition schema with the : source table by this tool. Reword: "Alternatively, the tool can create the new table using the same table and partition schema as the source table." http://gerrit.cloudera.org:8080/#/c/12563/1/src/kudu/common/partition.h File src/kudu/common/partition.h: http://gerrit.cloudera.org:8080/#/c/12563/1/src/kudu/common/partition.h@139 PS1, Line 139: struct RangeSchema { : std::vector<ColumnId> column_ids; : }; > Does this need to be public? Could you answer this question? http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/common/schema.cc File src/kudu/common/schema.cc: http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/common/schema.cc@a348 PS5, Line 348: Why do we need to lift this restriction? http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.h File src/kudu/tools/table_scanner.h: http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.h@55 PS5, Line 55: TableScanner(client::sp::shared_ptr<KuduClient> client, string table_name, You can avoid the second constructor by adding the two new params into the existing constructor with default values of boost::none (if you wrap them with boost::optional). http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.h@64 PS5, Line 64: // Not thread-safe. : Status StartScan(); : : // Not thread-safe. : Status StartCopy(); I'd document the entire class as not thread-safe, though I don't think anyone would expect it to be. http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.h@78 PS5, Line 78: client::sp::shared_ptr<KuduClient> dst_client_; : string dst_table_name_; Should wrap these in boost::optional to emphasize their optionality. http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc File src/kudu/tools/table_scanner.cc: http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@110 PS5, Line 110: "Table data write type, 'insert', 'upsert' or empty string, while empty string means " : "not copy data." Nit: reword as "How data should be copied to the destination table. Valid values are 'insert', 'upsert' or the empty string. If the empty string, data will not be copied (useful when create_table is 'true')." How do you actually specify the empty string though? Is there a test for this? Also please add a gflag validator to enforce that the flag's value is one of those accepted values. http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@318 PS5, Line 318: is exist. Nit: exists http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@321 PS5, Line 321: Destination table is the same with the source. Nit: "Destination table is as the same as the source table" http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@327 PS5, Line 327: is NOT exist. Nit: does NOT exist. http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@329 PS5, Line 329: is not Nit: does not exist http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@339 PS5, Line 339: ColumnNames Nit: rename to "convert_column_ids_to_names". http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@343 PS5, Line 343: push_back Nit: emplace_back() for new code. http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@384 PS5, Line 384: auto lower = new KuduPartialRow(&schema_internal); : auto upper = new KuduPartialRow(&schema_internal); Store these in unique_ptrs and then release() them into add_range_partition. That'll prevent a memory leak if a DecodeRangeKey call fails. http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@396 PS5, Line 396: create success Nit: "created successfully" http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@401 PS5, Line 401: Status AddRow(const client::sp::shared_ptr<KuduTable>& table, Can we do something simpler here where we copy the raw data from 'row' into 'write_row'? Shouldn't need to expand this for every type since this is an internal tool; we can poke holes with friendship in KuduPartialRow and/or KuduScanBatch::RowPtr if need be. http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@403 PS5, Line 403: const KuduScanBatch::RowPtr& row, Nit: rename to src_row and rename write_row to dst_row. http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@411 PS5, Line 411: return Status::InvalidArgument(Substitute("invalid write_type: $0", FLAGS_write_type)); This can LOG(FATAL) I think; it'd be a programming error to wind up here. http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@546 PS5, Line 546: LOG(ERROR) << error->status().ToString() << endl; LOG() will automatically terminate messages with endl, so you can remove this. Elsewhere too. http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@552 PS5, Line 552: KuduValue* ParseValue(KuduColumnSchema::DataType type, Where is this used? http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@614 PS5, Line 614: void TableScanner::CopyTask(const vector<KuduScanToken*>& tokens) { I think you can reuse much of the code here. Decompose the bulk of the scan loop into a separate method that takes a functor as an argument. The idea is to call the functor for each row. You could parameterize the "scanned" vs. "copied" string too, if that's important to you. http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@640 PS5, Line 640: for (auto it = batch.begin(); it != batch.end(); ++it) { Can do a simple for loop: for (const auto& row : batch) http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@644 PS5, Line 644: Status s = session->Flush(); With AUTO_FLUSH_BACKGROUND why bother with the explicit Flush()? If you want to check for errors, periodically poll the error collector. You probably want to close the session at the end though. http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@668 PS5, Line 668: Status TableScanner::StartScan() { Seems like the bulk of this code be reused via decomposition too. http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@726 PS5, Line 726: // Copy schema only. Nit: "Create table only." http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@749 PS5, Line 749: push_back Nit: emplace_back for new code. http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc File src/kudu/tools/tool_action_table.cc: http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc@76 PS5, Line 76: "The name of the destination table the data will be copied to."); Doc what it means if this is the empty string (i.e. if the option is not set). http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc@139 PS5, Line 139: const char* const kDstMasterAddressesArg = "dst_master_addresses"; Nit: reword as "dest_master_addresses"; the 'e' is only one extra char but helps emphasize that the abbreviation is short for "destination". http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc@152 PS5, Line 152: Status CreateKuduClient(const RunnerContext& context, : client::sp::shared_ptr<KuduClient>* client) { : return CreateKuduClient(context, kMasterAddressesArg, client); : } : : Status CreateDstClusterKuduClient(const RunnerContext& context, : client::sp::shared_ptr<KuduClient>* client) { : return CreateKuduClient(context, kDstMasterAddressesArg, client); : } Doesn't seem like useful decomposotion; just call CreateKuduClient with different args directly. http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc@423 PS5, Line 423: : client::sp::shared_ptr<KuduClient> dst_client; : RETURN_NOT_OK(CreateDstClusterKuduClient(context, &dst_client)); Would be nice to avoid this if copying into the same cluster. http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc@512 PS5, Line 512: .ExtraDescription("Copy table data to another table, the two tables could be in the same " Please update this using the suggestions I left in the commit message. http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc@517 PS5, Line 517: .AddRequiredParameter({ kMasterAddressesArg, kMasterAddressesArgDesc }) Since this tool allows for both a source and a destination cluster, I'd override the description of this parameter with something like: "Comma-separated list of Kudu Master addresses (source) where each address is of form 'hostname:port'". http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc@518 PS5, Line 518: .AddRequiredParameter({ kTableNameArg, "Name of the table copied from." }) Nit: reword as "Name of the source table" http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc@519 PS5, Line 519: .AddRequiredParameter({ kDstMasterAddressesArg, Shouldn't this parameter be optional? If not provided, we could assume an intra-cluster table copy. http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc@520 PS5, Line 520: "The master addresses of the destination cluster " Nit: reword as: "Comma-separated list of Kudu Master addresses (destination) where each address is of form 'hostname:port'". -- To view, visit http://gerrit.cloudera.org:8080/12563 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ifdec51701ac9ec57739b1a6f7c18786294642a54 Gerrit-Change-Number: 12563 Gerrit-PatchSet: 5 Gerrit-Owner: Yingchun Lai <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-Reviewer: Yingchun Lai <[email protected]> Gerrit-Comment-Date: Fri, 15 Mar 2019 19:22:05 +0000 Gerrit-HasComments: Yes
