Yingchun Lai has posted comments on this change. ( http://gerrit.cloudera.org:8080/12563 )
Change subject: [tools] Add tool to copy table data to another table ...................................................................... Patch Set 6: (38 comments) http://gerrit.cloudera.org:8080/#/c/12563/5//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/12563/5//COMMIT_MSG@9 PS5, Line 9: ; > Nit: use a semicolon (';') here, since the two parts of the sentence are di Done http://gerrit.cloudera.org:8080/#/c/12563/5//COMMIT_MSG@11 PS5, Line 11: Alternatively, the tool can : create the new table using the same table and partition schema as the : source table. > Reword: "Alternatively, the tool can create the new table using the same ta Done http://gerrit.cloudera.org:8080/#/c/12563/1/src/kudu/common/partition.h File src/kudu/common/partition.h: http://gerrit.cloudera.org:8080/#/c/12563/1/src/kudu/common/partition.h@139 PS1, Line 139: struct RangeSchema { : std::vector<ColumnId> column_ids; : }; > Does this need to be public? The new added range_partition_ids() return RangeSchema now, to keep it the same style with hash_partition_schemas() which return std::vector<HashBucketSchema>, so it's needed to be public now. http://gerrit.cloudera.org:8080/#/c/12563/1/src/kudu/common/schema.h File src/kudu/common/schema.h: http://gerrit.cloudera.org:8080/#/c/12563/1/src/kudu/common/schema.h@899 PS1, Line 899: int find_first_is_deleted_virtual_column() const { > Is this necessary? Couldn't you get by with CopyWithoutColumnIds(), which c Done http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/common/schema.cc File src/kudu/common/schema.cc: http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/common/schema.cc@a348 PS5, Line 348: > Why do we need to lift this restriction? Since the function name is 'CopyWithoutColumnIds', I think it's not needed to require column ids. On the other hand, when the source schema is a client side schema which has no column ids, the CHECK will fail. http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.h File src/kudu/tools/table_scanner.h: http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.h@55 PS5, Line 55: total_count_(0), > You can avoid the second constructor by adding the two new params into the Done http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.h@64 PS5, Line 64: : private: : enum class WorkType { : kScan, : kCopy > I'd document the entire class as not thread-safe, though I don't think anyo Done http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.h@78 PS5, Line 78: Status AddRow(const client::sp::shared_ptr<kudu::client::KuduTable>& table, : const kud > Should wrap these in boost::optional to emphasize their optionality. Done http://gerrit.cloudera.org:8080/#/c/12563/1/src/kudu/tools/table_scanner.cc File src/kudu/tools/table_scanner.cc: http://gerrit.cloudera.org:8080/#/c/12563/1/src/kudu/tools/table_scanner.cc@335 PS1, Line 335: return Status::OK(); > Curious as to whether we could instead remove column IDs from client schema I searched the whole kudu project, column IDs used by client side KuduSchema seemed only in cases like iterate all columns or key columns, I agree it's reasonable to remove it. Instead, we can add some other functions to achieve this goal. However, we must modify some client APIs(i.e. remove KuduSchema::Column(idx)), this may introduce some compatibility issues. http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc File src/kudu/tools/table_scanner.cc: http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@110 PS5, Line 110: "For example,\n" : R"*( ["AND", [ > Nit: reword as "How data should be copied to the destination table. Valid v Like 'write_type=' can specify an empty value for the flag. And there is an unit test for it, 'TEST_F(ToolTest, TestCopyTableSchemaOnly)' gflag validator: done http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@318 PS5, Line 318: tArray(pr > Nit: exists Done http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@321 PS5, Line 321: > Nit: "Destination table is as the same as the source table" Done http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@327 PS5, Line 327: > Nit: does NOT exist. Done http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@329 PS5, Line 329: > Nit: does not exist Done http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@339 PS5, Line 339: > Nit: rename to "convert_column_ids_to_names". Done http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@343 PS5, Line 343: () && !s. > Nit: emplace_back() for new code. Done http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@384 PS5, Line 384: for (const auto& hash_partition_schema : partition_schema.hash_partition_schemas()) { : auto hash_columns = convert_column_ids_to_names(ha > Store these in unique_ptrs and then release() them into add_range_partition Done http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@396 PS5, Line 396: > Nit: "created successfully" Done http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@401 PS5, Line 401: for (const auto& partition : partitions) { > Can we do something simpler here where we copy the raw data from 'row' into Done http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@403 PS5, Line 403: const auto& hash_buckets = partition.hash_buckets(); > Nit: rename to src_row and rename write_row to dst_row. Done http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@411 PS5, Line 411: ScopedDisableRedaction no_redaction; > This can LOG(FATAL) I think; it'd be a programming error to wind up here. Done http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@546 PS5, Line 546: > LOG() will automatically terminate messages with endl, so you can remove th Done http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@552 PS5, Line 552: > Where is this used? outdated code, I'll remove it. http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@614 PS5, Line 614: > I think you can reuse much of the code here. Decompose the bulk of the scan Done http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@640 PS5, Line 640: > Can do a simple for loop: Done http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@644 PS5, Line 644: > With AUTO_FLUSH_BACKGROUND why bother with the explicit Flush()? If you wan The explicit Flush() is aid to make sure all write operations have been sent. Close(bool force) will not work correctly here, with 'true' it will drop pending operations, while with 'false' it will return an 'IllegalState' status simply, I have to call it several times to check the result util it returns 'OK'. http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@668 PS5, Line 668: > Seems like the bulk of this code be reused via decomposition too. Done http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@726 PS5, Line 726: > Nit: "Create table only." Done http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/table_scanner.cc@749 PS5, Line 749: > Nit: emplace_back for new code. Done http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc File src/kudu/tools/tool_action_table.cc: http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc@76 PS5, Line 76: DEFINE_string(dst_table, "", > Doc what it means if this is the empty string (i.e. if the option is not se Done http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc@139 PS5, Line 139: const char* const kNewColumnNameArg = "new_column_name"; > Nit: reword as "dest_master_addresses"; the 'e' is only one extra char but Done http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc@152 PS5, Line 152: } : : Status CreateKuduClient(const RunnerContext& context, : client::sp::shared_ptr<KuduClient>* client) { : return CreateKuduClient(context, kMasterAddressesArg, client); : } : : Status DeleteTable(const RunnerContext& context) { : > Doesn't seem like useful decomposotion; just call CreateKuduClient with dif It's OK for CreateDstClusterKuduClient, however, for CreateKuduClient, I just want to keep compatible with old function parameters which is everywhere in tools code. http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc@423 PS5, Line 423: == FindOrDie(context.required_args, kDestMasterAddressesArg)) { : dst_client = src_client; : } else { > Would be nice to avoid this if copying into the same cluster. Done http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc@512 PS5, Line 512: unique_ptr<Action> copy_table = > Please update this using the suggestions I left in the commit message. Done http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc@517 PS5, Line 517: "could have different partition schemas. Alternatively, the tool can " > Since this tool allows for both a source and a destination cluster, I'd ove Done http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc@518 PS5, Line 518: "create the new table using the same table and partition schema as the " > Nit: reword as "Name of the source table" Done http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc@519 PS5, Line 519: "source table.") > Shouldn't this parameter be optional? If not provided, we could assume an i I think it should be required. Since 'dst_table' is an optional flag, so when both 'dst_table' and kDstMasterAddressesArg are not specified, the action of this tool will copy itself to itself. And the most use case of this tool is to copy data to another cluster in my opinion. http://gerrit.cloudera.org:8080/#/c/12563/5/src/kudu/tools/tool_action_table.cc@520 PS5, Line 520: .AddRequiredParameter({ kMasterAddressesArg, > Nit: reword as: "Comma-separated list of Kudu Master addresses (destination Done -- To view, visit http://gerrit.cloudera.org:8080/12563 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ifdec51701ac9ec57739b1a6f7c18786294642a54 Gerrit-Change-Number: 12563 Gerrit-PatchSet: 6 Gerrit-Owner: Yingchun Lai <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-Reviewer: Yingchun Lai <[email protected]> Gerrit-Comment-Date: Thu, 21 Mar 2019 05:32:05 +0000 Gerrit-HasComments: Yes
