Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/19850 )
Change subject: IMPALA-11619: Improve Iceberg V2 reads with a custom Iceberg Position Delete operator ...................................................................... Patch Set 20: (31 comments) http://gerrit.cloudera.org:8080/#/c/19850/20//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/19850/20//COMMIT_MSG@9 PS20, Line 9: Implemented new exec node to optimize iceberg v2 read performance. Please elaborate on how this new operator works. http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.h File be/src/exec/iceberg-delete-builder.h: http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.h@50 PS20, Line 50: Since it is expected to only be : /// created and used by IcebergDeletePlanNode only, the DataSinkConfig::Init() : /// and DataSinkConfig::CreateSink() are not implemented for it. I see this comment is also present at PhjBuilderConfig, but I'm not sure if I understand it. Especially that CreateSink() and Init() are implemented. http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.h@57 PS20, Line 57: a nit: an http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.h@95 PS20, Line 95: /// Please write a short summary about how the builder works. http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.h@100 PS20, Line 100: nit: unnecessary empty line. http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.h@158 PS20, Line 158: spilling partitions This builder doesn't partition its data and doesn't spill. http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.h@180 PS20, Line 180: const Could be 'static const', or 'static constexpr' http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.h@180 PS20, Line 180: _ If it will be a static member then we'll not need the last '_'. http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.cc File be/src/exec/iceberg-delete-builder.cc: http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.cc@21 PS20, Line 21: #include <iomanip> Unused? http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.cc@156 PS20, Line 156: // if(delete_scan_node == nullptr){} Code line commented out http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.cc@175 PS20, Line 175: auto tuple_id_ = delete_scan_node->tnode_->hdfs_scan_node.tuple_id; : auto tuple_desc_ = runtime_state_->desc_tbl().GetTupleDescriptor(tuple_id_); : DCHECK(tuple_desc_->table_desc() != NULL); : auto hdfs_table_ = : static_cast<const HdfsTableDescriptor*>(tuple_desc_->table_desc()); : HdfsPartitionDescriptor* partition_desc = : hdfs_table_->GetPartition(split.partition_id()); These should be the same for all the splits as there's a single HMS partition for Icebergtables. http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.cc@184 PS20, Line 184: hdfs_table_->IsIcebergTable() This should be a DCHECK I think. http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.cc@200 PS20, Line 200: DCHECK(retval.second == true); We wn't hit this in RELEASE build, so the next statement would probably crash Impala. Probably it would be better to return a Status error. http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.cc@289 PS20, Line 289: inline bool IcebergDeleteBuilder::AppendRow( : BufferedTupleStream* stream, TupleRow* row, Status* status) { : if (LIKELY(stream->AddRow(row, status))) return true; : return false; : } Unused? http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.cc@331 PS20, Line 331: auto it = deleted_rows_.find(*file_path); nit: for readability, can we move this out of the if stmt? http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.h File be/src/exec/iceberg-delete-node.h: http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.h@61 PS20, Line 61: s Isn't it only one hash table? http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.h@68 PS20, Line 68: + 'that'? http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.h@81 PS20, Line 81: Redundant empty line http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.h@88 PS20, Line 88: virtual 'virtual' keywords are redundant here http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.h@106 PS20, Line 106: , nit: .? http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.h@119 PS20, Line 119: Unnecessary empty line http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.h@141 PS20, Line 141: the multiple 'the' http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.h@212 PS20, Line 212: const Can be 'static const' or 'constexpr'. Also the last '_' is not needed in this case. http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.cc File be/src/exec/iceberg-delete-node.cc: http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.cc@25 PS20, Line 25: #include "codegen/llvm-codegen.h" Unused http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.cc@163 PS20, Line 163: } : v missing empty line http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.cc@166 PS20, Line 166: attached attached multiple 'attached' http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.cc@218 PS20, Line 218: // Putting SCOPED_TIMER in the IR version of ProcessProbeBatch() causes weird exception : // handling IR in the xcompiled function, so call it here instead. This operator doesn't do codegen. http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.cc@333 PS20, Line 333: namespace impala { This line could be moved to L46 where we have 'using namespace impala' http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.cc@347 PS20, Line 347: current_probe_pos_ = *probe_pos; There could be a fast path when file_path == previous_file_path_ http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.cc@354 PS20, Line 354: current_file_path_ != previous_file_path_ Why do we need both 'current_file_path_' when we also have the parameter 'file_path'? http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.cc@380 PS20, Line 380: current_deleted_pos_row_id_ To me it seems like that the code has the assumption that the file positions are coming in strictly ascending order, even across row batch boundaries, but this might not be true when a file have multiple splits, and either the followings are true: * MT_DOP = 0 # The multi-threaded SCAN operator might produces rows simultaneously from different splits of the same file * PARTITIONED mode # If multiple splits of the same file are assigned to different SCAN operators, then the EXCHANGE operators can also get the rows simultaneously from different splits I think the first case can be resolved by resolving the state for each input row batch. The PARTITIONED case need to be double-checked, to make sure we don't coalesce small row batches in the EXCHANGE RECEIVERs. -- To view, visit http://gerrit.cloudera.org:8080/19850 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I024a61573c83bda5584f243c879d9ff39dd2dcfa Gerrit-Change-Number: 19850 Gerrit-PatchSet: 20 Gerrit-Owner: Gergely Fürnstáhl <[email protected]> Gerrit-Reviewer: Anonymous Coward <[email protected]> Gerrit-Reviewer: Gergely Fürnstáhl <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Thu, 22 Jun 2023 16:13:29 +0000 Gerrit-HasComments: Yes
