Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19850 )

Change subject: IMPALA-11619: Improve Iceberg V2 reads with a custom Iceberg 
Position Delete operator
......................................................................


Patch Set 20:

(31 comments)

http://gerrit.cloudera.org:8080/#/c/19850/20//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19850/20//COMMIT_MSG@9
PS20, Line 9: Implemented new exec node to optimize iceberg v2 read performance.
Please elaborate on how this new operator works.


http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.h
File be/src/exec/iceberg-delete-builder.h:

http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.h@50
PS20, Line 50: Since it is expected to only be
             : /// created and used by IcebergDeletePlanNode only, the 
DataSinkConfig::Init()
             : /// and DataSinkConfig::CreateSink() are not implemented for it.
I see this comment is also present at PhjBuilderConfig, but I'm not sure if I 
understand it. Especially that CreateSink() and Init() are implemented.


http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.h@57
PS20, Line 57: a
nit: an


http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.h@95
PS20, Line 95: ///
Please write a short summary about how the builder works.


http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.h@100
PS20, Line 100:
nit: unnecessary empty line.


http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.h@158
PS20, Line 158: spilling partitions
This builder doesn't partition its data and doesn't spill.


http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.h@180
PS20, Line 180: const
Could be 'static const', or 'static constexpr'


http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.h@180
PS20, Line 180: _
If it will be a static member then we'll not need the last '_'.


http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.cc
File be/src/exec/iceberg-delete-builder.cc:

http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.cc@21
PS20, Line 21: #include <iomanip>
Unused?


http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.cc@156
PS20, Line 156:         // if(delete_scan_node == nullptr){}
Code line commented out


http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.cc@175
PS20, Line 175:       auto tuple_id_ = 
delete_scan_node->tnode_->hdfs_scan_node.tuple_id;
              :       auto tuple_desc_ = 
runtime_state_->desc_tbl().GetTupleDescriptor(tuple_id_);
              :       DCHECK(tuple_desc_->table_desc() != NULL);
              :       auto hdfs_table_ =
              :           static_cast<const 
HdfsTableDescriptor*>(tuple_desc_->table_desc());
              :       HdfsPartitionDescriptor* partition_desc =
              :           hdfs_table_->GetPartition(split.partition_id());
These should be the same for all the splits as there's a single HMS partition 
for Icebergtables.


http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.cc@184
PS20, Line 184: hdfs_table_->IsIcebergTable()
This should be a DCHECK I think.


http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.cc@200
PS20, Line 200:       DCHECK(retval.second == true);
We wn't hit this in RELEASE build, so the next statement would probably crash 
Impala. Probably it would be better to return a Status error.


http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.cc@289
PS20, Line 289: inline bool IcebergDeleteBuilder::AppendRow(
              :     BufferedTupleStream* stream, TupleRow* row, Status* status) 
{
              :   if (LIKELY(stream->AddRow(row, status))) return true;
              :   return false;
              : }
Unused?


http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-builder.cc@331
PS20, Line 331: auto it = deleted_rows_.find(*file_path);
nit: for readability, can we move this out of the if stmt?


http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.h
File be/src/exec/iceberg-delete-node.h:

http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.h@61
PS20, Line 61: s
Isn't it only one hash table?


http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.h@68
PS20, Line 68:
+ 'that'?


http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.h@81
PS20, Line 81:
Redundant empty line


http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.h@88
PS20, Line 88: virtual
'virtual' keywords are redundant here


http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.h@106
PS20, Line 106: ,
nit: .?


http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.h@119
PS20, Line 119:
Unnecessary empty line


http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.h@141
PS20, Line 141: the
multiple 'the'


http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.h@212
PS20, Line 212: const
Can be 'static const' or 'constexpr'. Also the last '_' is not needed in this 
case.


http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.cc
File be/src/exec/iceberg-delete-node.cc:

http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.cc@25
PS20, Line 25: #include "codegen/llvm-codegen.h"
Unused


http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.cc@163
PS20, Line 163: }
              : v
missing empty line


http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.cc@166
PS20, Line 166: attached attached
multiple 'attached'


http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.cc@218
PS20, Line 218:   // Putting SCOPED_TIMER in the IR version of 
ProcessProbeBatch() causes weird exception
              :   // handling IR in the xcompiled function, so call it here 
instead.
This operator doesn't do codegen.


http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.cc@333
PS20, Line 333: namespace impala {
This line could be moved to L46 where we have 'using namespace impala'


http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.cc@347
PS20, Line 347:   current_probe_pos_ = *probe_pos;
There could be a fast path when file_path == previous_file_path_


http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.cc@354
PS20, Line 354: current_file_path_ != previous_file_path_
Why do we need both 'current_file_path_' when we also have the parameter 
'file_path'?


http://gerrit.cloudera.org:8080/#/c/19850/20/be/src/exec/iceberg-delete-node.cc@380
PS20, Line 380: current_deleted_pos_row_id_
To me it seems like that the code has the assumption that the file positions 
are coming in strictly ascending order, even across row batch boundaries, but 
this might not be true when a file have multiple splits, and either the 
followings are true:

 * MT_DOP = 0 # The multi-threaded SCAN operator might produces rows 
simultaneously from different splits of the same file
 * PARTITIONED mode # If multiple splits of the same file are assigned to 
different SCAN operators, then the EXCHANGE operators can also get the rows 
simultaneously from different splits

I think the first case can be resolved by resolving the state for each input 
row batch. The PARTITIONED case need to be double-checked, to make sure we 
don't coalesce small row batches in the EXCHANGE RECEIVERs.



--
To view, visit http://gerrit.cloudera.org:8080/19850
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I024a61573c83bda5584f243c879d9ff39dd2dcfa
Gerrit-Change-Number: 19850
Gerrit-PatchSet: 20
Gerrit-Owner: Gergely Fürnstáhl <[email protected]>
Gerrit-Reviewer: Anonymous Coward <[email protected]>
Gerrit-Reviewer: Gergely Fürnstáhl <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Thu, 22 Jun 2023 16:13:29 +0000
Gerrit-HasComments: Yes

Reply via email to