Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20760 )

Change subject: IMPALA-12313: (part 3) Add UPDATE support for Iceberg tables.
......................................................................


Patch Set 6:

(21 comments)

Thanks for the comments!

http://gerrit.cloudera.org:8080/#/c/20760/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20760/5//COMMIT_MSG@26
PS5, Line 26: FlushFinal
> Can it spill to disk if the data it has received can't be stored in memory?
It cannot spill, though it's not realistic to have that many delete records 
buffered. I've added a paragraph about it.


http://gerrit.cloudera.org:8080/#/c/20760/5/be/src/exec/iceberg-buffered-delete-sink.h
File be/src/exec/iceberg-buffered-delete-sink.h:

http://gerrit.cloudera.org:8080/#/c/20760/5/be/src/exec/iceberg-buffered-delete-sink.h@72
PS5, Line 72:   /// Nested iterator class to conveniently iterate over a 
FilePositions object.
> We could also DCHECK that none of the vectors in the map are empty either.
Here this would need another iteration over the map (though only in debug 
mode). Instead of this I added a DCHECK to VerifyBufferedRecords().


http://gerrit.cloudera.org:8080/#/c/20760/5/be/src/exec/iceberg-buffered-delete-sink.h@91
PS5, Line 91:   /// that SortBufferedRecords() has been called already.
> This condition should always be true, we've just dereferenced 'pos_level_it
Done


http://gerrit.cloudera.org:8080/#/c/20760/4/be/src/exec/iceberg-buffered-delete-sink.h
File be/src/exec/iceberg-buffered-delete-sink.h:

http://gerrit.cloudera.org:8080/#/c/20760/4/be/src/exec/iceberg-buffered-delete-sink.h@35
PS4, Line 35: /// IcebergBufferedDeleteSink buffers the Iceberg position delete 
records in its
> There could be a comment describing when this class is/should be used, e.g.
Added comment.


http://gerrit.cloudera.org:8080/#/c/20760/3/be/src/exec/iceberg-buffered-delete-sink.h
File be/src/exec/iceberg-buffered-delete-sink.h:

http://gerrit.cloudera.org:8080/#/c/20760/3/be/src/exec/iceberg-buffered-delete-sink.h@45
PS3, Line 45:
> Does 'partition_descriptor_map_' exist?
Updated the comments.


http://gerrit.cloudera.org:8080/#/c/20760/3/be/src/exec/iceberg-buffered-delete-sink.h@48
PS3, Line 48: state, MemTracker* parent_m
> I can't see that it has changed in this file, though it did change in icebe
Yeah, sorry, updated the wrong comments.
Apart from Prepare/Open/Send/FlushFinal/Close I didn't copy legacy comments.


http://gerrit.cloudera.org:8080/#/c/20760/3/be/src/exec/iceberg-buffered-delete-sink.h@69
PS3, Line 69:   /// It's necessary to use a std::map so we can get back the 
file paths in order.
> The typedefs could stay here, the class could be forward-declared here and
You're right, I didn't think of that.


http://gerrit.cloudera.org:8080/#/c/20760/4/be/src/exec/iceberg-buffered-delete-sink.cc
File be/src/exec/iceberg-buffered-delete-sink.cc:

http://gerrit.cloudera.org:8080/#/c/20760/4/be/src/exec/iceberg-buffered-delete-sink.cc@345
PS4, Line 345:   int capacity = batch->capacity();
> See comment at iceberg-delete-sink-base.cc:66.
Done


http://gerrit.cloudera.org:8080/#/c/20760/4/be/src/exec/iceberg-delete-sink-base.h
File be/src/exec/iceberg-delete-sink-base.h:

http://gerrit.cloudera.org:8080/#/c/20760/4/be/src/exec/iceberg-delete-sink-base.h@30
PS4, Line 30: IcebergDeleteSinkBase
> Could we make this class pure virtual?
I'm not sure what do you mean by that. It is currently an abstract class as it 
doesn't override everything from DataSink(), e.g. Send().


http://gerrit.cloudera.org:8080/#/c/20760/4/be/src/exec/iceberg-delete-sink-base.h@63
PS4, Line 63:       TIcebergPartitionTransformType::type transform_type, const 
std::string& value,
> An example would be useful.
Added comment.


http://gerrit.cloudera.org:8080/#/c/20760/4/be/src/exec/iceberg-delete-sink-base.h@70
PS4, Line 70:
> Does the unbuffered IcebergDeleteSink also need a separate DmlExecState? Is
Moved it to IcebergBufferedDeleteSink.


http://gerrit.cloudera.org:8080/#/c/20760/4/be/src/exec/iceberg-delete-sink-base.cc
File be/src/exec/iceberg-delete-sink-base.cc:

http://gerrit.cloudera.org:8080/#/c/20760/4/be/src/exec/iceberg-delete-sink-base.cc@66
PS4, Line 66:     TIcebergPartitionTransformType::type transform_type, const 
std::string& value,
> The 'closed_' field is a bit strange, AFAICS it comes from DataSink and the
I don't really find the comment misleading. Subclasses need to check it and set 
it to false either directly or indirectly via SuperClass::Close(state);
I rather not modify that comment in this CR.


http://gerrit.cloudera.org:8080/#/c/20760/4/be/src/exec/iceberg-delete-sink-base.cc@88
PS4, Line 88:   ScalarExprEvaluator* spec_id_eval = 
partition_key_expr_evals_[0];
> Nit: doesn't it fit on the previous line?
Done


http://gerrit.cloudera.org:8080/#/c/20760/4/be/src/exec/iceberg-delete-sink-base.cc@100
PS4, Line 100: _
> Why is it plural? Maybe it could be 'partition_values_eval', like the argum
Done


http://gerrit.cloudera.org:8080/#/c/20760/4/be/src/exec/iceberg-delete-sink-base.cc@104
PS4, Line 104:
> See L100.
Done


http://gerrit.cloudera.org:8080/#/c/20760/4/be/src/exec/iceberg-delete-sink-base.cc@112
PS4, Line 112:     // non void partition names and transforms.
> Optional: if this function is called from the other ConstructPartitionInfo(
Yeah, currently it is checked twice when the first overload is being used, but 
it is a very cheap check on a non-hot code path.

Regarding to code, with that refactoring we would have one more method, and we 
would still need to do the checks in the top level methods. So I'd rather keep 
it as is.


http://gerrit.cloudera.org:8080/#/c/20760/4/be/src/exec/iceberg-delete-sink-base.cc@113
PS4, Line 113:     non_void_partition_names = 
table_desc_->IcebergNonVoidPartitionNames();
> Is there a valid use case for calling this function (and providing a 'spec_
Added DCHECK. We should only fall to this code path if the table is 
unpartitioned, and there was no partition evolution.


http://gerrit.cloudera.org:8080/#/c/20760/4/be/src/exec/iceberg-delete-sink-base.cc@159
PS4, Line 159:     Status transform_status;
> Is this ever written?
Removed.


http://gerrit.cloudera.org:8080/#/c/20760/4/be/src/exec/iceberg-delete-sink-base.cc@164
PS4, Line 164:
> It could be 'url_encoded_...' - "encoded" is ambiguous, it could also refer
Done


http://gerrit.cloudera.org:8080/#/c/20760/4/be/src/exec/iceberg-delete-sink-base.cc@169
PS4, Line 169: url_enc
> Is it intentional that we overwrite the elements of 'partition_values_decod
We save a string copy here, but to be honest, it shouldn't really matter as 
this is not a hot code path. OTOH, partition_values_decoded is a local 
variable, so I think it also doesn't matter that we modify its values.


http://gerrit.cloudera.org:8080/#/c/20760/5/be/src/exec/iceberg-delete-sink.h
File be/src/exec/iceberg-delete-sink.h:

http://gerrit.cloudera.org:8080/#/c/20760/5/be/src/exec/iceberg-delete-sink.h@66
PS5, Line 66:   /// having the same filepath + position), because that would 
corrupt the table data
> We could mention the case when the different deletes go to different sinks
That case is handled in IcebergUpdateImpl.java, so I added the comment there.
At the sink's level we just need to now that we must raise errors for 
duplicates. I don't really want to copy very long explanations to multiple 
places.



--
To view, visit http://gerrit.cloudera.org:8080/20760
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2bb97b4454165a292975d88dc9c23adb22ff7315
Gerrit-Change-Number: 20760
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: Daniel Becker <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Tamas Mate <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Tue, 12 Dec 2023 11:28:28 +0000
Gerrit-HasComments: Yes

Reply via email to