Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/20760 )
Change subject: IMPALA-12313: (part 3) Add UPDATE support for Iceberg tables. ...................................................................... Patch Set 6: (21 comments) Thanks for the comments! http://gerrit.cloudera.org:8080/#/c/20760/5//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/20760/5//COMMIT_MSG@26 PS5, Line 26: FlushFinal > Can it spill to disk if the data it has received can't be stored in memory? It cannot spill, though it's not realistic to have that many delete records buffered. I've added a paragraph about it. http://gerrit.cloudera.org:8080/#/c/20760/5/be/src/exec/iceberg-buffered-delete-sink.h File be/src/exec/iceberg-buffered-delete-sink.h: http://gerrit.cloudera.org:8080/#/c/20760/5/be/src/exec/iceberg-buffered-delete-sink.h@72 PS5, Line 72: /// Nested iterator class to conveniently iterate over a FilePositions object. > We could also DCHECK that none of the vectors in the map are empty either. Here this would need another iteration over the map (though only in debug mode). Instead of this I added a DCHECK to VerifyBufferedRecords(). http://gerrit.cloudera.org:8080/#/c/20760/5/be/src/exec/iceberg-buffered-delete-sink.h@91 PS5, Line 91: /// that SortBufferedRecords() has been called already. > This condition should always be true, we've just dereferenced 'pos_level_it Done http://gerrit.cloudera.org:8080/#/c/20760/4/be/src/exec/iceberg-buffered-delete-sink.h File be/src/exec/iceberg-buffered-delete-sink.h: http://gerrit.cloudera.org:8080/#/c/20760/4/be/src/exec/iceberg-buffered-delete-sink.h@35 PS4, Line 35: /// IcebergBufferedDeleteSink buffers the Iceberg position delete records in its > There could be a comment describing when this class is/should be used, e.g. Added comment. http://gerrit.cloudera.org:8080/#/c/20760/3/be/src/exec/iceberg-buffered-delete-sink.h File be/src/exec/iceberg-buffered-delete-sink.h: http://gerrit.cloudera.org:8080/#/c/20760/3/be/src/exec/iceberg-buffered-delete-sink.h@45 PS3, Line 45: > Does 'partition_descriptor_map_' exist? Updated the comments. http://gerrit.cloudera.org:8080/#/c/20760/3/be/src/exec/iceberg-buffered-delete-sink.h@48 PS3, Line 48: state, MemTracker* parent_m > I can't see that it has changed in this file, though it did change in icebe Yeah, sorry, updated the wrong comments. Apart from Prepare/Open/Send/FlushFinal/Close I didn't copy legacy comments. http://gerrit.cloudera.org:8080/#/c/20760/3/be/src/exec/iceberg-buffered-delete-sink.h@69 PS3, Line 69: /// It's necessary to use a std::map so we can get back the file paths in order. > The typedefs could stay here, the class could be forward-declared here and You're right, I didn't think of that. http://gerrit.cloudera.org:8080/#/c/20760/4/be/src/exec/iceberg-buffered-delete-sink.cc File be/src/exec/iceberg-buffered-delete-sink.cc: http://gerrit.cloudera.org:8080/#/c/20760/4/be/src/exec/iceberg-buffered-delete-sink.cc@345 PS4, Line 345: int capacity = batch->capacity(); > See comment at iceberg-delete-sink-base.cc:66. Done http://gerrit.cloudera.org:8080/#/c/20760/4/be/src/exec/iceberg-delete-sink-base.h File be/src/exec/iceberg-delete-sink-base.h: http://gerrit.cloudera.org:8080/#/c/20760/4/be/src/exec/iceberg-delete-sink-base.h@30 PS4, Line 30: IcebergDeleteSinkBase > Could we make this class pure virtual? I'm not sure what do you mean by that. It is currently an abstract class as it doesn't override everything from DataSink(), e.g. Send(). http://gerrit.cloudera.org:8080/#/c/20760/4/be/src/exec/iceberg-delete-sink-base.h@63 PS4, Line 63: TIcebergPartitionTransformType::type transform_type, const std::string& value, > An example would be useful. Added comment. http://gerrit.cloudera.org:8080/#/c/20760/4/be/src/exec/iceberg-delete-sink-base.h@70 PS4, Line 70: > Does the unbuffered IcebergDeleteSink also need a separate DmlExecState? Is Moved it to IcebergBufferedDeleteSink. http://gerrit.cloudera.org:8080/#/c/20760/4/be/src/exec/iceberg-delete-sink-base.cc File be/src/exec/iceberg-delete-sink-base.cc: http://gerrit.cloudera.org:8080/#/c/20760/4/be/src/exec/iceberg-delete-sink-base.cc@66 PS4, Line 66: TIcebergPartitionTransformType::type transform_type, const std::string& value, > The 'closed_' field is a bit strange, AFAICS it comes from DataSink and the I don't really find the comment misleading. Subclasses need to check it and set it to false either directly or indirectly via SuperClass::Close(state); I rather not modify that comment in this CR. http://gerrit.cloudera.org:8080/#/c/20760/4/be/src/exec/iceberg-delete-sink-base.cc@88 PS4, Line 88: ScalarExprEvaluator* spec_id_eval = partition_key_expr_evals_[0]; > Nit: doesn't it fit on the previous line? Done http://gerrit.cloudera.org:8080/#/c/20760/4/be/src/exec/iceberg-delete-sink-base.cc@100 PS4, Line 100: _ > Why is it plural? Maybe it could be 'partition_values_eval', like the argum Done http://gerrit.cloudera.org:8080/#/c/20760/4/be/src/exec/iceberg-delete-sink-base.cc@104 PS4, Line 104: > See L100. Done http://gerrit.cloudera.org:8080/#/c/20760/4/be/src/exec/iceberg-delete-sink-base.cc@112 PS4, Line 112: // non void partition names and transforms. > Optional: if this function is called from the other ConstructPartitionInfo( Yeah, currently it is checked twice when the first overload is being used, but it is a very cheap check on a non-hot code path. Regarding to code, with that refactoring we would have one more method, and we would still need to do the checks in the top level methods. So I'd rather keep it as is. http://gerrit.cloudera.org:8080/#/c/20760/4/be/src/exec/iceberg-delete-sink-base.cc@113 PS4, Line 113: non_void_partition_names = table_desc_->IcebergNonVoidPartitionNames(); > Is there a valid use case for calling this function (and providing a 'spec_ Added DCHECK. We should only fall to this code path if the table is unpartitioned, and there was no partition evolution. http://gerrit.cloudera.org:8080/#/c/20760/4/be/src/exec/iceberg-delete-sink-base.cc@159 PS4, Line 159: Status transform_status; > Is this ever written? Removed. http://gerrit.cloudera.org:8080/#/c/20760/4/be/src/exec/iceberg-delete-sink-base.cc@164 PS4, Line 164: > It could be 'url_encoded_...' - "encoded" is ambiguous, it could also refer Done http://gerrit.cloudera.org:8080/#/c/20760/4/be/src/exec/iceberg-delete-sink-base.cc@169 PS4, Line 169: url_enc > Is it intentional that we overwrite the elements of 'partition_values_decod We save a string copy here, but to be honest, it shouldn't really matter as this is not a hot code path. OTOH, partition_values_decoded is a local variable, so I think it also doesn't matter that we modify its values. http://gerrit.cloudera.org:8080/#/c/20760/5/be/src/exec/iceberg-delete-sink.h File be/src/exec/iceberg-delete-sink.h: http://gerrit.cloudera.org:8080/#/c/20760/5/be/src/exec/iceberg-delete-sink.h@66 PS5, Line 66: /// having the same filepath + position), because that would corrupt the table data > We could mention the case when the different deletes go to different sinks That case is handled in IcebergUpdateImpl.java, so I added the comment there. At the sink's level we just need to now that we must raise errors for duplicates. I don't really want to copy very long explanations to multiple places. -- To view, visit http://gerrit.cloudera.org:8080/20760 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2bb97b4454165a292975d88dc9c23adb22ff7315 Gerrit-Change-Number: 20760 Gerrit-PatchSet: 6 Gerrit-Owner: Zoltan Borok-Nagy <[email protected]> Gerrit-Reviewer: Daniel Becker <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Tamas Mate <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Tue, 12 Dec 2023 11:28:28 +0000 Gerrit-HasComments: Yes
