Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/21761 )
Change subject: IMPALA-13364: Schema resolution doesn't work for migrated partitioned Iceberg tables that have complex types ...................................................................... Patch Set 4: (8 comments) Thanks for the comments! http://gerrit.cloudera.org:8080/#/c/21761/4/be/src/exec/file-metadata-utils.cc File be/src/exec/file-metadata-utils.cc: http://gerrit.cloudera.org:8080/#/c/21761/4/be/src/exec/file-metadata-utils.cc@95 PS4, Line 95: file_metadata > should we DCHECK if 'file_metadata' is not null Done http://gerrit.cloudera.org:8080/#/c/21761/4/be/src/exec/file-metadata-utils.cc@239 PS4, Line 239: int fileFieldID = *fieldID; > 'fieldId' could be null. Add a DCHECK? Done http://gerrit.cloudera.org:8080/#/c/21761/4/be/src/exec/file-metadata-utils.cc@241 PS4, Line 241: using namespace org::apache::impala::fb; > I think that the code between L241-L269 is not related to the actual fieldI I can refactor a bit if you feel strong about this, but currently it is only used by this function. Also, having complex return values then using only the first element and size wouldn't be much more readable than sourceIDs.front() and sourceIDs.back() IMO. http://gerrit.cloudera.org:8080/#/c/21761/4/be/src/exec/file-metadata-utils.cc@248 PS4, Line 248: for (int i = 0; i < transforms->size(); ++i) { > We don't need 'i' within the loop. Can this be a foreach? Done http://gerrit.cloudera.org:8080/#/c/21761/4/be/src/exec/file-metadata-utils.cc@251 PS4, Line 251: FbIcebergTransformType::FbIcebergTransformType_IDENTITY) { > nit: too much indentation Done http://gerrit.cloudera.org:8080/#/c/21761/4/be/src/exec/file-metadata-utils.cc@280 PS4, Line 280: return Status(errorMsg); > I think we can be more specific here with the error msg. We could let the u Based on the field IDs only we cannot tell what's wrong exactly, it's possible that * there are less data columns than expected * there are more data columns than expected * some partitions are stored, some aren't What we know for sure is that the schema or partitioning is wrong but we cannot tell what is the exact issue, and it's also cumbersome to test all different cases. It's easier to handle them together and tell the user that something is fishy with their files. I think it should be also fairly rare to have such data files. http://gerrit.cloudera.org:8080/#/c/21761/4/be/src/exec/parquet/parquet-metadata-utils.cc File be/src/exec/parquet/parquet-metadata-utils.cc: http://gerrit.cloudera.org:8080/#/c/21761/4/be/src/exec/parquet/parquet-metadata-utils.cc@1022 PS4, Line 1022: // Partition columns are not stored in file metadata, but they get field IDs > I'm just thinking about the feasibility of the comment I left in FileMetada I'm not sure I if get this, but collecting the children's field IDs, then doing the validation check, then adjusting field IDs seem to be more complicated to me than doing this check here. It is also checked only once, when we iterated over the top-level schema elements, and we just adjust a single integer 'fieldID'. Please note that we also fill out 'schema_node_to_field_id_' during the process. "So we come to this part of the code for the children of complex types, and only if this is a migrated, partitioned Iceberg table, right?" We enter the if statement if * table is migrated (otherwise GenerateFieldIDs() is not called) * we just iterated over the top-level columns (current == &schema_) * there are complex types to process (!nodes.empty()) http://gerrit.cloudera.org:8080/#/c/21761/4/tests/query_test/test_iceberg.py File tests/query_test/test_iceberg.py: http://gerrit.cloudera.org:8080/#/c/21761/4/tests/query_test/test_iceberg.py@281 PS4, Line 281: id INT, > line has trailing whitespace Done -- To view, visit http://gerrit.cloudera.org:8080/21761 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ie32952021b63d6b55b8820489e434bfc2a91580b Gerrit-Change-Number: 21761 Gerrit-PatchSet: 4 Gerrit-Owner: Zoltan Borok-Nagy <[email protected]> Gerrit-Reviewer: Daniel Becker <[email protected]> Gerrit-Reviewer: Gabor Kaszab <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Mihaly Szjatinya <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Fri, 13 Sep 2024 20:20:24 +0000 Gerrit-HasComments: Yes
