Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21761 )

Change subject: IMPALA-13364: Schema resolution doesn't work for migrated 
partitioned Iceberg tables that have complex types
......................................................................


Patch Set 4:

(8 comments)

Thanks for the comments!

http://gerrit.cloudera.org:8080/#/c/21761/4/be/src/exec/file-metadata-utils.cc
File be/src/exec/file-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/21761/4/be/src/exec/file-metadata-utils.cc@95
PS4, Line 95: file_metadata
> should we DCHECK if 'file_metadata' is not null
Done


http://gerrit.cloudera.org:8080/#/c/21761/4/be/src/exec/file-metadata-utils.cc@239
PS4, Line 239:   int fileFieldID = *fieldID;
> 'fieldId' could be null. Add a DCHECK?
Done


http://gerrit.cloudera.org:8080/#/c/21761/4/be/src/exec/file-metadata-utils.cc@241
PS4, Line 241:   using namespace org::apache::impala::fb;
> I think that the code between L241-L269 is not related to the actual fieldI
I can refactor a bit if you feel strong about this, but currently it is only 
used by this function. Also, having complex return values then using only the 
first element and size wouldn't be much more readable than sourceIDs.front() 
and sourceIDs.back() IMO.


http://gerrit.cloudera.org:8080/#/c/21761/4/be/src/exec/file-metadata-utils.cc@248
PS4, Line 248:   for (int i = 0; i < transforms->size(); ++i) {
> We don't need 'i' within the loop. Can this be a foreach?
Done


http://gerrit.cloudera.org:8080/#/c/21761/4/be/src/exec/file-metadata-utils.cc@251
PS4, Line 251:             
FbIcebergTransformType::FbIcebergTransformType_IDENTITY) {
> nit: too much indentation
Done


http://gerrit.cloudera.org:8080/#/c/21761/4/be/src/exec/file-metadata-utils.cc@280
PS4, Line 280:     return Status(errorMsg);
> I think we can be more specific here with the error msg. We could let the u
Based on the field IDs only we cannot tell what's wrong exactly, it's possible 
that
* there are less data columns than expected
* there are more data columns than expected
* some partitions are stored, some aren't

What we know for sure is that the schema or partitioning is wrong but we cannot 
tell what is the exact issue, and it's also cumbersome to test all different 
cases. It's easier to handle them together and tell the user that something is 
fishy with their files. I think it should be also fairly rare to have such data 
files.


http://gerrit.cloudera.org:8080/#/c/21761/4/be/src/exec/parquet/parquet-metadata-utils.cc
File be/src/exec/parquet/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/21761/4/be/src/exec/parquet/parquet-metadata-utils.cc@1022
PS4, Line 1022:       // Partition columns are not stored in file metadata, but 
they get field IDs
> I'm just thinking about the feasibility of the comment I left in FileMetada
I'm not sure I if get this, but collecting the children's field IDs, then doing 
the validation check, then adjusting field IDs seem to be more complicated to 
me than doing this check here. It is also checked only once, when we iterated 
over the top-level schema elements, and we just adjust a single integer 
'fieldID'.

Please note that we also fill out 'schema_node_to_field_id_' during the process.

"So we come to this part of the code for the children of complex types, and 
only if this is a migrated, partitioned Iceberg table, right?"
We enter the if statement if

* table is migrated (otherwise GenerateFieldIDs() is not called)
* we just iterated over the top-level columns (current == &schema_)
* there are complex types to process (!nodes.empty())


http://gerrit.cloudera.org:8080/#/c/21761/4/tests/query_test/test_iceberg.py
File tests/query_test/test_iceberg.py:

http://gerrit.cloudera.org:8080/#/c/21761/4/tests/query_test/test_iceberg.py@281
PS4, Line 281:             id INT,
> line has trailing whitespace
Done



--
To view, visit http://gerrit.cloudera.org:8080/21761
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie32952021b63d6b55b8820489e434bfc2a91580b
Gerrit-Change-Number: 21761
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: Daniel Becker <[email protected]>
Gerrit-Reviewer: Gabor Kaszab <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Mihaly Szjatinya <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Fri, 13 Sep 2024 20:20:24 +0000
Gerrit-HasComments: Yes

Reply via email to