szlta commented on PR #4868:
URL: https://github.com/apache/iceberg/pull/4868#issuecomment-1138615924

   Hi @kbendick , thanks a lot for taking a look.
   
   Yes you're right, this issue does need a discussion. Originally I observed 
this in https://github.com/apache/iceberg/pull/4662#issuecomment-1125098416 
where one unit test kept on failing and shed light on this discrepancy within 
Iceberg code in how PartitionFIeld names are generated. 
   
   
   > My immediate concern is that if an engine is currently generating columns 
in one style, and begins to generate them in another, this could be a very 
breaking change for people's tables.
   
   I think in practice we were already using the version where transform 
arguments were appended to the partition field names. This is done by 
`BaseUpdatePartitionSpec.java`. The other method of generating the default 
names in `PartitionSpec$Builder` is mostly used by test code only with the one 
exception of Spark3Util. Well in Iceberg code right.. .. this is a public class 
with public methods in iceberg-api, so we will want to be careful with it, e.g. 
Hive calls this too.
   
   > Additionally, if we append the number of buckets / truncation width at the 
end of the generated column name, are we opening ourselves up to the 
possibility that users will be able to partition twice on the same column using 
the same transform (e.g. bucket_data_6 and bucket_data_8 within the same 
partition spec)?
   
   Is there a limitation to do this? I think the API already allows doing this 
by specifying the target name. My change here alters the default naming 
convention. I think it is a possible scenario that data_bucket_2 and 
data_bucket_4 are both part of the spec as the structure will be a nested one. 
It's probably not useful for the majority of scenarios but nevertheless the 
option is there.
   
   About your question on how the integrating engines would be affected - I'd 
like contributors for each engine to chime in. I can only speak for Hive. In 
Hive we currently generate the "arg-less" version of the partition name during 
table creation. On the other hand if we alter the spec later then the "arg-ful" 
version will kick in.
   E.g.:
   ```
   create external table ptest (a string, b int) partitioned by spec 
(bucket(16, a)) stored by iceberg;
   alter table ptest set partition spec (bucket(16, a), bucket(2, b));
   describe default.ptest.partitions;
   +---------------+--------------------------------------+----------+
   |   col_name    |              data_type               | comment  |
   +---------------+--------------------------------------+----------+
   | partition     | struct<a_bucket:int,b_bucket_2:int>  |          |
   | record_count  | bigint                               |          |
   | file_count    | int                                  |          |
   | spec_id       | int                                  |          |
   +---------------+--------------------------------------+----------+
   ```
   doesn't look too nice, does it? But I think we're still early enough to 
unify these names in Hive.
   
   Perhaps we could also annotate the current PartitionSpec$Builder as 
deprecated and create a new version of it, that will already be the unified 
implementation, and hook unit tests, and everything else within Iceberg 
codebase to that. Then we can give 1-2 release worth of time before we remove 
the original implementation, so at least we won't cause an immediate backward 
incompatibility problem.
   
   Do let me know your thoughts.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to