szlta commented on PR #4868: URL: https://github.com/apache/iceberg/pull/4868#issuecomment-1138615924
Hi @kbendick , thanks a lot for taking a look. Yes you're right, this issue does need a discussion. Originally I observed this in https://github.com/apache/iceberg/pull/4662#issuecomment-1125098416 where one unit test kept on failing and shed light on this discrepancy within Iceberg code in how PartitionFIeld names are generated. > My immediate concern is that if an engine is currently generating columns in one style, and begins to generate them in another, this could be a very breaking change for people's tables. I think in practice we were already using the version where transform arguments were appended to the partition field names. This is done by `BaseUpdatePartitionSpec.java`. The other method of generating the default names in `PartitionSpec$Builder` is mostly used by test code only with the one exception of Spark3Util. Well in Iceberg code right.. .. this is a public class with public methods in iceberg-api, so we will want to be careful with it, e.g. Hive calls this too. > Additionally, if we append the number of buckets / truncation width at the end of the generated column name, are we opening ourselves up to the possibility that users will be able to partition twice on the same column using the same transform (e.g. bucket_data_6 and bucket_data_8 within the same partition spec)? Is there a limitation to do this? I think the API already allows doing this by specifying the target name. My change here alters the default naming convention. I think it is a possible scenario that data_bucket_2 and data_bucket_4 are both part of the spec as the structure will be a nested one. It's probably not useful for the majority of scenarios but nevertheless the option is there. About your question on how the integrating engines would be affected - I'd like contributors for each engine to chime in. I can only speak for Hive. In Hive we currently generate the "arg-less" version of the partition name during table creation. On the other hand if we alter the spec later then the "arg-ful" version will kick in. E.g.: ``` create external table ptest (a string, b int) partitioned by spec (bucket(16, a)) stored by iceberg; alter table ptest set partition spec (bucket(16, a), bucket(2, b)); describe default.ptest.partitions; +---------------+--------------------------------------+----------+ | col_name | data_type | comment | +---------------+--------------------------------------+----------+ | partition | struct<a_bucket:int,b_bucket_2:int> | | | record_count | bigint | | | file_count | int | | | spec_id | int | | +---------------+--------------------------------------+----------+ ``` doesn't look too nice, does it? But I think we're still early enough to unify these names in Hive. Perhaps we could also annotate the current PartitionSpec$Builder as deprecated and create a new version of it, that will already be the unified implementation, and hook unit tests, and everything else within Iceberg codebase to that. Then we can give 1-2 release worth of time before we remove the original implementation, so at least we won't cause an immediate backward incompatibility problem. Do let me know your thoughts. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
