szlta commented on a change in pull request #3411:
URL: https://github.com/apache/iceberg/pull/3411#discussion_r823692312
##########
File path: core/src/main/java/org/apache/iceberg/Partitioning.java
##########
@@ -210,7 +211,8 @@ public static StructType partitionType(Table table) {
}
Map<Integer, PartitionField> fieldMap = Maps.newHashMap();
- List<NestedField> structFields = Lists.newArrayList();
+ Map<Integer, Type> typeMap = Maps.newHashMap();
+ Map<Integer, String> nameMap = Maps.newHashMap();
Review comment:
What about partition transforms where the transform itself could change,
but the field name remains the same?
E.g.: `bucket(data, 10)` would be `data_bucket`, and after dropping this
from spec and re-adding with a new spec `bucket(data, 8), id`, then we would
have `1000: data_bucket, 1001: data_bucket, 1002: id `where the two
`data_bucket` fields describe different things. I guess this is similar for
truncate.
A partition in the old spec where `data_bucket = 0` would be different from
a partition in the new spec where `data_bucket = 0`
I agree with @rdblue that it's weird to have `category_r1003` in queries,
but if we want to avoid it, I see two ways of proceeding from metadata query
perspective:
A metadata table should..
1. ..either give back partition information as per the latest spec only..
2. ..or we could combine the data returned for similarly named partition
fields cascaded into just one field, but extend such tables with `spec_id`
information in these cases.
So as per above example use case the `partitions` table would look like this
for the two cases (with 1 partition in each spec):
```
1:
+---------------------------------+--------------------+------------------+
| test.partition | test.record_count | test.file_count |
+---------------------------------+--------------------+------------------+
| {data_bucket : 0, id : 1} | 1 | 1 |
+---------------------------------+--------------------+------------------+
2:
+---------------------------------+--------------------+------------------+--------------+
| test.partition | test.record_count | test.file_count |
test.spec_id |
+---------------------------------+--------------------+------------------+--------------+
| {data_bucket : 0, id : null} | 1 | 1 |
0 |
| {data_bucket : 0, id : 1} | 1 | 1 |
1 |
+---------------------------------+--------------------+------------------+--------------+
```
Note this is how it would be for V2 tables. For V1, due to the renaming
we're already doing, the renamed fields would be present too (unless we aim at
changing that too):
```
1:
+---------------------------------+--------------------+------------------+
| test.partition | test.record_count | test.file_count |
+---------------------------------+--------------------+------------------+
| {data_bucket : 0, id : 1} | 1 | 1 |
+---------------------------------+--------------------+------------------+
2:
+----------------------------------------------------------+--------------------+------------------+--------------+
| test.partition |
test.record_count | test.file_count | test.spec_id |
+----------------------------------------------------------+--------------------+------------------+--------------+
| {data_bucket_1000 : 0, data_bucket : null, id : null} | 1
| 1 | 0 |
| {data_bucket_1000 : null, data_bucket: 0, id : 1} | 1
| 1 | 1 |
+----------------------------------------------------------+--------------------+------------------+--------------+
```
I'm personally more for solution 2, where we don't have to omit the old
partitions but at the same time we get a nice and coherent partition info
result. It's kind of in league what I'm proposing in issue #4292 (we could also
continue the discussion of this problem there, I didn't mean to hijack
@ConeyLiu 's PR like this)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]