[GitHub] [iceberg] szlta commented on a change in pull request #3411: Fixes read metadata failed after dropped partition

GitBox Thu, 10 Mar 2022 05:00:48 -0800


szlta commented on a change in pull request #3411:
URL: https://github.com/apache/iceberg/pull/3411#discussion_r823692312




##########
File path: core/src/main/java/org/apache/iceberg/Partitioning.java
##########
@@ -210,7 +211,8 @@ public static StructType partitionType(Table table) {
     }
 
     Map<Integer, PartitionField> fieldMap = Maps.newHashMap();
-    List<NestedField> structFields = Lists.newArrayList();
+    Map<Integer, Type> typeMap = Maps.newHashMap();
+    Map<Integer, String> nameMap = Maps.newHashMap();

Review comment:
       What about partition transforms where the transform itself could change, 
but the field name remains the same?
   E.g.: `bucket(data, 10)` would be `data_bucket`, and after dropping this 
from spec and re-adding with a new spec `bucket(data, 8), id`, then we would 
have `1000: data_bucket, 1001: data_bucket, 1002: id `where the two 
`data_bucket` fields describe different things. I guess this is similar for 
truncate.
   
   A partition in the old spec where `data_bucket = 0` would be different from 
a partition in the new spec where `data_bucket = 0`
   
   I agree with @rdblue that it's weird to have `category_r1003` in queries, 
but if we want to avoid it, I see two ways of proceeding from metadata query 
perspective:
   A metadata table should..
   1. ..either give back partition information as per the latest spec only..
   2. ..or we could combine the data returned for similarly named partition 
fields cascaded into just one field, but extend such tables with `spec_id` 
information in these cases.
   
   So as per above example use case the `partitions` table would look like this 
for the two cases (with 1 partition in each spec):
   
   ```
   1:
   +---------------------------------+--------------------+------------------+
   |         test.partition          | test.record_count  | test.file_count  |
   +---------------------------------+--------------------+------------------+
   | {data_bucket : 0, id : 1}       |          1         |        1         |
   +---------------------------------+--------------------+------------------+
   
   2:
   
+---------------------------------+--------------------+------------------+--------------+
   |         test.partition          | test.record_count  | test.file_count  | 
test.spec_id |
   
+---------------------------------+--------------------+------------------+--------------+
   | {data_bucket : 0, id : null}    |          1         |        1         |  
     0      |
   | {data_bucket : 0, id : 1}       |          1         |        1         |  
     1      |
   
+---------------------------------+--------------------+------------------+--------------+
   ```
   Note this is how it would be for V2 tables. For V1, due to the renaming 
we're already doing, the renamed fields would be present too (unless we aim at 
changing that too):
   ```
   1:
   +---------------------------------+--------------------+------------------+
   |         test.partition          | test.record_count  | test.file_count  |
   +---------------------------------+--------------------+------------------+
   | {data_bucket : 0, id : 1}       |          1         |        1         |
   +---------------------------------+--------------------+------------------+
   
   2:
   
+----------------------------------------------------------+--------------------+------------------+--------------+
   |         test.partition                                   | 
test.record_count  | test.file_count  | test.spec_id |
   
+----------------------------------------------------------+--------------------+------------------+--------------+
   | {data_bucket_1000 : 0,    data_bucket : null, id : null} |          1      
   |        1         |       0      |
   | {data_bucket_1000 : null, data_bucket: 0,     id : 1}    |          1      
   |        1         |       1      |
   
+----------------------------------------------------------+--------------------+------------------+--------------+
   ```
   
   I'm personally more for solution 2, where we don't have to omit the old 
partitions but at the same time we get a nice and coherent partition info 
result. It's kind of in league what I'm proposing in issue #4292 (we could also 
continue the discussion of this problem there, I didn't mean to hijack 
@ConeyLiu 's PR like this)
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] szlta commented on a change in pull request #3411: Fixes read metadata failed after dropped partition

Reply via email to