kevinjqliu commented on issue #1092:
URL: https://github.com/apache/iceberg-python/issues/1092#issuecomment-4001768913

   Thanks for taking the time to look into this @qzyu999! I think this is on 
the right track. 
   
   Looking at the `rewrite_data_files` implementation in Spark, there's a lot of 
bells and whistles (probably added over time). For the pyiceberg 
implementation, it might be useful to scope the feature down as much as 
possible; just create a harness, and we can improve it over time. 
   
   What do you think about first handling the case of compacting a whole 
table? That way we don't have to deal with `filter` and matching data files. 
   
   I'm thinking something like `table.maintenance.compact()`, which will rewrite 
the table using the `REPLACE` operation. 
   For the actual data files, we can take a shortcut and just bin-pack by 
reading the table and writing it out again. This should produce the desired 
file size specified by `write.target-file-size-bytes` (which the write path 
already uses). 
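   
   To illustrate the bin-packing idea in isolation: a minimal, self-contained 
sketch that greedily groups existing data-file sizes into bins capped at the 
configured target size. The function name and signature are hypothetical, not 
pyiceberg API; in practice the write path would do this packing for us when 
we rewrite the table.

   ```python
   # Hypothetical sketch of the bin-packing step behind whole-table
   # compaction. `bin_pack` and its signature are illustrative only;
   # pyiceberg's writer already packs rows into files of roughly
   # `write.target-file-size-bytes` on the write path.

   def bin_pack(file_sizes: list[int], target_size: int) -> list[list[int]]:
       """Greedily group file sizes into bins no larger than target_size.

       A single file larger than target_size still gets its own bin.
       """
       bins: list[list[int]] = []
       current: list[int] = []
       current_total = 0
       for size in sorted(file_sizes, reverse=True):
           # Close the current bin if adding this file would overflow it.
           if current and current_total + size > target_size:
               bins.append(current)
               current, current_total = [], 0
           current.append(size)
           current_total += size
       if current:
           bins.append(current)
       return bins
   ```

   A whole-table compaction would then read the files in each bin and rewrite 
them as a single output file via a `REPLACE`-style snapshot.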
   
   WDYT? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

