amogh-jahagirdar commented on PR #15006:
URL: https://github.com/apache/iceberg/pull/15006#issuecomment-3726953651

   Still cleaning some stuff up, so leaving this in draft, but feel free to comment.
Basically, there are cases in Spark where a single data file can be split across
multiple tasks, and if deletes happen to touch every part of the file across those
tasks, we would incorrectly produce multiple DVs for a given data file (discovered
this recently with a user who had Spark AQE enabled).
   
    We currently throw on read in such cases, but ideally we should prevent this
on the write path.
   
   
   The reason this is done behind the API is largely defensive: from a library
perspective, if an engine/integration happens to produce multiple DVs for the
same data file, we can at least fix them up pre-commit.
   
   If there are too many DVs to reasonably rewrite on a single node, engines
could do a distributed rewrite to fix things up before handing the files off to
the API.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

