RussellSpitzer commented on PR #15006:
URL: https://github.com/apache/iceberg/pull/15006#issuecomment-3745139819

   I'm mostly on board here, but I have a few concerns:
   
   1. I think the implementation here is trying to be memory-efficient by 
populating the map only when a duplicate is found, but I'm not sure that 
actually saves us much when every match forces us to rescan the entire list of 
DVs to find the one that is the duplicate. I'd consider whether just keeping a 
mapping from referenced data files to delete vectors is really that much more 
expensive. It would make the lookup much simpler.
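   To make that alternative concrete, here is a rough sketch of the eager 
mapping (names like `DvIndex` and `putOrFindDuplicate` are illustrative, not 
actual Iceberg APIs): index each DV by the data file it references as it is 
seen, so a duplicate is detected in O(1) at insert time instead of rescanning 
the DV list on every collision.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only -- not Iceberg's actual classes or APIs.
// Eagerly index DVs by the data file they reference; a second DV for the
// same data file is by definition the duplicate, found without a rescan.
class DvIndex {
  private final Map<String, String> dvByDataFile = new HashMap<>();

  /**
   * Registers the DV for a data file. Returns the previously registered
   * DV path if one exists (i.e. this DV is a duplicate), else null.
   */
  String putOrFindDuplicate(String dataFilePath, String dvPath) {
    return dvByDataFile.putIfAbsent(dataFilePath, dvPath);
  }
}
```

   The memory cost is one map entry per referenced data file, which is paid 
up front rather than only on the first collision.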
   
   2. Intrinsically, we are cleaning up after something that shouldn't be 
happening. Given that fixing this behavior in Spark may take a while, or may 
just be impossible without further modifications, I think having a fix here is 
better than throwing an exception (which could be another solution). Still, it 
feels odd that we are rewriting things from the driver: I believe this is the 
first time the Spark integration writes data/delete files from the driver, and 
that gives me pause.
   
   3. I agree with the requests for logging. This path is probably quite bad 
for performance, so we should surface to users that it is happening, although 
that isn't all that important right now unless we have a way for users to 
avoid it from Spark.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

