RussellSpitzer commented on PR #15006: URL: https://github.com/apache/iceberg/pull/15006#issuecomment-3745139819
I'm mostly on board here, but I have a few concerns:

1. I think the implementation here is trying to be efficient with memory by populating the map only when a duplicate is found, but I'm not sure that really saves us much: every time we do hit a match, we have to scan the entire list of DVs again to find the DV that is the duplicate. I'd consider whether just keeping a mapping from referenced data files to delete vectors is really that much more expensive. It would make the lookup much simpler.

2. Intrinsically, we are cleaning up after something that shouldn't be happening. Given that we know fixing this behavior in Spark may take a while, or may just be impossible without further modifications, I think having a fix here is better than throwing an exception (which could be another solution). Still, it feels weird that we are rewriting things from the driver. I think this is the first time the Spark integration writes data/delete files from the driver, and that feels a little odd to me.

3. I agree with the requests for logging. This is probably really not good for performance, and we should have some notes for users that it is happening, although that's not really important now unless we have a way to avoid it from Spark.
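To illustrate the simpler alternative suggested in point 1, here is a minimal sketch of keeping an eager map from referenced data file to its DV, so a duplicate is caught in a single pass with an O(1) lookup rather than re-scanning the whole list on each match. The `Dv` record and `findDuplicates` helper are illustrative stand-ins, not Iceberg API classes:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: index DVs by their referenced data file path while
// scanning once. A second DV for the same data file is a duplicate and is
// found immediately, without re-scanning the list of DVs.
class DvIndex {
  // Illustrative stand-in for a delete vector's metadata.
  record Dv(String referencedDataFile, String path) {}

  // Returns (existing DV, duplicate DV) pairs for data files referenced twice.
  static List<Dv[]> findDuplicates(List<Dv> dvs) {
    Map<String, Dv> byDataFile = new HashMap<>(dvs.size());
    List<Dv[]> duplicates = new ArrayList<>();
    for (Dv dv : dvs) {
      Dv existing = byDataFile.putIfAbsent(dv.referencedDataFile(), dv);
      if (existing != null) {
        duplicates.add(new Dv[] {existing, dv}); // no second scan needed
      }
    }
    return duplicates;
  }
}
```

The trade-off is holding one map entry per DV for the duration of the scan, which is what the current implementation seems to be trying to avoid; the point above is that the memory cost is likely small relative to the cost of re-scanning on every match.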
