steveloughran commented on issue #25795: [WIP][SPARK-29037][Core] Spark gives duplicate result when an application was killed
URL: https://github.com/apache/spark/pull/25795#issuecomment-533252462

@cloud-fan wrote

> Seems it's very complicated to make Spark support concurrent and object-store friendly files writing.

Mostly it's about S3, that being the hard one to work with. GCS is consistent; file rename is O(1); directory rename is atomic and O(files). Azure Datalake Gen2 (the ABFS connector) works exactly like a real filesystem. But they all have latency issues, directory listing overheads, and the other little costs that make switching to them tricky. S3 is really a fault injection framework for all your assumptions about what a store is.

> We should ask users to try object-store-first Data structures such as Delta and Iceberg.

+1

What those object-store-first data structures promise is not just correctness in the presence of that fault injection layer; they also scale well in a world where enumerating files under a directory tree is no longer a low-to-medium-cost operation.

w.r.t. dynamic partition updates, for AWS S3 the best thing to do may be to say "don't"; at least not this way. But you still have to get it right everywhere else.

> Spark should at least guarantee data correctness.

I concur. Now, one thing stepping through the FileOutputFormat code in a debugger while taking notes showed me is that the critical commit algorithms everyone relies on are barely documented and under-tested when it comes to all their failure paths. And once you realise that the v2 commit algorithm doesn't meet the expectations of MR or Spark, you worry about the correctness of any application which has used it. Now seems a good time not just to get the dynamic partition overwrite code to work, but to specify that algorithm well enough to be able to convince others of its correctness on a consistent file system.
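To make the v1/v2 distinction concrete, here is a toy model (plain dicts standing in for directories, not actual Hadoop code) of the failure mode: v2 task commit promotes files straight into the destination, so a job that fails after one task committed leaves partial output visible, while v1 stages in a job attempt directory and only promotes at job commit.

```python
# Toy model of FileOutputCommitter v1 vs v2 semantics.
# "dest" is the final output directory; "pending" is a v1-style
# job attempt staging area. All names here are illustrative.

dest = {}     # final directory: filename -> contents
pending = {}  # v1 job attempt directory

def v2_task_commit(task_files):
    # v2: task commit renames task output directly into the destination.
    dest.update(task_files)

def v1_task_commit(task_files):
    # v1: task commit only promotes into the job attempt directory.
    pending.update(task_files)

def v1_job_commit():
    # v1: one promotion step at job commit makes all output visible.
    dest.update(pending)
    pending.clear()

# A two-task job that fails after task 1 commits, before job commit:
v2_task_commit({"part-0001": "rows-a"})
# v2: partial output is already visible to readers of dest.
assert dest == {"part-0001": "rows-a"}

dest.clear()
v1_task_commit({"part-0001": "rows-a"})
# v1: the job failed before job commit, so the destination stays clean.
assert dest == {}
```

This is of course only the consistent-filesystem half of the story; on S3 the v1 job-commit rename is itself non-atomic, which is where the real trouble starts.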
I did try to start with a TLA+ spec of what S3 did, but after Lamport told me I misunderstood the nature of "eventually" I gave up. I hear nice things about Isabelle/HOL these days -maybe someone sufficiently motivated could try to use it.

BTW, if you haven't read it, look at [Stocator: A High Performance Object Store Connector for Spark](https://arxiv.org/pdf/1709.01812) by @gilv et al, who use task ID information in the file names to know what to roll back in the presence of failure. That is: rather than trying to get commit right, they get abort right.

(returns to debugging a console log where, somehow, cached 404s in the S3 load balancer are still causing problems despite HADOOP-16490 fixing it)
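The Stocator idea above can be sketched in a few lines: embed the task attempt ID in each object's name, so cleanup after a failure is just "delete everything written by the failed attempt", with no atomic commit needed. The naming scheme and helpers here are hypothetical, not the actual Stocator layout.

```python
# Sketch of attempt-ID-based rollback: getting abort right instead
# of commit right. "store" stands in for an object store.

store = {}  # object store: key -> data

def write_part(attempt_id, part, data):
    # Each object's name records which task attempt wrote it.
    store[f"part-{part}-attempt-{attempt_id}"] = data

def abort_attempt(attempt_id):
    # Roll back a failed attempt by deleting everything it wrote.
    suffix = f"-attempt-{attempt_id}"
    for key in [k for k in store if k.endswith(suffix)]:
        del store[key]

write_part("a1", "0000", "rows")  # attempt a1 fails mid-job
write_part("a2", "0000", "rows")  # retry a2 succeeds
abort_attempt("a1")
assert list(store) == ["part-0000-attempt-a2"]
```

Readers (or a final manifest step) then only need a rule for which attempt's files count as the successful ones.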
