steveloughran commented on issue #25795: [WIP][SPARK-29037][Core] Spark gives duplicate result when an application was killed
URL: https://github.com/apache/spark/pull/25795#issuecomment-533252462

@cloud-fan wrote

> Seems it's very complicated to make Spark support concurrent and object-store friendly files writing.

Mostly it's about S3, that being the hard one to work with. GCS is consistent; file rename is O(1); directory rename is atomic and O(files). Azure Datalake Gen2 (the ABFS connector) works exactly like a real filesystem. But they all have latency issues, directory listing overheads, and the other little costs that make switching to them tricky. S3 is really a fault injection framework for all your assumptions about what a store is.

> We should ask users to try object-store-first Data structures such as Delta and Iceberg.

+1

What those object-store-first data structures promise is not just correctness in the presence of that fault injection layer; they also scale well in a world where enumerating files under a directory tree is no longer a low-to-medium-cost operation.

w.r.t. dynamic partition updates, for AWS S3 the best thing to do may be to say "don't"; at least not this way. But you still have to get it right everywhere else.

> Spark should at least guarantee data correctness.

I concur. Now, one thing stepping through the FileOutputFormat code in a debugger while taking notes showed me is that the critical commit algorithms everyone relies on are barely documented and under-tested when it comes to all their failure paths. And once you realise that the v2 commit algorithm doesn't meet the expectations of MR or Spark, you worry about the correctness of any application which has used it. Now seems a good time not just to get the dynamic partition overwrite code to work, but to specify that algorithm well enough to be able to convince others of its correctness on a consistent file system.
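To make the v1/v2 distinction concrete, here is a toy model (plain dicts standing in for directories, not actual Hadoop code) of the failure mode: v2 task commit promotes files straight into the destination, so a job that fails after one task committed leaves partial output visible, while v1 stages in a job attempt directory and only promotes at job commit.

```python
# Toy model of FileOutputCommitter v1 vs v2 semantics.
# "dest" is the final output directory; "pending" is a v1-style
# job attempt staging area. All names here are illustrative.

dest = {}     # final directory: filename -> contents
pending = {}  # v1 job attempt directory

def v2_task_commit(task_files):
    # v2: task commit renames task output directly into the destination.
    dest.update(task_files)

def v1_task_commit(task_files):
    # v1: task commit only promotes into the job attempt directory.
    pending.update(task_files)

def v1_job_commit():
    # v1: one promotion step at job commit makes all output visible.
    dest.update(pending)
    pending.clear()

# A two-task job that fails after task 1 commits, before job commit:
v2_task_commit({"part-0001": "rows-a"})
# v2: partial output is already visible to readers of dest.
assert dest == {"part-0001": "rows-a"}

dest.clear()
v1_task_commit({"part-0001": "rows-a"})
# v1: the job failed before job commit, so the destination stays clean.
assert dest == {}
```

This is of course only the consistent-filesystem half of the story; on S3 the v1 job-commit rename is itself non-atomic, which is where the real trouble starts.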
I did try to start with a TLA+ spec of what S3 did, but after Lamport told me I misunderstood the nature of "eventually" I gave up. I hear nice things about Isabelle/HOL these days -maybe someone sufficiently motivated could try to use it.

BTW, if you haven't read it, look at [Stocator: A High Performance Object Store Connector for Spark](https://arxiv.org/pdf/1709.01812) by @gilv et al, who use task ID information in the file names to know what to roll back in the presence of failure. That is: rather than trying to get commit right, they get abort right.

(returns to debugging a console log where, somehow, cached 404s in the S3 load balancer are still causing problems despite HADOOP-16490 fixing it)
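The Stocator idea above can be sketched in a few lines: embed the task attempt ID in each object's name, so cleanup after a failure is just "delete everything written by the failed attempt", with no atomic commit needed. The naming scheme and helpers here are hypothetical, not the actual Stocator layout.

```python
# Sketch of attempt-ID-based rollback: getting abort right instead
# of commit right. "store" stands in for an object store.

store = {}  # object store: key -> data

def write_part(attempt_id, part, data):
    # Each object's name records which task attempt wrote it.
    store[f"part-{part}-attempt-{attempt_id}"] = data

def abort_attempt(attempt_id):
    # Roll back a failed attempt by deleting everything it wrote.
    suffix = f"-attempt-{attempt_id}"
    for key in [k for k in store if k.endswith(suffix)]:
        del store[key]

write_part("a1", "0000", "rows")  # attempt a1 fails mid-job
write_part("a2", "0000", "rows")  # retry a2 succeeds
abort_attempt("a1")
assert list(store) == ["part-0000-attempt-a2"]
```

Readers (or a final manifest step) then only need a rule for which attempt's files count as the successful ones.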
