Github user mateiz commented on the pull request:
https://github.com/apache/spark/pull/9214#issuecomment-150708764
Hey, so I'm curious about two things here:
1) If we just always replaced the output with a new one using a file
rename, would we actually have a problem? I think that any thread that has a
file open will still be reading from the old version of the file if you do a
rename. You should double-check this, but I don't think it will switch
mid-file. That might mean the "last task wins" strategy works.
2) Otherwise, what I would do is store the status in a separate file,
similar to the .index file we have for sort-based shuffle. There's no memory
overhead and it's easy to read it back again when we're given a map task and we
see that an output block for it already exists.
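The rename behavior in point 1 is standard POSIX semantics: `rename()` atomically replaces the directory entry, while any process that already has the file open keeps reading the old inode, so a reader never switches mid-file. A minimal Python sketch of that behavior (not Spark code; file names are made up):

```python
import os
import tempfile

# A "map task" writes its shuffle output.
d = tempfile.mkdtemp()
path = os.path.join(d, "shuffle_0_0.data")
with open(path, "w") as f:
    f.write("old output")

reader = open(path)  # a reducer opens the file before re-execution

# A re-executed task writes to a temp file and renames it over the
# original -- atomic on POSIX filesystems ("last task wins").
tmp = path + ".tmp"
with open(tmp, "w") as f:
    f.write("new output")
os.rename(tmp, path)

old_view = reader.read()       # still "old output": the open file
reader.close()                 # descriptor references the old inode
new_view = open(path).read()   # "new output": new opens see the rename
print(old_view, new_view)
```

So an in-flight read finishes consistently against whichever version it opened, which is what makes the "last task wins" strategy plausible.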
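For point 2, a hypothetical sketch of a sort-based-shuffle-style index file: cumulative byte offsets of each partition in the data file, stored as big-endian longs with a leading zero. The function names and exact layout here are illustrative assumptions, not Spark's actual implementation:

```python
import os
import struct
import tempfile

def write_index(path, partition_lengths):
    # Store cumulative offsets: [0, len0, len0+len1, ...]
    offsets = [0]
    for n in partition_lengths:
        offsets.append(offsets[-1] + n)
    with open(path, "wb") as f:
        for off in offsets:
            f.write(struct.pack(">q", off))  # 8-byte big-endian long

def read_index(path):
    with open(path, "rb") as f:
        data = f.read()
    offsets = [struct.unpack_from(">q", data, i)[0]
               for i in range(0, len(data), 8)]
    # Recover per-partition lengths from consecutive offsets.
    return [b - a for a, b in zip(offsets, offsets[1:])]

d = tempfile.mkdtemp()
idx = os.path.join(d, "shuffle_0_0.index")
write_index(idx, [10, 0, 25])
print(read_index(idx))  # [10, 0, 25]
```

Reading the status back is a single small sequential read, so there is no per-task memory overhead to keep.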
Regarding shuffle files getting corrupted somehow, I think this is super
unlikely and I haven't seen many systems try to defend against this. If this
were an issue, we'd also have to worry about data cached with DISK_ONLY being
corrupted, etc. I think this is considered in systems like HDFS because they
store a huge amount of data for a very long time, but I don't think it's a
major problem in Spark, and we can always add checksums later if we see it
happen.