GitHub user squito commented on the pull request:
https://github.com/apache/spark/pull/9214#issuecomment-151196831
Hi @mateiz, thanks for taking a look.
1) Do you know if that works reliably on all platforms? Josh had suggested
that trick as well during our brainstorm earlier, but we weren't sure if we
could rely on it. I think it doesn't work on Windows, though I haven't tested
it and might be remembering wrong. I'm definitely not an expert in this area,
so I'm happy to defer. I did try it, and it worked on my Mac at least:
https://gist.github.com/squito/222a28f04a6517aafba2
I think that with "last task wins" you'd still need a lock when opening
files for reading and writing, to make sure you don't open one task's index
file and another task's data file. (With the current code, a lot of work can
happen between opening the data file for writing and opening the index file
for writing, but that can be changed.)
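To make the locking concern concrete, here is a rough Python sketch (not Spark code) of "last task wins" with a lock around the index/data pair. Everything in it is an assumption for illustration: the helper names `commit_pair` and `read_pair`, the file names `shuffle.data` and `shuffle.index`, and the use of a single in-process lock shared by all task threads. Each attempt writes to its own temp files outside the lock, and only the two renames and the paired open are serialized.

```python
import os
import threading

# Hypothetical: one lock per shuffle output, shared by every reader and writer
# thread in the same process.
_pair_lock = threading.Lock()


def commit_pair(out_dir, attempt_id, data, index):
    """Write one attempt's data and index files, then publish both together."""
    tmp_data = os.path.join(out_dir, f"shuffle.data.{attempt_id}.tmp")
    tmp_index = os.path.join(out_dir, f"shuffle.index.{attempt_id}.tmp")
    # The slow part (writing the bytes) happens outside the lock.
    with open(tmp_data, "wb") as f:
        f.write(data)
    with open(tmp_index, "wb") as f:
        f.write(index)
    # Only the two renames are serialized, so the published pair always
    # comes from a single attempt. The last attempt to get here wins.
    with _pair_lock:
        os.replace(tmp_data, os.path.join(out_dir, "shuffle.data"))
        os.replace(tmp_index, os.path.join(out_dir, "shuffle.index"))


def read_pair(out_dir):
    """Open and read both files under the lock so they match each other."""
    with _pair_lock:
        with open(os.path.join(out_dir, "shuffle.data"), "rb") as data_f, \
                open(os.path.join(out_dir, "shuffle.index"), "rb") as index_f:
            return data_f.read(), index_f.read()
```

This version holds the lock for the whole read, which keeps it simple and avoids replacing a file that someone still has open. On POSIX systems the lock could probably be released right after both files are opened, since a rename does not disturb handles that are already open; whether that holds on Windows is exactly the open question above.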
2) Yeah, that is a great point. I will change this PR to store the map
status in a file; that'll be a straightforward fix.
I suppose you are right that if a DISK_ONLY file gets corrupted, your
entire Spark app is also doomed. But that seems unfortunate to me, not
something that we want to emulate. We already see that users are running
really long Spark apps on ever-increasing cluster sizes (consider
https://github.com/apache/spark/pull/9246). Maybe it's very uncommon, so it's
not worth re-engineering things just for that. But I do still feel that, even
though having one output per attempt is a slightly bigger change right now, the
changes are mostly plumbing, and it makes for something that is safer and much
easier to reason about.