GitHub user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/9214#issuecomment-153909914
> I think that with "last task wins" you'd still need a lock when opening
> files for reading & writing to make sure you don't open one task's index file
> and another task's data file. (a lot of work can happen between opening the
> data file for writing and opening the index file for writing with the current
> code, but that can be changed.)
What's worse, I think that this locking scheme would have to coordinate
across processes, since we'd need to make sure that the external shuffle
service acquires the proper read locks.
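To make the window concrete, here's a rough sketch of the two-step read path I'm describing (a hypothetical helper, not the actual resolver code): the offsets come from the index file, then the data file is opened separately, so a commit that lands in between would pair one attempt's index with another attempt's data.

```scala
import java.io.{File, IOException, RandomAccessFile}

object ReadPathSketch {
  // Assumed index layout: (numPartitions + 1) longs, so partition `reduceId`
  // occupies bytes [offsets(reduceId), offsets(reduceId + 1)) of the data file.
  def readPartition(indexFile: File, dataFile: File, reduceId: Int): Array[Byte] = {
    // Step 1: read this partition's [start, end) offsets from the index file.
    val idx = new RandomAccessFile(indexFile, "r")
    val (start, end) =
      try {
        idx.seek(reduceId.toLong * 8)
        (idx.readLong(), idx.readLong())
      } finally {
        idx.close()
      }

    // <-- Without a read lock shared with writers, a concurrent commit could
    //     replace the data file right here, and the offsets above would then
    //     be applied to a different attempt's output.

    // Step 2: read those bytes from the data file.
    val data = new RandomAccessFile(dataFile, "r")
    try {
      if (end < start || end > data.length()) {
        throw new IOException(s"index/data mismatch for reduce partition $reduceId")
      }
      val bytes = new Array[Byte]((end - start).toInt)
      data.seek(start)
      data.readFully(bytes)
      bytes
    } finally {
      data.close()
    }
  }
}
```

The external shuffle service serves blocks with the same index-then-data pattern, which is why, as noted above, any locking scheme would have to span processes.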
I've noticed that the current patch does not employ any locks for reading,
but I don't think that's a problem right now:
- The only case where we would need a lock is to prevent reading the
sort-shuffle index file and then having the data file replaced by a
concurrent write.
- Since sort-shuffle only creates one data file, that file will never be
overwritten once created.
Does that logic sound right?
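For what it's worth, the invariant that makes the lock-free read safe could be spelled out at commit time with something like the following sketch (made-up names, not the patch's actual code): whichever attempt commits first wins, the committed files are never replaced afterwards, so a reader can never see the data file swapped out after it has read the index.

```scala
import java.io.{File, IOException}

object CommitSketch {
  // One lock per executor process; the external shuffle service only reads
  // committed files, so under this scheme it never needs to take a lock.
  private val commitLock = new Object

  /** Move a finished attempt's temp files into place, or discard them if
    * another attempt has already committed. After this returns, the committed
    * data file is never overwritten, which is what makes lock-free reads safe. */
  def commit(tmpIndex: File, tmpData: File, index: File, data: File): Unit = {
    commitLock.synchronized {
      if (index.exists() && data.exists()) {
        // The first committer already won; drop this attempt's output.
        tmpIndex.delete()
        tmpData.delete()
      } else {
        // Clean up any partial state, then publish data before index so a
        // reader that sees the index file always finds a complete data file.
        index.delete()
        data.delete()
        if (!tmpData.renameTo(data) || !tmpIndex.renameTo(index)) {
          throw new IOException("failed to commit shuffle output")
        }
      }
    }
  }
}
```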