Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/1486#issuecomment-54901482
Okay the thing I said before won't work because we can't return a rich type
from `getPreferredLocations`.
So after looking some more, how about this:
1. We extend `TaskLocation` to have a boolean field called `cached`.
2. We add a simple scheme to tag in-memory locations in
`getPreferredLocations`. For now let's just keep it simple and introduce a
single type of tag. We modify the `apply` function in `TaskLocation` to parse
this tag correctly. This would be similar to the logic you have right now in
`PartitionLocation`.
3. In the `TaskSetManager`, in `addPendingTasks`, we check whether `cached`
is set to true. If it is, we look up whether we have an executor on the host
(via `sched.executorsByHost`)... if we have an executor there we add this to
the list of pending executors.
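To make steps 1 and 2 concrete, here is a rough sketch of what I have in
mind. It is not the real `TaskLocation` class: the `cached:` prefix and the
`TaskLocationSketch` wrapper are names made up for illustration, and the
actual tag format is still open.

```scala
// Hypothetical sketch only; Spark's real TaskLocation differs.
object TaskLocationSketch {
  // Assumed tag that marks an in-memory location string (format is undecided).
  val CachedTag = "cached:"

  // Step 1: a location is a host plus a boolean `cached` field.
  final case class TaskLocation(host: String, cached: Boolean)

  object TaskLocation {
    // Step 2: parse the plain strings returned by getPreferredLocations.
    // A tagged string yields cached = true; anything else is a plain host.
    def apply(str: String): TaskLocation =
      if (str.startsWith(CachedTag)) {
        new TaskLocation(str.stripPrefix(CachedTag), cached = true)
      } else {
        new TaskLocation(str, cached = false)
      }
  }
}
```

With this, `getPreferredLocations` keeps returning plain strings, and only
the parsing in `apply` needs to know about the tag.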
This defers handling other types of hierarchical storage since the `cached`
thing is hard coded in `TaskLocation`, but getting that working throughout all
of Spark requires IMO a much larger design discussion. There are many open
questions like whether we need to provide a richer type signature for
`getPreferredLocations`, how delay scheduling will work, etc.
Overall this proposal would be similar to what is there now, except you
wouldn't add a new class called `PartitionLocation` (you'd just use the
existing `TaskLocation` instead). Also, you'd do a binding in the
`TaskSetManager` to specific executors.
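The binding to specific executors could look something like the sketch
below. Everything here is a stand-in: `Loc`, `Pending`, and `addPendingTask`
are hypothetical names, the `executorsByHost` map plays the role of
`sched.executorsByHost`, and also recording the task at host level is my
assumption rather than part of the proposal.

```scala
import scala.collection.mutable

// Hypothetical sketch only; not the real TaskSetManager code.
object PendingTasksSketch {
  // Minimal stand-in for a TaskLocation carrying the `cached` flag.
  final case class Loc(host: String, cached: Boolean)

  // Stand-in for the manager's pending-task bookkeeping.
  final class Pending {
    val forExecutor = mutable.Map.empty[String, mutable.ArrayBuffer[Int]]
    val forHost = mutable.Map.empty[String, mutable.ArrayBuffer[Int]]
  }

  def addPendingTask(
      taskIndex: Int,
      locs: Seq[Loc],
      executorsByHost: Map[String, Set[String]],
      pending: Pending): Unit = {
    for (loc <- locs) {
      if (loc.cached) {
        // Bind the task to every executor currently known on that host.
        for (execId <- executorsByHost.getOrElse(loc.host, Set.empty[String])) {
          pending.forExecutor
            .getOrElseUpdate(execId, mutable.ArrayBuffer.empty[Int]) += taskIndex
        }
      }
      // Assumption: the task is still tracked per host as well.
      pending.forHost
        .getOrElseUpdate(loc.host, mutable.ArrayBuffer.empty[Int]) += taskIndex
    }
  }
}
```

If no executor is running on the host, the cached location simply falls
back to the host-level entry.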