[ https://issues.apache.org/jira/browse/MESOS-336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13972159#comment-13972159 ]
Benjamin Hindman commented on MESOS-336:
----------------------------------------
Glad to see you working on this [~bernd-mesos]!
First, once we have the checksum capability, I'm not convinced we'd want to
expose a "download once" capability. In fact, it might be something we wish we
hadn't exposed in the interface (for example, what does it mean to download
once if, once we have caching, we decide to evict the file?).
Second, I'm not opposed to adding more caching support/functionality in the
slave in the future to better manage persistent state, but I'm not sure there
is much (any?) persistent state we really need for a simple alpha here (aside
from the actual downloaded files themselves), unless I'm missing something.
Here's what I was thinking:
(1) Add a flag to the slave with a path to a directory that will represent the
cache (this lets someone put it on a RAM FS if they please, but it could default
to just 'work_dir/cache').
(2) Add a flag to the slave for the total number of bytes (of URI downloads) to
cache (again, defaulted to something reasonable).
(3) Add the md5/hash in CommandInfo as suggested in MESOS-700.
(4) Pass the directory ("cache") to the fetcher when it gets invoked.
(5) Update the fetcher: if the requested file exists and matches the checksum,
"touch" it and copy it into the sandbox, extract, chmod, etc. If the file
doesn't exist, or doesn't match the checksum, download it (overwriting it when
the checksum is wrong; we'll "fix" that later), copy it into the sandbox,
extract, chmod, etc.
(6) Update the fetcher to always run a "cache eviction" before it exits that
simply deletes the least recently modified files until we're below the cache
limit.
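To make steps (3) through (5) concrete, here is a minimal Python sketch of the
fetcher's cache-hit/cache-miss path. It is not the Mesos fetcher (which is C++);
the function names `md5_of` and `fetch_with_cache`, the caller-supplied
`download(uri, dest)` callable, keying cache entries by a SHA-256 of the URI
string, and re-verifying the checksum after a download are all assumptions made
for illustration.

```python
import hashlib
import os
import shutil
from typing import Callable


def md5_of(path: str) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def fetch_with_cache(uri: str, expected_md5: str, cache_dir: str,
                     sandbox_dir: str,
                     download: Callable[[str, str], None]) -> str:
    """Place `uri` in `sandbox_dir`, reusing a cached copy if its MD5 matches.

    `download(uri, dest_path)` is a hypothetical caller-supplied function that
    fetches the URI (e.g. from HDFS) into dest_path.
    Returns the path of the file copied into the sandbox.
    """
    os.makedirs(cache_dir, exist_ok=True)
    os.makedirs(sandbox_dir, exist_ok=True)
    # Assumption: cache entries are keyed by a hash of the URI string.
    cached = os.path.join(cache_dir, hashlib.sha256(uri.encode()).hexdigest())

    if os.path.exists(cached) and md5_of(cached) == expected_md5:
        # Cache hit: "touch" the entry so eviction sees it as recently used.
        os.utime(cached, None)
    else:
        # Miss or wrong checksum: download to a temp file, then overwrite.
        tmp = cached + ".tmp"
        download(uri, tmp)
        if md5_of(tmp) != expected_md5:
            os.remove(tmp)
            raise ValueError(f"checksum mismatch for {uri}")
        os.replace(tmp, cached)

    # Copy into the sandbox; extraction and chmod would follow here.
    dest = os.path.join(sandbox_dir, os.path.basename(uri))
    shutil.copyfile(cached, dest)
    return dest
```

Passing `download` in as a parameter keeps the caching logic independent of
where the bytes come from, so the same code path works for HDFS, HTTP, or a
local test stub.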
A lot of the generic caching code that gets written for this could eventually
be moved into the slave (since it'll all be C++), but I don't see any reason
not to start with it in the fetcher for now.
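The "cache eviction" from step (6) is a good example of generic caching code
that could start in the fetcher and later move into the slave. A minimal Python
sketch, assuming the cache is a flat directory of regular files and using
modification time as the recency signal (the function name `evict` is
hypothetical):

```python
import os


def evict(cache_dir: str, limit_bytes: int) -> list:
    """Delete least recently modified files until total size <= limit_bytes.

    Returns the names of the evicted files, oldest first.
    """
    entries = []
    for name in os.listdir(cache_dir):
        path = os.path.join(cache_dir, name)
        if os.path.isfile(path):
            st = os.stat(path)
            entries.append((st.st_mtime, name, st.st_size))

    total = sum(size for _, _, size in entries)
    evicted = []
    # Oldest modification time first; ties are broken by name for determinism.
    for _mtime, name, size in sorted(entries):
        if total <= limit_bytes:
            break
        os.remove(os.path.join(cache_dir, name))
        total -= size
        evicted.append(name)
    return evicted
```

Because a cache hit "touches" the entry, files that are reused frequently keep
a recent modification time and are the last to be evicted.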
How does this sound?
> Mesos slave should cache executors
> ----------------------------------
>
> Key: MESOS-336
> URL: https://issues.apache.org/jira/browse/MESOS-336
> Project: Mesos
> Issue Type: Improvement
> Components: slave
> Reporter: brian wickman
> Assignee: Bernd Mathiske
> Labels: newbie
>
> The slave should be smarter about how it handles pulling down executors. In
> our environment, executors rarely change, but the slave will always pull them
> down from HDFS regardless. This puts undue stress on our HDFS clusters, and
> is not resilient to reduced HDFS availability.
--
This message was sent by Atlassian JIRA
(v6.2#6252)