Re: Review Request 30774: Fetcher Cache

Bernd Mathiske Fri, 06 Mar 2015 22:45:36 -0800


> On March 6, 2015, 2:15 p.m., Timothy Chen wrote:
> > include/mesos/fetcher/fetcher.proto, line 58
> > <https://reviews.apache.org/r/30774/diff/32/?file=887350#file887350line58>
> >
> >     It's harder to make a optional field required, but it's much easier the 
> > other way around.
> >     
> >     If we always want it to be required, I think we should make the sandbox 
> > a required field.


There was some discussion about whether this field should be required or not. 
The general idea here is that a task might be able to run without fetching 
anything into its sandbox. In this case, the framework may get away without 
naming the sandbox. But since a task always has one, we could also make it 
required. I am impartial in this choice, but I see that your argument that 
required->optional is easier has pull.


> On March 6, 2015, 2:15 p.m., Timothy Chen wrote:
> > src/hdfs/hdfs.hpp, line 82
> > <https://reviews.apache.org/r/30774/diff/32/?file=887353#file887353line82>
> >
> >     IMO this should be in another patch and we can get this commited right 
> > away.

I WAS in another patch: 30616. BenH advised to put all fetcher cache related 
patches that are not for stout or libprocess together in one patch.


> On March 6, 2015, 2:15 p.m., Timothy Chen wrote:
> > src/slave/containerizer/fetcher.hpp, line 76
> > <https://reviews.apache.org/r/30774/diff/32/?file=887358#file887358line76>
> >
> >     Are these static methods going to be used somewhere else? Quite a lot 
> > of static methods now in the header and perhaps we just need to put the 
> > implementation in the cpp file, BenH also mentioned this last time in the 
> > review meeting.

Yes, these are all used both in launcher/fetcher.cpp as well as 
containerizer/fetcher.cpp. They are factored out for consistency and easier 
maintenance. In previous fetcher implementations they were not, enjoying a 
duplicitous existence.


> On March 6, 2015, 2:15 p.m., Timothy Chen wrote:
> > src/slave/containerizer/fetcher.hpp, line 191
> > <https://reviews.apache.org/r/30774/diff/32/?file=887358#file887358line191>
> >
> >     I think there is a bit of complication with this interface, where a 
> > user has to call two things at the end of a downloading from an Entry:
> >     
> >     - unreference
> >     - complete/fail
> >     
> >     And I don't see how one ever wants to use them without each other.
> >     
> >     Why not hide unreference all together, and decrement the reference 
> > count in complete or fail?
> >     
> >     This way it's a lot less error prone, and harder to make mistakes with 
> > future changes.

unreference is not only used for new cache entries that find closure in 
completion or failure. It is primarily used for pre-existing entries that have 
downloads by concurrent fetch runs. In both cases, we need to call unreference 
at the very end of fetching in our current run. If we hid unreference in only 
one of the two cases, we'd have a bug.

How can anything be less error-prone than calling unreference on everything 
that got referenced? By hard-coupling "reference" with storing the unreference 
action in a collection. I had a version exactly like this but it was turned 
down by a reviewer, because it introduced an extra class.


> On March 6, 2015, 2:15 p.m., Timothy Chen wrote:
> > src/slave/containerizer/fetcher.cpp, line 749
> > <https://reviews.apache.org/r/30774/diff/32/?file=887359#file887359line749>
> >
> >     What's the point of this empty branch?

I like putting comments about what happens in the code path / case not taken in 
an empty branch instead of placing them in a less directly related place. This 
is much more clear IMHO.


> On March 6, 2015, 2:15 p.m., Timothy Chen wrote:
> > src/slave/containerizer/fetcher.cpp, line 1087
> > <https://reviews.apache.org/r/30774/diff/32/?file=887359#file887359line1087>
> >
> >     This seems to be very easy to make mistakes with setSpace, especially 
> > when it's not even expected to be called more than once? 
> >     
> >     I thought you said you're going to change it so it's only set once 
> > during initialization?

BenH and I decided to postpone this refactoring. It would lead to a lot of 
additional code changes for ni current semantic or general architecture changes.


> On March 6, 2015, 2:15 p.m., Timothy Chen wrote:
> > src/tests/mesos.hpp, line 703
> > <https://reviews.apache.org/r/30774/diff/32/?file=887367#file887367line703>
> >
> >     We want to have a consistent naming style across all the tests and 
> > files, and we usually in all the tests in Mesos, the "unmocked" methods 
> > tend to just have a prefix of "_", so run -> _run

If you use "_run" how do you distinguish this from a continuation of the same 
name? We cannot possibly use this naming scheme. Please either convert to mine 
or come up with a better one. I think that unmocked-something makes it very 
clear what is going on without making first readers guess or having to put 
extra comments. So I'd prefer leaving it like that.

(I had mocked a method _fetch in an earlier patch and there is also a 
continuation __fetch...)


- Bernd


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30774/#review75491
-----------------------------------------------------------


On March 6, 2015, 5:46 a.m., Bernd Mathiske wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/30774/
> -----------------------------------------------------------
> 
> (Updated March 6, 2015, 5:46 a.m.)
> 
> 
> Review request for mesos, Adam B, Benjamin Hindman, Till Toenshoff, and 
> Timothy Chen.
> 
> 
> Bugs: MESOS-2057, MESOS-2069, MESOS-2070, MESOS-2072, MESOS-2073, and 
> MESOS-2074
>     https://issues.apache.org/jira/browse/MESOS-2057
>     https://issues.apache.org/jira/browse/MESOS-2069
>     https://issues.apache.org/jira/browse/MESOS-2070
>     https://issues.apache.org/jira/browse/MESOS-2072
>     https://issues.apache.org/jira/browse/MESOS-2073
>     https://issues.apache.org/jira/browse/MESOS-2074
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Almost all of the functionality in epic MESOS-336. Downloaded files from 
> CommandInfo::URIs can now be cached in a cache directory designated by a 
> slave flag. This only happens when asked for by an extra flag in the URI and 
> is thus backwards-compatible. The cache has a size limit also given by a new 
> slave flag. Cache-resident files are evicted as necessary to make space for 
> newly fetched ones. Concurrent attempts to cache the same URI leads to only 
> one download. The fetcher program remains external for safety reasons, but is 
> now augmented with more elaborate parameters packed into a JSON object to 
> implement specific fetch actions for all of the above. Additional testing 
> includes fetching from (mock) HDFS and coverage of the new features.
> 
> 
> Diffs
> -----
> 
>   docs/fetcher-cache-internals.md PRE-CREATION 
>   docs/fetcher.md PRE-CREATION 
>   include/mesos/fetcher/fetcher.proto 
> 311af9aebc6a85dadba9dbeffcf7036b70896bcc 
>   include/mesos/mesos.proto 9df972d750ce1e4a81d2e96cc508d6f83cad2fc8 
>   src/Makefile.am d299f07d865080676ca8a550cf6005c6ab32839f 
>   src/hdfs/hdfs.hpp 968545d9af896f3e72e156484cc58135405cef6b 
>   src/launcher/fetcher.cpp 796526f59c25898ef6db2b828b0e2bb7b172ba25 
>   src/slave/constants.hpp fd1c1aba0aa62372ab399bee5709ce81b8e92cec 
>   src/slave/containerizer/docker.hpp b7bf54ac65d6c61622e485ac253513eaac2e4f88 
>   src/slave/containerizer/docker.cpp 5f4b4ce49a9523e4743e5c79da4050e6f9e29ed7 
>   src/slave/containerizer/fetcher.hpp 
> 1db0eaf002c8d0eaf4e0391858e61e0912b35829 
>   src/slave/containerizer/fetcher.cpp 
> 9e9e9d0eb6b0801d53dec3baea32a4cd4acdd5e2 
>   src/slave/containerizer/mesos/containerizer.hpp 
> ae61a0fcd19f2ba808624312401f020121baf5d4 
>   src/slave/containerizer/mesos/containerizer.cpp 
> ec4626f903d44c0911093ff763ef16ad27c418a9 
>   src/slave/flags.hpp 56b25caf3901b38bdecb50310e8bcae0b114efa8 
>   src/slave/slave.cpp a06d68032f26ccb3f786b6ea7c3a6c3c52449bd2 
>   src/tests/docker_containerizer_tests.cpp 
> 06cd3d89ecbaaac17ae6970604b21fbe29f6e887 
>   src/tests/fetcher_cache_tests.cpp PRE-CREATION 
>   src/tests/fetcher_tests.cpp 4549e6a631e2c17cec3766efaa556593eeac9a1e 
>   src/tests/mesos.hpp e91e5e484eea4587ac8f2eb9cefeab4acc9f4615 
>   src/tests/mesos.cpp c8f43d21b214e75eaac2870cbdf4f03fd18707d1 
> 
> Diff: https://reviews.apache.org/r/30774/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> --- longer Description: ---
> 
> -Replaces all other reviews for the fetcher cache except those related to 
> stout: 30006, 30033, 30034, 30036, 30037, 30039, 30124, 30173, 30614, 30616, 
> 30618, 30621, 30626. See descriptions of those. In dependency order:
> 
> 30033: Removes the fetcher env tests since these won't be needed any more 
> when the fetcher uses JSON in a single env var as a parameter. They never 
> tested anything that won't be covered by other tests anyway.
> 
> 30034: Makes the code structure of all fetcher tests the same. Instead of 
> calling the run method of the fetcher directly, calling through fetch(). Also 
> removes all uses of I/O redirection, which is not really needed for 
> debugging, and thus the next patch can refactor fetch() and run(). (The 
> latter comes in two varieties, which complicates matters without much 
> benefit.)
> 
> 30036: Extends the CommandInfo::URI protobuf with a boolean "caching" field 
> that will later cause fetcher cache actions. Also introduces the notion of a 
> cache directory to the fetcher info protobuf. And then propagates these 
> additions throughout the rest of the code base where applicable. This 
> includes passing the slave ID all the way down to the place where the cache 
> dir name is constructed.
> 
> 30037: Extends the fetcher info protobuf with "actions" (fetch directly 
> bypassing the cache, fetch through the cache, retrieve from the cache). 
> Switches the basis for dealing with uris to "items", which contain the uri, 
> the action, and potentially a cache file name. Refactors fetch() and run(), 
> so there is only one of each. Introduces about half of the actual cache 
> logic, including a hashmap of cache file objects for bookkeeping and basic 
> operations on it. 
> 
> 30039: Enables fetcher cache actions in the mesos fetcher program.
> 
> 30006: Enables concurrent downloading into the fetcher cache. Reuse of 
> download results in the cache when multiple fetcher runs occur concurrently. 
> 
> 30614: This is to ensure that all this refactoring of fetcher code has not 
> broken HDFS fetching. Adds a test that exercises the C++ code paths in Mesos 
> and mesos-fetcher related to fetching from HDFS. Uses a mock HDFS client 
> written in bash that acts just like a real "hadoop" command if used in the 
> right limited way.
> 
> 30124: Inserted fetcher cache zap upon slave startup, recovery and shutdown. 
> This implements recovery in an acceptable, yet most simple way.
> 
> 30173: Created fetcher cache tests. Adds a new test source file containing a 
> test fixture and tests to find out if the fetcher cache works with a variety 
> of settings.
> 
> 30616: Adds hdfs::du() which calls "hadoop fs -du -h" and returns a string 
> that contains the file size for the URI passed as argument. This is needed to 
> determine the size of a file on HDFS before downloading it to the fetcher 
> cache (to ensure there is enough space).
> 
> 30621: Refactored URI type separation in mesos-fetcher. Moved the URI type 
> separation code (distinguishes http, hdfs, local copying, etc.) from 
> mesos-fetcher to the fetcher process/actor, since it is going to be reused by 
> download size queries when we introduce fetcher cache management. Also 
> factored out URI validation, which will be used the same way by mesos-fetcher 
> and the fetcher process/actor.
> 
> 30626: Fetcher cache eviction. This happens when the cache does not have 
> enough space to accomodate upcoming downloads to the cache. Necessary 
> provisions included here:
> - mesos-fetcher does not run until evictions have been successful
> - Cache space is reserved while (async) waiting for eviction to succeed. If 
> it fails, the reservation gets undone.
> - Reservations can be partly from available space, partly from evictions. All 
> math included :-)
> - To find out how much space is needed, downloading has a prelude in which we 
> query the download size from the URI. This works for all URI types that 
> mesos-fetcher currently supports, including http and hdfs.
> - Size-determination requests are now synchronized, too. Only one per URI in 
> play happens.
> - There is cleanup code for all kinds of error situations. At the very end of 
> the fetch attempt, each list is processed for undoing things like space 
> reservations and eviction disabling.
> - Eviction gets disabled for URIs that are currently in use, i.e. the related 
> cache files are. We use reference counting for this, since there may be 
> concurrent fetch attempts using the same cache files.
> 
> 
> Thanks,
> 
> Bernd Mathiske
> 
>

Re: Review Request 30774: Fetcher Cache

Reply via email to