It's not clear if we still want to ensure GC events are ordered globally. We might want to break that order if we have per-item constraints to deal with (e.g. link count).
Can you elaborate on the delay approach you're mentioning? On Wed, Mar 4, 2015 at 9:30 AM, Ritwik <[email protected]> wrote: > Hi Ben, > > I had already addressed all the comments. I was trying to figure out a way > to reproduce the issue with a failing test before posting a new diff. I am > sorry for the delay since it took me some time to figure out concepts like > containerization etc. which were used in the test corresponding to disk > usage. > > I understand the point you are trying to make. However, this contradicts a > statement which you made earlier. If we want the number of directories to > affect the garbage collection "locally" then the only way is to re-order > the sequence of directories which were scheduled for GC by making their > delay a function of the number of executor directories. I will post a > detailed solution (if I am able to figure out one) on the ticket before > delving into the code next time. > > In the meanwhile, I will take up simpler issues to solve. > > Thanks for all the help and your time. > > Best Regards, > > On 4 March 2015 at 08:06, Benjamin Mahler <[email protected]> > wrote: > >> Hey Ritwik, please hold off dealing with review comments. The current >> approach here still seems to be not quite what we want. >> >> In the current patch, you're speeding up garbage collection "globally" >> when an individual directory has too many child directories. This approach >> works fine with "disk usage" as it is a global constraint we must deal >> with. However, the "link count" is a per-directory constraint. This means >> we want to be able to purge on a per-directory basis rather than globally. >> >> I apologize that I haven't had time to provide specific implementation >> guidance and it's looking like this isn't a "newbie" ticket after all. >> Bernd is pitching some time in here to help. In the meantime you might want >> to pick up some other tickets that will require less shepherding from us! >> >> On Sat, Feb 28, 2015 at 4:54 AM, Bernd Mathiske <[email protected]> >> wrote: >> >>> This is an automatically generated e-mail. To reply, visit: >>> https://reviews.apache.org/r/30014/ >>> >>> First pass. >>> >>> >>> src/slave/constants.hpp >>> <https://reviews.apache.org/r/30014/diff/5/?file=880941#file880941line64> >>> (Diff >>> revision 5) >>> >>> extern const Duration REGISTRATION_BACKOFF_FACTOR; >>> >>> 64 >>> >>> extern const double HARD_LINKS_THRESHOLD; >>> >>> The other constants have a comment. Why not this one? >>> >>> >>> src/slave/flags.hpp >>> <https://reviews.apache.org/r/30014/diff/5/?file=880943#file880943line158> >>> (Diff >>> revision 5) >>> >>> public: >>> >>> 158 >>> >>> " 0.0, min((1.0 - gc_disk_headroom - disk usage),\n" >>> >>> Danger: is "-" left-associative here or right-associative? Better not to >>> make any assumptions and put extra parentheses. >>> >>> It seems to me that the first expression is meant to be left-associative >>> and the second right-associative. See more discussion why this is so below. >>> >>> Where is max_framework_link_usage defined? This comes out of context here. >>> >>> >>> src/slave/flags.hpp >>> <https://reviews.apache.org/r/30014/diff/5/?file=880943#file880943line170> >>> (Diff >>> revision 5) >>> >>> public: >>> >>> 170 >>> >>> " 0.0, min((1.0 - gc_disk_headroom - disk usage),\n" >>> >>> Same as above. >>> >>> >>> src/slave/slave.hpp >>> <https://reviews.apache.org/r/30014/diff/5/?file=880946#file880946line237> >>> (Diff >>> revision 5) >>> >>> public: >>> >>> 236 >>> >>> // Garbage collects the directories based on the current disk usage. >>> >>> 237 >>> >>> // Garbage collects the directories based on the current disk usage and >>> number >>> >>> It is probably worth mentioning here that you mean the maximum current >>> number of executor directories in any active framework. >>> >>> While you are changing this sentence here, could you maybe also improve on >>> the phrasing "the directories"? Using the definite article "the" without >>> referring to anything in context makes it unclear to the reader what >>> directories are meant. >>> >>> >>> src/slave/slave.hpp >>> <https://reviews.apache.org/r/30014/diff/5/?file=880946#file880946line242> >>> (Diff >>> revision 5) >>> >>> public: >>> >>> 240 >>> >>> void _checkDiskUsage(const process::Future<double>& usage); >>> >>> 242 >>> >>> void _checkGCParameters(const process::Future<GCParameters>& gc_params); >>> >>> >>> 1. >>> >>> We don't use snake case, we use camel case for variable naming. Please >>> also correct all other variables in this review. >>> 2. >>> >>> No need to abbreviate "parameters" to "params". >>> >>> >>> src/slave/slave.hpp >>> <https://reviews.apache.org/r/30014/diff/5/?file=880946#file880946line639> >>> (Diff >>> revision 5) >>> >>> 639 >>> >>> // Getters >>> >>> Redundant comment. >>> >>> >>> src/slave/slave.hpp >>> <https://reviews.apache.org/r/30014/diff/5/?file=880946#file880946line640> >>> (Diff >>> revision 5) >>> >>> 640 >>> >>> double getDiskUsage() const; >>> >>> Why use getters and setters here at all? What is the benefit if all the >>> setters are public? >>> >>> >>> src/slave/slave.hpp >>> <https://reviews.apache.org/r/30014/diff/5/?file=880946#file880946line643> >>> (Diff >>> revision 5) >>> >>> 643 >>> >>> // Setters >>> >>> Redundant comment. >>> >>> >>> src/slave/slave.hpp >>> <https://reviews.apache.org/r/30014/diff/5/?file=880946#file880946line648> >>> (Diff >>> revision 5) >>> >>> 648 >>> >>> // 'disk_usage' denotes a fraction of the disk that is currently being >>> used >>> >>> "a fraction"? What other fractions are there? Do you mean "the fraction"? >>> >>> >>> src/slave/slave.hpp >>> <https://reviews.apache.org/r/30014/diff/5/?file=880946#file880946line649> >>> (Diff >>> revision 5) >>> >>> 649 >>> >>> // by the slave process. This usage is defined as a fraction of a limit >>> >>> "This usage is defined as a fraction of a limit specified by a flag." >>> >>> This is too nebulous. Suggestion: reference/point to a place with the >>> actual definition. >>> >>> >>> src/slave/slave.hpp >>> <https://reviews.apache.org/r/30014/diff/5/?file=880946#file880946line652> >>> (Diff >>> revision 5) >>> >>> 652 >>> >>> // expressed as a fraction of the limit on the maximum number of >>> >>> "the maximum number of subdirectories" >>> >>> Please add " in any given parent directory" to be extra clear. Otherwise >>> one could possibly read this mistakenly as: >>> >>> "the maximum number of subdirectories anywhere in a disk volume". >>> >>> >>> src/slave/slave.hpp >>> <https://reviews.apache.org/r/30014/diff/5/?file=880946#file880946line656> >>> (Diff >>> revision 5) >>> >>> 656 >>> >>> Try<double> disk_usage, max_link_usage; >>> >>> >>> 1. >>> >>> Please declare these two members separately, with separate comments. >>> 2. >>> >>> "max_link_usage"? This sounds like a limit, not like a current gauge >>> which it is. Suggestion: "topLinkUsage". >>> >>> >>> src/slave/slave.cpp >>> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3596> >>> (Diff >>> revision 5) >>> >>> void Slave::registerExecutorTimeout( >>> >>> 3596 >>> >>> 0.0, std::min((1.0 - flags.gc_disk_headroom - >>> gc_params.getDiskUsage()), >>> >>> The indentation in the next line makes this hard to read. I'd say line >>> this up this way: >>> >>> return flags.gc_delay * std::max( >>> 0.0, std::min((1.0 - ...), >>> (1.0 - ...; >>> >>> Even better: use variables with good descriptive naming for intermediate >>> expression results. Or break out the formluae into small functions, even if >>> they do not occur in multiple places. >>> >>> >>> src/slave/slave.cpp >>> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3597> >>> (Diff >>> revision 5) >>> >>> void Slave::registerExecutorTimeout( >>> >>> 3597 >>> >>> (1.0 - flags.gc_disk_headroom - gc_params.getMaxLinkUsage()))); >>> >>> >>> 1. This can't be right. Subtracting link usage from disk head room? >>> >>> Such a glitch is exactly why I would like to suggest writing a unit test >>> later on. Otherwise we may have this feature implemented, but it may not be >>> working at all. >>> >>> 1. Left or right-associative use of "-" intended here? Suggestion: use >>> extra parentheses! It looks like left-associative is correct here, but it >>> won't be once you have replaced gc_disk_headroom with hard_links_threshold >>> (hardLinksThreshold)... Then you have >>> >>> (1.0 - 0.8) - maxLinkUsage // seems wrong >>> >>> Proposal: Make these definitions more parallel in meaning. Use a headroom >>> concept for hard links as well or change disk headroom to a disk usage >>> threshold. I'd opt for the former, since it would touch less code. >>> >>> >>> src/slave/slave.cpp >>> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3637> >>> (Diff >>> revision 5) >>> >>> 3637 >>> >>> GCParameters params; >>> >>> params -> parameters >>> >>> >>> src/slave/slave.cpp >>> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3638> >>> (Diff >>> revision 5) >>> >>> 3638 >>> >>> params.setDiskUsage(fs::usage(flags.work_dir)); >>> >>> insert blank line above and below this line >>> >>> >>> src/slave/slave.cpp >>> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3640> >>> (Diff >>> revision 5) >>> >>> 3640 >>> >>> std::string executors_parent_path = paths::getExecutorsParentPath( >>> >>> std::string -> string >>> >>> >>> src/slave/slave.cpp >>> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3642> >>> (Diff >>> revision 5) >>> >>> 3642 >>> >>> struct stat s; >>> >>> insert blank line above this one and swap the two declarations, because >>> lstat uses s, not limit. >>> >>> >>> src/slave/slave.cpp >>> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3645> >>> (Diff >>> revision 5) >>> >>> 3645 >>> >>> params.setMaxLinkUsage(ErrnoError( >>> >>> Hm. Is it really the best strategy to report nothing at all just because >>> we are unsure about one single framweork? >>> >>> >>> src/slave/slave.cpp >>> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3650> >>> (Diff >>> revision 5) >>> >>> 3650 >>> >>> limit = ::pathconf(executors_parent_path.c_str(), _PC_LINK_MAX); >>> >>> Why not declare "limit" here? Why declare it in an outer scope and >>> needlessly initialize it there? >>> >>> >>> src/slave/slave.cpp >>> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3651> >>> (Diff >>> revision 5) >>> >>> 3651 >>> >>> if (limit < 0 && errno != 0) { >>> >>> What about "limit < 0" and "errno == 0"? This is a possible response from >>> pathconf. Then you keep going, dividing by zero in line 3656. >>> >>> Did you mean this? >>> >>> "limit < 0 || errno != 0" >>> >>> >>> src/slave/slave.cpp >>> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3656> >>> (Diff >>> revision 5) >>> >>> 3656 >>> >>> double link_usage_fraction = 1.0 * s.st_nlink / limit; >>> >>> What if we are on a system where the limit gets ignored and mkdir still >>> works and GC for some reason has not caught up yet? Do we get usages above >>> 1.0 then? Is this expected and handled well in therest of the code? >>> >>> >>> src/slave/slave.cpp >>> <https://reviews.apache.org/r/30014/diff/5/?file=880947#file880947line3658> >>> (Diff >>> revision 5) >>> >>> 3658 >>> >>> std::max(params.getMaxLinkUsage(), link_usage_fraction)); >>> >>> Won't this crash if we have set maxLinkUsage to Error in line 3645? >>> >>> >>> . >>> >>> - Bernd Mathiske >>> >>> On February 27th, 2015, 2:48 p.m. PST, Ritwik Yadav wrote: >>> Review request for mesos, Ben Mahler, Dominic Hamon, and Vinod Kone. >>> By Ritwik Yadav. >>> >>> *Updated Feb. 27, 2015, 2:48 p.m.* >>> *Bugs: * MESOS-391 <https://issues.apache.org/jira/browse/MESOS-391> >>> *Repository: * mesos >>> Description >>> >>> A fix to enable the slave garbage collector to take the number of hard >>> links (st_nlinks) into account when creating new directories. >>> >>> Testing >>> >>> Mesos builds sucessfully and all existing tests pass. >>> >>> Diffs >>> >>> - src/slave/constants.hpp (fd1c1aba0aa62372ab399bee5709ce81b8e92cec) >>> - src/slave/constants.cpp (2a99b1105af0e2d62b956fa96988f477937a46bd) >>> - src/slave/flags.hpp (56b25caf3901b38bdecb50310e8bcae0b114efa8) >>> - src/slave/paths.hpp (1618439d728ded347ec75317ce8dd998acd7ee94) >>> - src/slave/paths.cpp (01ea856aa2e628d4aee5fd31f7e49d147f740e8f) >>> - src/slave/slave.hpp (7b58cade9ada5f0a56b4c60c502eb907ccae33b4) >>> - src/slave/slave.cpp (9f31fa46304398e8f87b41b55d8f4cfd4aba10b9) >>> - src/tests/gc_tests.cpp (deaa6b1b6c32ae6d153229248d7d4f57caa0ebcf) >>> >>> View Diff <https://reviews.apache.org/r/30014/diff/> >>> >> >> > > > -- > *Ritwik Yadav* > > Department of Computer Science and Engineering, > Indian Institute of Technology, > Kharagpur. > > Cell: +91-9635-152346 > Twitter: @iRitwik <https://twitter.com/#%21/iRitwik> >
