----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/4768/#review7179 -----------------------------------------------------------
There are a lot of comments here, but hopefully it will help make this MPI on Mesos very maintainable in the future! Thanks so much Harvey! frameworks/mpi/README.txt <https://reviews.apache.org/r/4768/#comment15854> Please kill all whitespace in this review. frameworks/mpi/README.txt <https://reviews.apache.org/r/4768/#comment15885> I'd prefer if we just had everyone do 'make', since that should build the Python dependencies (including protobuf). frameworks/mpi/README.txt <https://reviews.apache.org/r/4768/#comment15864> Does nmpiexec work out of the box without changes (not included in this review). Also, let's change the names from "nmpiexec*" to "mpiexec-mesos*"! ;) Finally, just pass host:port, no need for the 'master@' prefix. frameworks/mpi/nmpiexec.py <https://reviews.apache.org/r/4768/#comment15883> This default isn't the same as what gets printed out from --help. Probably makes sense to kill these here and just put the value down in the add_option call (like you do for --num and TOTAL_TASKS). frameworks/mpi/nmpiexec.py <https://reviews.apache.org/r/4768/#comment15884> Optional path. frameworks/mpi/nmpiexec.py <https://reviews.apache.org/r/4768/#comment15855> No need to take driver as an argument anymore. frameworks/mpi/nmpiexec.py <https://reviews.apache.org/r/4768/#comment15856> I know this wasn't the style of the codebase you've inherited, but I'd like spaces around all operators please. For example, this line should read: print "Got " + str(TOTAL_TASKS) + " mpd slots, running mpiexec" It looks like you've already done this with some of the code you've added (which is awesome!), but please clean up all the code. Thanks! frameworks/mpi/nmpiexec.py <https://reviews.apache.org/r/4768/#comment15863> It would be great to give this a real name, e.g., MPIScheduler. frameworks/mpi/nmpiexec.py <https://reviews.apache.org/r/4768/#comment15880> How about s/tasksLaunched/mpdsLaunched frameworks/mpi/nmpiexec.py <https://reviews.apache.org/r/4768/#comment15839> No longer used, kill please. frameworks/mpi/nmpiexec.py <https://reviews.apache.org/r/4768/#comment15840> No longer use, kill please. frameworks/mpi/nmpiexec.py <https://reviews.apache.org/r/4768/#comment15843> s/Rejecting/Declining frameworks/mpi/nmpiexec.py <https://reviews.apache.org/r/4768/#comment15841> Use driver.declineOffer please. frameworks/mpi/nmpiexec.py <https://reviews.apache.org/r/4768/#comment15842> s/slot/resources frameworks/mpi/nmpiexec.py <https://reviews.apache.org/r/4768/#comment15887> Kill this line (or alternatively add the offer.id.value up on line 87). frameworks/mpi/nmpiexec.py <https://reviews.apache.org/r/4768/#comment15888> s/r/resource frameworks/mpi/nmpiexec.py <https://reviews.apache.org/r/4768/#comment15844> s/Rejecting slot/Declining offer Also, why not do driver.declineOffer right here? frameworks/mpi/nmpiexec.py <https://reviews.apache.org/r/4768/#comment15879> s/slot/offer frameworks/mpi/nmpiexec.py <https://reviews.apache.org/r/4768/#comment15889> How about: print "Launching mpd " + tid + " on host " + offer.hostname frameworks/mpi/nmpiexec.py <https://reviews.apache.org/r/4768/#comment15877> s/Rejecting slot/Declining offer frameworks/mpi/nmpiexec.py <https://reviews.apache.org/r/4768/#comment15878> Please use driver.declineOffer. frameworks/mpi/nmpiexec.py <https://reviews.apache.org/r/4768/#comment15852> Since mpderr is unused, how about instead: mpdtraceout, _ = mpdtraceproc.communicate() or mpdtraceout = mpdtraceproc.communicate()[0] frameworks/mpi/nmpiexec.py <https://reviews.apache.org/r/4768/#comment15853> No need for the intermediate 'count'. frameworks/mpi/nmpiexec.py <https://reviews.apache.org/r/4768/#comment15868> s-slots/mpd:s-mpd's frameworks/mpi/nmpiexec.py <https://reviews.apache.org/r/4768/#comment15869> s/slot/mpd Also, do we want to set the '--ncpus' option on the actual mpd that we launch on a Mesos slave? frameworks/mpi/nmpiexec.py <https://reviews.apache.org/r/4768/#comment15870> s/slot/mpd Same as above, is there something we can/should set when we launch the mpd? frameworks/mpi/nmpiexec.py <https://reviews.apache.org/r/4768/#comment15875> I'm not really sure how this can be used: the user running this script will not know what machines they might run on, so they can't possibly know which IP addresses they want to use on those machines. Maybe Jessica J. had something else in mind here? It definitely makes sense to keep --ifhn for the master. frameworks/mpi/nmpiexec.py <https://reviews.apache.org/r/4768/#comment15882> What about s/TOTAL_TASKS/TOTAL_MPDS frameworks/mpi/nmpiexec.py <https://reviews.apache.org/r/4768/#comment15862> It looks like you assume that path ends in a '/'. You should probably check this here. frameworks/mpi/nmpiexec.py <https://reviews.apache.org/r/4768/#comment15850> s/mesos/Mesos frameworks/mpi/nmpiexec.py <https://reviews.apache.org/r/4768/#comment15861> I'd like to make this whole thing simpler. In particular, I don't see any need for an executor here (i.e., the startmpd.py script). We actually have a 1-1 mapping from tasks to mpd's, so let's just have our TaskInfo's have a CommandInfo which just launches the mpd. Something like this for that CommandInfo's value: ...command.value = "${MPICH2PATH}mpd -n --host=${MPD_MASTER_IP} --port=${MPD_MASTER_PORT}" (Note your code, as does what I have above, assumes that MPICH2PATH includes the trailing '/'. Also, you might need to change the string based on if you're passing --ifhn. Finally, note I used the long options --host and --port. This is for readability for other people that might not know mpd very well. Likewise, if there is a long option for -n it would be great to use that instead.) Of course, this will require setting the command's environment variables appropriately. But, this should let us completely eliminate the startmpd.py script!!!!! Yeah! Less things to maintain! frameworks/mpi/nmpiexec.py <https://reviews.apache.org/r/4768/#comment15858> s/executor/executor. frameworks/mpi/nmpiexec.py <https://reviews.apache.org/r/4768/#comment15859> Why the indentation? And add a period at the end of the sentence please. frameworks/mpi/nmpiexec.py <https://reviews.apache.org/r/4768/#comment15867> Did someone actually ask for this feature? frameworks/mpi/nmpiexec.py <https://reviews.apache.org/r/4768/#comment15857> Kill extra space after 'args[0]'. - Benjamin On 2012-04-21 05:08:47, Harvey Feng wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/4768/ > ----------------------------------------------------------- > > (Updated 2012-04-21 05:08:47) > > > Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica. > > > Summary > ------- > > Some updates to point out: > > -nmpiexec.py > -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved > 'driver.stop()' to statusUpdate() so that it stops when all tasks have been > finished, which occurs when the executor's launched mpd processes have all > exited. > -startmpd.py > -> Didn't remove cleanup(), and added code in shutdown() that manually > kills mpd processes. They might be useful during abnormal (cleanup) and > normal (shutdown) framework/executor termination...I think. cleanup() still > terminates all mpd's in the slave, but shutdown doesn't. > -> killtask() stops the mpd associated with the given tid. > -> Task states update nicely now. They correspond to the state of a task's > associated mpd process. > -Readme > -> Included additional info on how to setup and run MPICH2 1.2 and nmpiexec > on OS X and Ubuntu/Linux > > > This addresses bug MESOS-183. > https://issues.apache.org/jira/browse/MESOS-183 > > > Diffs > ----- > > frameworks/mpi/README.txt cdb4553 > frameworks/mpi/nmpiexec.py a5db9c0 > frameworks/mpi/startmpd.py 8eeba5e > > Diff: https://reviews.apache.org/r/4768/diff > > > Testing > ------- > > > Thanks, > > Harvey > >
