[ 
https://issues.apache.org/jira/browse/MESOS-183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13261056#comment-13261056
 ] 

[email protected] commented on MESOS-183:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review7179
-----------------------------------------------------------


There are a lot of comments here, but hopefully it will help make this MPI on 
Mesos very maintainable in the future! Thanks so much Harvey!


frameworks/mpi/README.txt
<https://reviews.apache.org/r/4768/#comment15854>

    Please kill all whitespace in this review.



frameworks/mpi/README.txt
<https://reviews.apache.org/r/4768/#comment15885>

    I'd prefer if we just had everyone do 'make', since that should build the 
Python dependencies (including protobuf).



frameworks/mpi/README.txt
<https://reviews.apache.org/r/4768/#comment15864>

    Does nmpiexec work out of the box without changes (not included in this 
review).
    
    Also, let's change the names from "nmpiexec*" to "mpiexec-mesos*"! ;)
    
    Finally, just pass host:port, no need for the 'master@' prefix.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15883>

    This default isn't the same as what gets printed out from --help. Probably 
makes sense to kill these here and just put the value down in the add_option 
call (like you do for --num and TOTAL_TASKS).



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15884>

    Optional path.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15855>

    No need to take driver as an argument anymore.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15856>

    I know this wasn't the style of the codebase you've inherited, but I'd like 
spaces around all operators please. For example, this line should read:
    
    print "Got " + str(TOTAL_TASKS) + " mpd slots, running mpiexec"
    
    It looks like you've already done this with some of the code you've added 
(which is awesome!), but please clean up all the code. Thanks!



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15863>

    It would be great to give this a real name, e.g., MPIScheduler.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15880>

    How about s/tasksLaunched/mpdsLaunched



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15839>

    No longer used, kill please.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15840>

    No longer use, kill please.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15843>

    s/Rejecting/Declining



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15841>

    Use driver.declineOffer please.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15842>

    s/slot/resources



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15887>

    Kill this line (or alternatively add the offer.id.value up on line 87).



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15888>

    s/r/resource



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15844>

    s/Rejecting slot/Declining offer
    
    Also, why not do driver.declineOffer right here?



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15879>

    s/slot/offer



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15889>

    How about:
    
    print "Launching mpd " + tid + " on host " + offer.hostname



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15877>

    s/Rejecting slot/Declining offer



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15878>

    Please use driver.declineOffer.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15852>

    Since mpderr is unused, how about instead:
    
    mpdtraceout, _ = mpdtraceproc.communicate()
    
    or
    
    mpdtraceout = mpdtraceproc.communicate()[0]



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15853>

    No need for the intermediate 'count'.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15868>

    s-slots/mpd:s-mpd's



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15869>

    s/slot/mpd
    
    Also, do we want to set the '--ncpus' option on the actual mpd that we 
launch on a Mesos slave?



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15870>

    s/slot/mpd
    
    Same as above, is there something we can/should set when we launch the mpd?



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15875>

    I'm not really sure how this can be used: the user running this script will 
not know what machines they might run on, so they can't possibly know which IP 
addresses they want to use on those machines. Maybe Jessica J. had something 
else in mind here?
    
    It definitely makes sense to keep --ifhn for the master.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15882>

    What about s/TOTAL_TASKS/TOTAL_MPDS



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15862>

    It looks like you assume that path ends in a '/'. You should probably check 
this here.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15850>

    s/mesos/Mesos



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15861>

    I'd like to make this whole thing simpler. In particular, I don't see any 
need for an executor here (i.e., the startmpd.py script). We actually have a 
1-1 mapping from tasks to mpd's, so let's just have our TaskInfo's have a 
CommandInfo which just launches the mpd. Something like this for that 
CommandInfo's value:
    
    ...command.value = "${MPICH2PATH}mpd -n --host=${MPD_MASTER_IP} 
--port=${MPD_MASTER_PORT}"
    
    (Note your code, as does what I have above, assumes that MPICH2PATH 
includes the trailing '/'. Also, you might need to change the string based on 
if you're passing --ifhn. Finally, note I used the long options --host and 
--port. This is for readability for other people that might not know mpd very 
well. Likewise, if there is a long option for -n it would be great to use that 
instead.)
    
    Of course, this will require setting the command's environment variables 
appropriately. But, this should let us completely eliminate the startmpd.py 
script!!!!! Yeah! Less things to maintain!



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15858>

    s/executor/executor.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15859>

    Why the indentation? And add a period at the end of the sentence please.



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15867>

    Did someone actually ask for this feature?



frameworks/mpi/nmpiexec.py
<https://reviews.apache.org/r/4768/#comment15857>

    Kill extra space after 'args[0]'.


- Benjamin


On 2012-04-21 05:08:47, Harvey Feng wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/4768/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2012-04-21 05:08:47)
bq.  
bq.  
bq.  Review request for mesos, Benjamin Hindman, Charles Reiss, and Jessica.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Some updates to point out:
bq.  
bq.  -nmpiexec.py
bq.    -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 
'driver.stop()' to statusUpdate() so that it stops when all tasks have been 
finished, which occurs when the executor's launched mpd processes have all 
exited. 
bq.  -startmpd.py
bq.    -> Didn't remove cleanup(), and added code in shutdown() that manually 
kills mpd processes. They might be useful during abnormal (cleanup) and normal 
(shutdown) framework/executor termination...I think. cleanup() still terminates 
all mpd's in the slave, but shutdown doesn't. 
bq.    -> killtask() stops the mpd associated with the given tid. 
bq.    -> Task states update nicely now. They correspond to the state of a 
task's associated mpd process.
bq.  -Readme
bq.    -> Included additional info on how to setup and run MPICH2 1.2 and 
nmpiexec on OS X and Ubuntu/Linux
bq.  
bq.  
bq.  This addresses bug MESOS-183.
bq.      https://issues.apache.org/jira/browse/MESOS-183
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    frameworks/mpi/README.txt cdb4553 
bq.    frameworks/mpi/nmpiexec.py a5db9c0 
bq.    frameworks/mpi/startmpd.py 8eeba5e 
bq.  
bq.  Diff: https://reviews.apache.org/r/4768/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Harvey
bq.  
bq.


                
> Included MPI Framework Fails to Start
> -------------------------------------
>
>                 Key: MESOS-183
>                 URL: https://issues.apache.org/jira/browse/MESOS-183
>             Project: Mesos
>          Issue Type: Bug
>          Components: documentation, framework
>         Environment: Scientific Linux Cluster
>            Reporter: Jessica J
>            Assignee: Harvey Feng 
>            Priority: Blocker
>              Labels: documentation, mpi, setup
>
> There are really two facets to this issue. The first is that no good 
> documentation exists for setting up and using the included MPI framework. The 
> second, and more important issue, is that the framework will not run. The 
> second issue is possibly related to the first in that I may not be setting it 
> up properly. 
> To test the MPI framework, by trial and error I determined I needed to run 
> python setup.py build and python setup.py install in the 
> MESOS-HOME/src/python directory. Now when I try to run nmpiexec -h, I get an 
> AttributeError, below: 
> Traceback (most recent call last):
>   File "./nmpiexec.py", line 2, in <module>
>     import mesos
>   File 
> "/usr/lib64/python2.6/site-packages/mesos-0.9.0-py2.6-linux-x86_64.egg/mesos.py",
>  line 22, in <module>
>     import _mesos
>   File 
> "/usr/lib64/python2.6/site-packages/mesos-0.9.0-py2.6-linux-x86_64.egg/mesos_pb2.py",
>  line 1286, in <module>
>     DESCRIPTOR.message_types_by_name['FrameworkID'] = _FRAMEWORKID
> AttributeError: 'FileDescriptor' object has no attribute 
> 'message_types_by_name'
> I've examined setup.py and determined that the version of protobuf it 
> includes (2.4.1) does, indeed, contain a FileDescriptor class in 
> descriptor.py that sets self.message_types_by_name, so I'm not sure what the 
> issue is. Is this a bug? Or is there a step I'm missing? Do I need to also 
> build/install protobuf?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to