Dear all, is there a working example of Julia with SGE and MPI?

thank you, 

t.

On Friday, 25 October 2013 12:58:39 UTC+1, Theodore Papamarkou wrote:
>
> It took me a while to start using Julia on the departmental cluster, but I 
> got there. I am aware that there are already some packages for cluster 
> management, such as PTools, ClusterManagers and MATLABCluster. My preferred 
> way of dealing with the matter is via the ${PE_HOSTFILE} SGE environment 
> variable. ${PE_HOSTFILE} is created at qsub's runtime and holds the full 
> path to the machine file. So, I wrote my own awk script to process this 
> machine file, whose format may differ depending on the cluster-specific 
> configuration, so as to get a Julia readable machine file, consisting of 
> one column with the names of the nodes in it. Then I simply start julia via 
> "julia --machinefile mymachinefile.txt" from the shell wrapper that I 
> submit via qsub. I will provide an example on github in case someone likes 
> this approach.
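
A minimal sketch of the approach described above, not the author's actual script. It assumes the common SGE hostfile layout in which each line reads "hostname slots queue processor-range"; as noted, the layout can differ between clusters. The function name make_machinefile is illustrative, and writing one line per granted slot is an assumption about how many workers are wanted.

```shell
#!/bin/sh
# Sketch: turn the SGE hostfile named by $PE_HOSTFILE into a one-column
# machine file for "julia --machinefile".
# Assumption: each hostfile line looks like
#   hostname slots queue processor-range
# and one worker per granted slot is wanted.

make_machinefile() {
    # $1: path to the SGE hostfile, $2: path of the machine file to write
    awk '{ for (i = 0; i < $2; i++) print $1 }' "$1" > "$2"
}

# Only act when running inside an SGE parallel job.
if [ -n "${PE_HOSTFILE:-}" ]; then
    make_machinefile "$PE_HOSTFILE" mymachinefile.txt
    julia --machinefile mymachinefile.txt
fi
```

Such a wrapper would be submitted with qsub together with a parallel environment request (-pe followed by a site-specific environment name and the slot count).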
>
> On Saturday, 29 June 2013 08:36:06 UTC+1, Viral Shah wrote:
>>
>> Also do see this issue, which talks about abstracting various cluster 
>> schedulers into a separate package. This should make it easier to support 
>> more schedulers, and also make it easier to patch these as we go along.
>>
>> https://github.com/JuliaLang/julia/issues/3549
>>
>> -viral
>>
>> On Friday, June 28, 2013 9:40:08 PM UTC+5:30, Theodore Papamarkou wrote:
>>>
>>> Many thanks Ben, I noted your hack down to try it out and will get back 
>>> to you as soon as I do so (which will be rather soon). Anyone else's input 
>>> is always welcome.
>>>
>>> On Friday, June 28, 2013 5:04:29 PM UTC+1, Ben Lauwens wrote:
>>>>
>>>> Hello
>>>>
>>>> I did some debugging and it seems in my case that the environment 
>>>> variables are not set:
>>>>
>>>> /home/blauwens/julia/usr/bin/julia-release-basic: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.15' not found (required by /home/blauwens/julia/usr/bin/../lib/libjulia-release.so)
>>>>
>>>> Adding the -V argument to qsub
>>>>
>>>> qsub_cmd = `echo $home/julia-release-basic --worker` |> `qsub -N JULIA -V -terse -cwd -j y -o $sgedir -t 1:$n`
>>>>
>>>> and putting
>>>>
>>>> sleep(0.5)
>>>>
>>>> before the success check solves this problem, but I get another one. The
>>>> output stream file reads
>>>>
>>>> bash: module: line 1: syntax error: unexpected end of file
>>>> bash: error importing function definition for `module'
>>>> julia_worker:9009#192.168.1.226
>>>>
>>>> The connection info is the last line, but it will never be read by the 
>>>> function start_sge_workers. Here is a small hack that does the job. 
>>>>
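>>>> while !fexists
>>>>     try
>>>>         fl = open(fname)
>>>>         try
>>>>             # Skip shell noise; stop at the first line that parses as
>>>>             # connection info, or at end of file so the loop cannot spin.
>>>>             while !fexists && !eof(fl)
>>>>                 conninfo = readline(fl)
>>>>                 hostname, port = parse_connection_info(conninfo)
>>>>                 fexists = (hostname != "")
>>>>             end
>>>>         finally
>>>>             close(fl)
>>>>         end
>>>>     catch
>>>>         # The output file is not there yet; wait and retry.
>>>>         print(".")
>>>>         sleep(0.5)
>>>>     end
>>>> end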
>>>>
>>>> After these modifications,
>>>>
>>>> addprocs_sge()
>>>>
>>>> works on an HP cluster running x86_64 GNU/Linux.
>>>> Some feedback from other SGE users would be useful, and perhaps this 
>>>> hack can be merged into Julia base.
>>>>
>>>> Ben
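
The same scan-past-the-noise idea can be sketched as a shell helper, which may be handy for checking a worker's output file by hand. This is only an illustration: it assumes the connection line has the form julia_worker:PORT#HOST seen in the output above, and the names find_conninfo and wait_for_conninfo are hypothetical, not part of Julia or SGE.

```shell
#!/bin/sh
# Sketch: find the worker connection line in an SGE output file,
# ignoring unrelated lines such as bash "module" errors.
# Assumption: the line has the form julia_worker:PORT#HOST.

find_conninfo() {
    # $1: output file. Prints "HOST PORT" for the first matching line,
    # or nothing if no line matches.
    sed -n 's/^julia_worker:\([0-9][0-9]*\)#\(.*\)$/\2 \1/p' "$1" | head -n 1
}

wait_for_conninfo() {
    # $1: output file, $2: maximum number of attempts (default 20).
    # Polls once per second until the connection line shows up.
    wf_file=$1
    wf_tries=${2:-20}
    while [ "$wf_tries" -gt 0 ]; do
        if [ -r "$wf_file" ]; then
            wf_info=$(find_conninfo "$wf_file")
            if [ -n "$wf_info" ]; then
                printf '%s\n' "$wf_info"
                return 0
            fi
        fi
        wf_tries=$((wf_tries - 1))
        sleep 1
    done
    return 1
}
```

Unlike reading only the first line, the helper keeps looking until a line matches, so extra messages before the connection info do not matter.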
>>>>
>>>> On Sunday, June 16, 2013 11:34:57 PM UTC+2, Theodore Papamarkou wrote:
>>>>>
>>>>> Thanks for trying this out Kevin. I tried the same after you and got 
>>>>> the same error, although the job was queued:
>>>>>
>>>>> julia> addprocs_sge(2)
>>>>> ERROR: assertion failed: ?
>>>>>  in error at error.jl:22
>>>>>  in assert at error.jl:43
>>>>>  in success at process.jl:394
>>>>>  in all at reduce.jl:175
>>>>>  in success at process.jl:401
>>>>>  in start_sge_workers at multi.jl:941
>>>>>  in addprocs_sge at multi.jl:976
>>>>>
>>>>> $ qstat -u "ucaktpa"
>>>>> job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
>>>>> -----------------------------------------------------------------------------------------------------------------
>>>>> 9696992 0.50290 JULIA      ucaktpa      qw    06/16/2013 22:16:14                                    1 1,2
>>>>>
>>>>> I checked the line in multi.jl you mentioned, and was thinking that I 
>>>>> pass several other options to qsub, e.g. in order to allocate memory or 
>>>>> set runtime thresholds (-l h_vmem=8G,vf=8G -l h_rt=0:3:0). It may be 
>>>>> good to pass them as extra arguments to start_sge_workers(); 
>>>>> alternatively, we could pass a single argument, which could be a 
>>>>> configuration file, similar to the MATLAB sample code below:
>>>>>
>>>>> sched = findResource('scheduler', 'configuration', configuration);
>>>>>
>>>>> pjob = createParallelJob(sched);
>>>>>
>>>>> set(pjob, 'MaximumNumberOfWorkers', maxNumWorkers);
>>>>> set(pjob, 'MinimumNumberOfWorkers', minNumWorkers);
>>>>>
>>>>> I will try to trace the addprocs_sge() error message...
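
One way to picture the "extra arguments" idea is a wrapper that appends user-supplied resource requests to the fixed qsub options quoted in this thread. This is a sketch only: qsub_worker_cmd is a hypothetical helper, not part of Julia or SGE, it leaves out the -o output directory, and it prints the command line instead of submitting anything.

```shell
#!/bin/sh
# Sketch: assemble the worker submission command with optional extra
# qsub options. qsub_worker_cmd is a hypothetical helper; it prints the
# command line rather than running it, so it is safe to try anywhere.

qsub_worker_cmd() {
    # $1: number of array tasks (workers); remaining args: extra qsub options
    n=$1
    shift
    printf '%s\n' "qsub -N JULIA -terse -cwd -j y -t 1:$n $*"
}

# Example: two workers with the memory and runtime limits mentioned above.
qsub_worker_cmd 2 -l h_vmem=8G,vf=8G -l h_rt=0:3:0
```

Keeping the fixed options in one place and passing everything else through unchanged means new resource requests need no change to the helper itself.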
>>>>>
>>>>>
>>>>> On Sunday, June 16, 2013 10:05:05 PM UTC+1, Kevin Squire wrote:
>>>>>>
>>>>>> The relevant sge line in $JULIA_HOME/base/multi.jl has
>>>>>>
>>>>>> qsub_cmd = `echo $home/julia-release-basic --worker` | `qsub -N JULIA -terse -cwd -j y -o $sgedir -t 1:$n`
>>>>>>
>>>>>> So addprocs_sge() will do the qsub for you.  When I just tried it, 
>>>>>> the workers started okay, but I received an error:
>>>>>>
>>>>>> julia> addprocs_sge(2)
>>>>>> ERROR: assertion failed: ?
>>>>>>  in error at error.jl:22
>>>>>>  in assert at error.jl:43
>>>>>>  in success at process.jl:392
>>>>>>  in map at abstractarray.jl:1478
>>>>>>  in success at process.jl:394
>>>>>>  in start_sge_workers at multi.jl:1009
>>>>>>  in addprocs_sge at multi.jl:1044
>>>>>>
>>>>>> $ qstat -u "kmsquire"
>>>>>> job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
>>>>>> -----------------------------------------------------------------------------------------------------------------
>>>>>>  358164 10.50000 JULIA      kmsquire     r     06/16/2013 14:01:52 [email protected]           1 1
>>>>>>  358164 10.50000 JULIA      kmsquire     r     06/16/2013 14:01:52 [email protected]           1 2
>>>>>>
>>>>>>
>>>>>> Kevin
>>>>>>
>>>>>>
>>>>>> On Sunday, June 16, 2013 1:31:28 PM UTC-7, Theodore Papamarkou wrote:
>>>>>>>
>>>>>>> The "--machinefile" option and the blogpost on distributed numerical 
>>>>>>> optimization are potentially excellent sources to help me, thanks a 
>>>>>>> lot. I will try to make use of them and will post here once I make 
>>>>>>> some progress.
>>>>>>>
>>>>>>> On Sunday, June 16, 2013 9:12:48 PM UTC+1, [email protected] wrote:
>>>>>>>>
>>>>>>>> I haven't tried to do what you are describing yet, but I know a 
>>>>>>>> little. In SGE there should be a file named "machinefile" somewhere. 
>>>>>>>> It might be "$TMP/machinefile", but don't quote me. If you have this 
>>>>>>>> file, which contains the hostnames of the nodes, you should be able 
>>>>>>>> to pass it to julia on startup with the "--machinefile" option. An 
>>>>>>>> example of this is on the Julia blog:
>>>>>>>> http://julialang.org/blog/2013/04/distributed-numerical-optimization/
>>>>>>>>
>>>>>>>> I hope that helps a little.
>>>>>>>>
>>>>>>>>> On Sunday, June 16, 2013 5:06:03 AM UTC-5, Theodore Papamarkou wrote:
>>>>>>>>>
>>>>>>>>> I want to run a population MCMC simulation using power density 
>>>>>>>>> estimators on 50 nodes of the departmental cluster, which uses SGE. 
>>>>>>>>> Each of the 50 nodes realizes a separate MCMC chain. The question 
>>>>>>>>> generalizes to any parallel job which needs to reserve several 
>>>>>>>>> nodes. I have found two relevant posts, namely
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> https://groups.google.com/forum/?fromgroups=#!searchin/julia-users/cluster/julia-users/IATlfsu4VJU/yw1y7N_dPg0J
>>>>>>>>>
>>>>>>>>> and
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> https://groups.google.com/forum/?fromgroups=#!searchin/julia-users/cluster/julia-users/IlPuQSwtTSQ/vpGCPA27uMYJ
>>>>>>>>>
>>>>>>>>> but I haven't found a finalized set of instructions yet to achieve 
>>>>>>>>> the required result.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    1. I guess the first step would be to instruct qsub directly to 
>>>>>>>>>    reserve the required number of nodes. Typically, resource 
>>>>>>>>>    requirements are specified with the "-l" option. Is 
>>>>>>>>>    "qsub -l nodes=50" the appropriate invocation?
>>>>>>>>>    2. The next question is how the number of processes nprocs() 
>>>>>>>>>    relates to the number of reserved nodes (workers). Should I call 
>>>>>>>>>    addprocs(50)? I think this is not good enough because Julia would 
>>>>>>>>>    not know which nodes to select. addprocs({"hostname1", 
>>>>>>>>>    "hostname2", ..., "hostname50"}) seems to be the right command. 
>>>>>>>>>    Nevertheless, it is not handy to fetch the hostnames of the 
>>>>>>>>>    reserved nodes. So the natural question becomes whether 
>>>>>>>>>    addprocs_sge(50) does what I have in mind; does this command 
>>>>>>>>>    require step 1, or does it act as a scheduler itself?
>>>>>>>>>    3. Does a command such as remotecall(i, simulateMCMC, MCMCargs) 
>>>>>>>>>    run one of the chains on the i-th node while I am on the head 
>>>>>>>>>    node, i.e. on the node with myid() equal to 1? I mean, can I 
>>>>>>>>>    regulate communication in an MPI fashion from a so-called head 
>>>>>>>>>    node, or do I have to change the implementation?
>>>>>>>>>
>>>>>>>>> Thank you for any help and feedback - if I manage to make the 
>>>>>>>>> popMCMC simulation run on the cluster, I will share the code and 
>>>>>>>>> "howto".
>>>>>>>>>
>>>>>>>>>
