Hi Thibaut,

It is possible to run jobs on a cluster, but the integration between Julia and SGE (and job schedulers in general) has not yet reached the desired state. There are a couple of packages around for this purpose, but more work is needed in this direction. In theory, it should be possible to integrate MPI with Julia, since Julia's parallel computing paradigm has the necessary components for calling MPI (a volunteer needs to find the time to write a Julia wrapper for the MPI library). In the meantime, Julia's parallel environment is excellent if you want to use it as a stand-alone entity (and if you know the machine hostnames, in case you can avoid using SGE, or in case you have good IT support for SGE).
You may have a look at this post too:
https://groups.google.com/forum/#!searchin/julia-dev/papamarkou/julia-dev/fuATZYYAYK8/8_2rsvdyFakJ

I noticed you are based at UCL, is this correct? Unfortunately, my experience with the IT support provided by the computer science department of UCL has not been ideal with regard to the use of Julia (at least as far as their cluster and the specific support team I had to deal with are concerned). I had opened a ticket, chased the IT team for 3 months by sending emails on a weekly basis, they kept me on the waiting list, constantly postponing the matter, and in the end they simply closed the ticket without bothering to look at it! I would say don't waste your time trying to run a Julia job submitted via SGE on their clusters; the support is hopeless (furthermore, there is not much you can do by yourself, since you wouldn't have the admin privileges to access the SGE configuration, of course). You may try the Legion cluster at UCL though; you may have better support and luck with that, if you are willing to give it a shot.

Hope this helps,
Theo

On Tuesday, 22 April 2014 01:09:07 UTC+1, Thibaut Lamadon wrote:
>
> Dear all, is there a working example of julia with SGE and MPI?
>
> thank you,
>
> t.
>
> On Friday, 25 October 2013 12:58:39 UTC+1, Theodore Papamarkou wrote:
>>
>> It took me a while to start using Julia on the departmental cluster, but
>> I got there. I am aware that there are already some packages for cluster
>> management, such as PTools, ClusterManagers and MATLABCluster. My preferred
>> way of dealing with the matter is via the ${PE_HOSTFILE} SGE environment
>> variable. ${PE_HOSTFILE} is created at qsub's runtime and holds the full
>> path to the machine file. So, I wrote my own awk script to process this
>> machine file, whose format may differ depending on the cluster-specific
>> configuration, so as to get a Julia-readable machine file consisting of
>> one column with the names of the nodes in it.
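[Editor's note: a minimal sketch of the kind of awk step described above — not the author's actual script. The sample input line mimics a typical SGE machine file ("hostname slots queue processor"), but the exact format differs per cluster.]

```shell
# Keep only the hostname column of an SGE-style machine file, producing the
# one-column list that "julia --machinefile" expects. The sample lines below
# stand in for the file that ${PE_HOSTFILE} would point to inside a real job.
printf 'node17 4 [email protected] UNDEFINED\nnode18 4 [email protected] UNDEFINED\n' |
awk '{ print $1 }' > mymachinefile.txt
# On the cluster, read from "${PE_HOSTFILE}" instead of the printf pipe.
# To start one worker per allocated slot rather than one per node, use:
#   awk '{ for (i = 0; i < $2; i++) print $1 }'
cat mymachinefile.txt
```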
>> Then I simply start julia via
>> "julia --machinefile mymachinefile.txt" from the shell wrapper that I
>> submit via qsub. I will provide an example on github in case someone likes
>> this approach.
>>
>> On Saturday, 29 June 2013 08:36:06 UTC+1, Viral Shah wrote:
>>>
>>> Also do see this issue, which talks about abstracting various cluster
>>> schedulers into a separate package. This should make it easier to support
>>> more schedulers, and also make it easier to patch these as we go along.
>>>
>>> https://github.com/JuliaLang/julia/issues/3549
>>>
>>> -viral
>>>
>>> On Friday, June 28, 2013 9:40:08 PM UTC+5:30, Theodore Papamarkou wrote:
>>>>
>>>> Many thanks Ben, I noted your hack down to try it out and will get back
>>>> to you as soon as I do so (which will be rather soon). Anyone else's input
>>>> is always welcome.
>>>>
>>>> On Friday, June 28, 2013 5:04:29 PM UTC+1, Ben Lauwens wrote:
>>>>>
>>>>> Hello
>>>>>
>>>>> I did some debugging and it seems in my case that the environment
>>>>> variables are not set:
>>>>>
>>>>> /home/blauwens/julia/usr/bin/julia-release-basic: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.15' not found (required by /home/blauwens/julia/usr/bin/../lib/libjulia-release.so)
>>>>>
>>>>> Adding the -V argument to qsub
>>>>>
>>>>> qsub_cmd = `echo $home/julia-release-basic --worker` |> `qsub -N JULIA -V -terse -cwd -j y -o $sgedir -t 1:$n`
>>>>>
>>>>> and putting
>>>>>
>>>>> sleep(0.5)
>>>>>
>>>>> before the success check solves this problem, but I get another one.
>>>>> The output stream file reads:
>>>>>
>>>>> bash: module: line 1: syntax error: unexpected end of file
>>>>> bash: error importing function definition for `module'
>>>>> julia_worker:9009#192.168.1.226
>>>>>
>>>>> The connection info is the last line but will never be read by the
>>>>> function start_sge_workers. Here is a small hack that does the job.
>>>>>
>>>>> while !fexists
>>>>>     try
>>>>>         fl = open(fname)
>>>>>         try
>>>>>             while !fexists
>>>>>                 conninfo = readline(fl)
>>>>>                 hostname, port = parse_connection_info(conninfo)
>>>>>                 fexists = (hostname != "")
>>>>>             end
>>>>>         finally
>>>>>             close(fl)
>>>>>         end
>>>>>     catch
>>>>>         print(".")
>>>>>         sleep(0.5)
>>>>>     end
>>>>> end
>>>>>
>>>>> After these modifications,
>>>>>
>>>>> addprocs_sge()
>>>>>
>>>>> works on an HP cluster running x86_64 GNU/Linux.
>>>>> Some feedback from other SGE users would be useful, and perhaps this
>>>>> hack can be merged into julia base.
>>>>>
>>>>> Ben
>>>>>
>>>>> On Sunday, June 16, 2013 11:34:57 PM UTC+2, Theodore Papamarkou wrote:
>>>>>>
>>>>>> Thanks for trying this out Kevin. I tried the same after you and got
>>>>>> the same error, although the job was queued:
>>>>>>
>>>>>> julia> addprocs_sge(2)
>>>>>> ERROR: assertion failed: ?
>>>>>>  in error at error.jl:22
>>>>>>  in assert at error.jl:43
>>>>>>  in success at process.jl:394
>>>>>>  in all at reduce.jl:175
>>>>>>  in success at process.jl:401
>>>>>>  in start_sge_workers at multi.jl:941
>>>>>>  in addprocs_sge at multi.jl:976
>>>>>>
>>>>>> $ qstat -u "ucaktpa"
>>>>>> job-ID   prior    name   user     state  submit/start at      queue  slots  ja-task-ID
>>>>>> --------------------------------------------------------------------------------------
>>>>>> 9696992  0.50290  JULIA  ucaktpa  qw     06/16/2013 22:16:14         1      1,2
>>>>>>
>>>>>> I checked the line in multi.jl you mentioned, and was thinking that I
>>>>>> pass several other options to qsub, e.g. in order to allocate memory or set
>>>>>> runtime thresholds (-l h_vmem=8G,vf=8G -l h_rt=0:3:0).
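[Editor's note: for concreteness, a hedged sketch of how the qsub flags used by addprocs_sge() could be combined with the memory and runtime limits quoted above. The limit values are the ones from this message; SGEDIR and the worker count N are illustrative assumptions, not part of the thread.]

```shell
#!/bin/sh
# Illustrative only: fold the extra resource options into the qsub invocation
# that addprocs_sge() builds. SGEDIR and N are placeholder assumptions.
N=2
SGEDIR="$HOME/sge-output"
QSUB_CMD="qsub -N JULIA -V -terse -cwd -j y -o $SGEDIR -t 1:$N -l h_vmem=8G,vf=8G -l h_rt=0:3:0"
# On the cluster, one would pipe the worker command into it:
#   echo julia --worker | $QSUB_CMD
echo "$QSUB_CMD"
```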
>>>>>> It may be good to
>>>>>> pass them as extra arguments to start_sge_workers(); alternatively, we
>>>>>> could pass a single argument, which could be a configuration file, similar
>>>>>> to the MATLAB sample code below:
>>>>>>
>>>>>> sched = findResource('scheduler', 'configuration', configuration);
>>>>>> pjob = createParallelJob(sched);
>>>>>> set(pjob, 'MinimumNumberOfWorkers', minNumWorkers);
>>>>>> set(pjob, 'MaximumNumberOfWorkers', maxNumWorkers);
>>>>>>
>>>>>> I will try to trace the addprocs_sge() error message...
>>>>>>
>>>>>> On Sunday, June 16, 2013 10:05:05 PM UTC+1, Kevin Squire wrote:
>>>>>>>
>>>>>>> The relevant sge line in $JULIA_HOME/base/multi.jl has
>>>>>>>
>>>>>>> qsub_cmd = `echo $home/julia-release-basic --worker` | `qsub -N JULIA -terse -cwd -j y -o $sgedir -t 1:$n`
>>>>>>>
>>>>>>> So addprocs_sge() will do the qsub for you. When I just tried it,
>>>>>>> the workers started okay, but I received an error:
>>>>>>>
>>>>>>> julia> addprocs_sge(2)
>>>>>>> ERROR: assertion failed: ?
>>>>>>>  in error at error.jl:22
>>>>>>>  in assert at error.jl:43
>>>>>>>  in success at process.jl:392
>>>>>>>  in map at abstractarray.jl:1478
>>>>>>>  in success at process.jl:394
>>>>>>>  in start_sge_workers at multi.jl:1009
>>>>>>>  in addprocs_sge at multi.jl:1044
>>>>>>>
>>>>>>> $ qstat -u "kmsquire"
>>>>>>> job-ID  prior     name   user      state  submit/start at      queue                 slots  ja-task-ID
>>>>>>> -----------------------------------------------------------------------------------------------------
>>>>>>> 358164  10.50000  JULIA  kmsquire  r      06/16/2013 14:01:52  [email protected]  1      1
>>>>>>> 358164  10.50000  JULIA  kmsquire  r      06/16/2013 14:01:52  [email protected]  1      2
>>>>>>>
>>>>>>> Kevin
>>>>>>>
>>>>>>> On Sunday, June 16, 2013 1:31:28 PM UTC-7, Theodore Papamarkou wrote:
>>>>>>>>
>>>>>>>> The "--machinefile" option and the blog post on distributed
>>>>>>>> numerical optimization are potentially excellent sources to help me, thanks
>>>>>>>> a lot. I will try to make use of them and will post here once I make some
>>>>>>>> progress.
>>>>>>>>
>>>>>>>> On Sunday, June 16, 2013 9:12:48 PM UTC+1, [email protected] wrote:
>>>>>>>>>
>>>>>>>>> I haven't tried to do what you are describing yet, but I know a
>>>>>>>>> little. In SGE there should be a file named "machinefile" somewhere. It
>>>>>>>>> might be "$TMP/machinefile", but don't quote me. If you have this file,
>>>>>>>>> which contains the hostnames of the nodes, you should be able to pass it to
>>>>>>>>> julia on startup with the "--machinefile" option. An example of this is on
>>>>>>>>> the Julia blog:
>>>>>>>>> http://julialang.org/blog/2013/04/distributed-numerical-optimization/
>>>>>>>>>
>>>>>>>>> I hope that helps a little.
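[Editor's note: putting the pieces from this thread together, a qsub wrapper script along these lines might work. All names (the PE name "orte", script.jl, file names) are illustrative assumptions; the machine-file format and module setup vary per cluster. Outside SGE the script falls back to a demo machine file so the sketch can be tried locally.]

```shell
#!/bin/sh
# Sketch of an SGE job script in the spirit of this thread (not a tested,
# definitive implementation). The "#$" lines are SGE directives.
#$ -N julia_job
#$ -cwd
#$ -j y
#$ -pe orte 4

# Outside a real SGE job, PE_HOSTFILE is unset; use a demo file instead.
if [ -z "${PE_HOSTFILE}" ]; then
    printf 'node1 4 [email protected] UNDEFINED\nnode2 4 [email protected] UNDEFINED\n' > pe_hostfile.demo
    PE_HOSTFILE=pe_hostfile.demo
fi

# Reduce the SGE machine file to the one-hostname-per-line list Julia expects.
awk '{ print $1 }' "${PE_HOSTFILE}" > mymachinefile.txt

# Launch Julia across the reserved nodes (script.jl is a placeholder name).
if command -v julia >/dev/null 2>&1; then
    julia --machinefile mymachinefile.txt script.jl
fi
```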
>>>>>>>>>
>>>>>>>>> On Sunday, June 16, 2013 5:06:03 AM UTC-5, Theodore Papamarkou wrote:
>>>>>>>>>>
>>>>>>>>>> I want to run a population MCMC simulation using power density
>>>>>>>>>> estimators on 50 nodes of the departmental cluster, which uses SGE. Each of
>>>>>>>>>> the 50 nodes realizes a separate MCMC chain. The question generalizes to
>>>>>>>>>> any parallel job which needs to reserve several nodes. I have found two
>>>>>>>>>> relevant posts, namely
>>>>>>>>>>
>>>>>>>>>> https://groups.google.com/forum/?fromgroups=#!searchin/julia-users/cluster/julia-users/IATlfsu4VJU/yw1y7N_dPg0J
>>>>>>>>>>
>>>>>>>>>> and
>>>>>>>>>>
>>>>>>>>>> https://groups.google.com/forum/?fromgroups=#!searchin/julia-users/cluster/julia-users/IlPuQSwtTSQ/vpGCPA27uMYJ
>>>>>>>>>>
>>>>>>>>>> but I haven't found a finalized set of instructions yet to achieve the
>>>>>>>>>> required result.
>>>>>>>>>>
>>>>>>>>>> 1. I guess the first step would be to instruct qsub directly to reserve
>>>>>>>>>> the required number of nodes. Typically, resource requirements are
>>>>>>>>>> specified with the "-l" option. Is "qsub -l nodes=50" the appropriate
>>>>>>>>>> invocation?
>>>>>>>>>> 2. The next question is how the number of processes nprocs() relates to
>>>>>>>>>> the number of reserved nodes (workers). Should I call the command
>>>>>>>>>> addprocs(50)? I think this is not good enough, because Julia would not
>>>>>>>>>> know which nodes to select. addprocs({"hostname1", "hostname2", ...,
>>>>>>>>>> "hostname50"}) seems to be the right command. Nevertheless, it is not
>>>>>>>>>> handy to fetch the hostnames of the reserved nodes. So the natural
>>>>>>>>>> question becomes whether addprocs_sge(50) does what I have in mind;
>>>>>>>>>> does this command require step 1, or does it act as a scheduler?
>>>>>>>>>> 3. Does a command such as remotecall(i, simulateMCMC, MCMCargs) run one
>>>>>>>>>> of the chains on the i-th node while I am on the head node, i.e. on the
>>>>>>>>>> node with myid() equal to 1? I mean, do I regulate communication in an
>>>>>>>>>> MPI fashion by being on a so-called head node, is this possible, or do
>>>>>>>>>> I have to change the implementation?
>>>>>>>>>>
>>>>>>>>>> Thank you for any help and feedback - if I manage to make the
>>>>>>>>>> popMCMC simulation run on the cluster, I will share the code and "howto".
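[Editor's note on question 1: plain SGE normally reserves multiple slots through a parallel environment ("-pe"), not "-l nodes=50", which is PBS/Torque syntax. A hedged sketch, assuming the cluster defines a parallel environment; "orte" is a placeholder name.]

```shell
#!/bin/sh
# Request 50 slots via a parallel environment. The PE name is site-specific;
# admins can list the available ones with: qconf -spl
#$ -pe orte 50
#$ -l h_vmem=8G
#$ -l h_rt=0:3:0
# Inside a running job, SGE exports NSLOTS and PE_HOSTFILE for the
# reservation; outside SGE, these fall back to the defaults below.
MSG="got ${NSLOTS:-0} slots; machine file: ${PE_HOSTFILE:-none}"
echo "$MSG"
```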