Thank you very much for this answer. Actually, I was able to compile Julia on the economics cluster and run a parallel command using ClusterManagers.addprocs_sge. This is already great, but now I want to be able to submit jobs to the main scheduler. I am going to look into extracting the machine file given by the scheduler and start from there. Concerning UCL, I am leaving soon, but I think it would be nice for them to get it up and running on Legion. Thank you for your response, and I will update this thread if I am able to write up a good submission script.

very best, t.

On Thursday, 24 April 2014 09:41:57 UTC+1, Theodore Papamarkou wrote:

Hi Thibaut,

It is possible to run jobs on a cluster, but the integration between Julia and SGE (and job schedulers in general) has not yet reached the desired state. There are a couple of packages around for this purpose, yet more work is needed in this direction. In theory, it should be possible to integrate MPI with Julia, since Julia's parallel computing paradigm has the necessary components for calling MPI (a volunteer must find the time to write a Julia wrapper for the MPI library). In the meantime, Julia's parallel environment is excellent if you want to use it as a stand-alone entity (and if you know the machine hostnames, in case you can avoid using SGE, or in case you have good IT support with SGE).

You may have a look at this post too:

https://groups.google.com/forum/#!searchin/julia-dev/papamarkou/julia-dev/fuATZYYAYK8/8_2rsvdyFakJ

I noticed you are based at UCL, is this correct? Unfortunately, my experience with the IT support provided by the computer science department of UCL has not been ideal with regards to the use of Julia (at least with regards to their cluster and the specific support team I had to deal with). I had opened a ticket, chased the IT team for 3 months by sending weekly emails, they kept me on the waiting list, constantly postponing the matter, and in the end they simply closed the ticket without bothering to look at it!
I would say don't waste your time trying to run a Julia job submitted via SGE on their clusters; the support is hopeless (furthermore, there is not much you can do by yourself, since you wouldn't have the admin privileges to access the SGE configuration, of course). You may try the Legion cluster at UCL though; you may have better support and luck with that, if you are willing to give it a shot.

Hope this helps,
Theo

On Tuesday, 22 April 2014 01:09:07 UTC+1, Thibaut Lamadon wrote:

Dear all, is there a working example of Julia with SGE and MPI?

thank you,

t.

On Friday, 25 October 2013 12:58:39 UTC+1, Theodore Papamarkou wrote:

It took me a while to start using Julia on the departmental cluster, but I got there. I am aware that there are already some packages for cluster management, such as PTools, ClusterManagers and MATLABCluster. My preferred way of dealing with the matter is via the ${PE_HOSTFILE} SGE environment variable. ${PE_HOSTFILE} is set at qsub's runtime and holds the full path to the machine file. So I wrote my own awk script to process this machine file, whose format may differ depending on the cluster-specific configuration, so as to get a Julia-readable machine file consisting of one column with the names of the nodes in it. Then I simply start Julia via "julia --machinefile mymachinefile.txt" from the shell wrapper that I submit via qsub. I will provide an example on GitHub in case someone likes this approach.

On Saturday, 29 June 2013 08:36:06 UTC+1, Viral Shah wrote:

Also see this issue, which talks about abstracting the various cluster schedulers into a separate package. This should make it easier to support more schedulers, and also make it easier to patch these as we go along.
https://github.com/JuliaLang/julia/issues/3549

-viral

On Friday, June 28, 2013 9:40:08 PM UTC+5:30, Theodore Papamarkou wrote:

Many thanks Ben, I noted your hack down to try it out and will get back to you as soon as I do so (which will be rather soon). Anyone else's input is always welcome.

On Friday, June 28, 2013 5:04:29 PM UTC+1, Ben Lauwens wrote:

Hello

I did some debugging and it seems in my case that the environment variables are not set:

    /home/blauwens/julia/usr/bin/julia-release-basic: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.15' not found (required by /home/blauwens/julia/usr/bin/../lib/libjulia-release.so)

Adding the -V argument to qsub,

    qsub_cmd = `echo $home/julia-release-basic --worker` |> `qsub -N JULIA -V -terse -cwd -j y -o $sgedir -t 1:$n`

and putting

    sleep(0.5)

before the success check solves this problem, but I get another one. The output stream file reads:

    bash: module: line 1: syntax error: unexpected end of file
    bash: error importing function definition for `module'
    julia_worker:9009#192.168.1.226

The connection info is in the last line, but it will never be read by the function start_sge_workers. Here is a small hack that does the job:

    while !fexists
        try
            fl = open(fname)
            try
                while !fexists
                    conninfo = readline(fl)
                    hostname, port = parse_connection_info(conninfo)
                    fexists = (hostname != "")
                end
            finally
                close(fl)
            end
        catch
            print(".")
            sleep(0.5)
        end
    end

After these modifications,

    addprocs_sge()

works on an HP cluster running x86_64 GNU/Linux. Some feedback from other SGE users would be useful, and perhaps this hack can be merged into Julia base.
Ben

On Sunday, June 16, 2013 11:34:57 PM UTC+2, Theodore Papamarkou wrote:

Thanks for trying this out Kevin. I tried the same after you and got the same error, although the job was queued:

    julia> addprocs_sge(2)
    ERROR: assertion failed: ?
     in error at error.jl:22
     in assert at error.jl:43
     in success at process.jl:394
     in all at reduce.jl:175
     in success at process.jl:401
     in start_sge_workers at multi.jl:941
     in addprocs_sge at multi.jl:976

    $ qstat -u "ucaktpa"
    job-ID   prior    name   user     state  submit/start at      queue  slots  ja-task-ID
    --------------------------------------------------------------------------------------
    9696992  0.50290  JULIA  ucaktpa  qw     06/16/2013 22:16:14         1      1,2

I checked the line in multi.jl you mentioned, and was thinking that I pass several other options to qsub, e.g. in order to allocate memory or set runtime thresholds (-l h_vmem=8G,vf=8G -l h_rt=0:3:0). It may be good to pass them as extra arguments to start_sge_workers(); alternatively, we could pass a single argument, which could be a configuration file, similar to the MATLAB sample code below:

    sched = findResource('scheduler', 'configuration', configuration);
    pjob = createParallelJob(sched);
    set(pjob, 'MinimumNumberOfWorkers', minNumWorkers);
    set(pjob, 'MaximumNumberOfWorkers', maxNumWorkers);

I will try to trace the addprocs_sge() error message...
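Until such an interface exists, the extra resource flags mentioned above can simply be attached to a hand-written qsub submission. A minimal sketch (run_julia.sh is a hypothetical wrapper script name; the QSUB variable is an assumption added here so the command can be dry-run with QSUB=echo on a machine without SGE):

```shell
#!/bin/sh
# Sketch: submit a Julia wrapper with the resource limits discussed above
# (8 GB virtual memory, 3-minute runtime cap). run_julia.sh is a placeholder;
# override QSUB (e.g. QSUB=echo) to inspect the command without a live SGE.
${QSUB:-qsub} -V -terse -cwd -j y \
    -l h_vmem=8G,vf=8G -l h_rt=0:3:0 \
    run_julia.sh
```

The -V flag exports the caller's environment to the job, which is what fixed the GLIBCXX error in Ben's hack above.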
On Sunday, June 16, 2013 10:05:05 PM UTC+1, Kevin Squire wrote:

The relevant SGE line in $JULIA_HOME/base/multi.jl has

    qsub_cmd = `echo $home/julia-release-basic --worker` | `qsub -N JULIA -terse -cwd -j y -o $sgedir -t 1:$n`

So addprocs_sge() will do the qsub for you. When I just tried it, the workers started okay, but I received an error:

    julia> addprocs_sge(2)
    ERROR: assertion failed: ?
     in error at error.jl:22
     in assert at error.jl:43
     in success at process.jl:392
     in map at abstractarray.jl:1478
     in success at process.jl:394
     in start_sge_workers at multi.jl:1009
     in addprocs_sge at multi.jl:1044

    $ qstat -u "kmsquire"
    job-ID  prior     name   user      state  submit/start at      queue                slots  ja-task-ID
    -----------------------------------------------------------------------------------------------------
    358164  10.50000  JULIA  kmsquire  r      06/16/2013 14:01:52  [email protected]     1      1
    358164  10.50000  JULIA  kmsquire  r      06/16/2013 14:01:52  [email protected]     1      2

Kevin

On Sunday, June 16, 2013 1:31:28 PM UTC-7, Theodore Papamarkou wrote:

The "--machinefile" option and the blog post on distributed numerical optimization are potentially excellent sources to help me, thanks a lot. I will try to make use of them and will post here once I make some progress.

On Sunday, June 16, 2013 9:12:48 PM UTC+1, [email protected] wrote:

I haven't tried to do what you are describing yet, but I know a little. In SGE there should be a file named "machinefile" somewhere. It might be "$TMP/machinefile", but don't quote me.
If you have this file, which contains the hostnames of the nodes, you should be able to pass it to julia on startup with the "--machinefile" option. An example of this is on the Julia blog:

http://julialang.org/blog/2013/04/distributed-numerical-optimization/

I hope that helps a little.

On Sunday, June 16, 2013 5:06:03 AM UTC-5, Theodore Papamarkou wrote:

I want to run a population MCMC simulation using power density estimators on 50 nodes of the departmental cluster, which uses SGE. Each of the 50 nodes realizes a separate MCMC chain. The question generalizes to any parallel job which needs to reserve several nodes. I have found two relevant posts, namely

https://groups.google.com/forum/?fromgroups=#!searchin/julia-users/cluster/julia-users/IATlfsu4VJU/yw1y7N_dPg0J

and

https://groups.google.com/forum/?fromgroups=#!searchin/julia-users/cluster/julia-users/IlPuQSwtTSQ/vpGCPA27uMYJ

but I haven't found a finalized set of instructions yet to achieve the required result.

1. I guess the first step would be to instruct qsub directly to reserve the required number of nodes. Typically, resource requirements are specified with the "-l" option. Is "qsub -l nodes=50" the appropriate invocation?
2. The next question is how the number of processes nprocs() relates to the number of reserved nodes (workers). Should I call the command addprocs(50)? I think this is not good enough, because Julia would not know which nodes to select.
addprocs({"hostname1", "hostname2", ..., "hostname50"}) seems to be the right command. Nevertheless, it is not handy to fetch the hostnames of the reserved nodes. So the natural question becomes whether addprocs_sge(50) does what I have in mind; does this command require step 1, or does it act as a scheduler?
3. Does a command such as remotecall(i, simulateMCMC, MCMCargs) run one of the chains on the i-th node while I am on the head node, i.e. on the node with myid() equal to 1? I mean, do I regulate communication in an MPI fashion by being on a so-called head node? Is this possible, or do I have to change the implementation?

Thank you for any help and feedback - if I manage to make the popMCMC simulation run on the cluster, I will share the code and "howto".
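Pulling the thread together, a submission wrapper along the lines Theo describes with ${PE_HOSTFILE} might look as follows. This is a sketch, not a tested recipe: under SGE, multi-slot reservations normally go through a parallel environment (qsub -pe <pe_name> 50) rather than "-l nodes=50"; the PE name "mpi", the driver script mcmc.jl, and the assumed PE_HOSTFILE field layout ("hostname slots queue processors") are all placeholders that vary per cluster.

```shell
#!/bin/sh
# Hypothetical SGE wrapper, submitted with something like:
#     qsub -pe mpi 50 run_julia.sh
# ("mpi" is a placeholder PE name; ask your admins which PEs exist.)
#$ -N julia_popmcmc
#$ -cwd
#$ -j y

# PE_HOSTFILE points at the scheduler-generated machine file. Assuming the
# common "hostname slots queue processors" layout, repeat each hostname once
# per granted slot so Julia starts one worker per slot.
awk '{ for (i = 0; i < $2; i++) print $1 }' "$PE_HOSTFILE" > machinefile.txt

# Launch Julia with the machine file; mcmc.jl is a placeholder driver script.
julia --machinefile machinefile.txt mcmc.jl
```

From the driver on process 1, a call such as remotecall(i, simulateMCMC, MCMCargs) would then dispatch one chain to worker i, as in question 3.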