Hi Ben,
I'm getting exactly that error message on my cluster. I'm wondering where I
am supposed to place the sleep(0.5) call - it seems the start_sge_worker
function doesn't exist anymore (not in base Julia, anyway). Would you know
of any other way to make sure the environment variables are loaded? I
always thought putting source ~/.bashrc into my submit script would do the
job.
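
For reference, here is a minimal illustration of why I suspect that sourcing
~/.bashrc from the submit script may have no effect: distro-default .bashrc
files typically return early when the shell is non-interactive, and batch
jobs run in non-interactive shells (the variable name below is made up for
the demo):

```shell
# Many distro-default ~/.bashrc files begin with a guard like this, so
# sourcing them from a batch job (a non-interactive shell) does nothing:
cat > bashrc.sample <<'EOF'
case $- in
    *i*) ;;       # interactive shell: keep reading the file
    *) return ;;  # non-interactive shell (batch job): stop here
esac
export MY_CLUSTER_VAR=loaded
EOF

# A non-interactive shell returns before the export is reached:
sh -c '. ./bashrc.sample; echo "MY_CLUSTER_VAR=[$MY_CLUSTER_VAR]"'
# prints MY_CLUSTER_VAR=[]
```

If that is what is happening here, the -V flag to qsub (which exports the
submitting shell's environment into the job) would sidestep it entirely.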
best
florian
On Friday, 28 June 2013 17:04:29 UTC+1, Ben Lauwens wrote:
>
> Hello
>
> I did some debugging, and it seems that in my case the environment
> variables are not set:
>
> /home/blauwens/julia/usr/bin/julia-release-basic: /usr/lib64/libstdc++.so.6:
> version `GLIBCXX_3.4.15' not found (required by
> /home/blauwens/julia/usr/bin/../lib/libjulia-release.so)
> Adding the -V argument to qsub,
>
> qsub_cmd = `echo $home/julia-release-basic --worker` |> `qsub -N JULIA -V
> -terse -cwd -j y -o $sgedir -t 1:$n`
>
> and putting
>
> sleep(0.5)
>
> before the success check solves this problem, but I get another one. The
> output stream file reads:
> bash: module: line 1: syntax error: unexpected end of file
> bash: error importing function definition for `module'
> julia_worker:9009#192.168.1.226
> The connection info is on the last line, but it will never be read by
> start_sge_workers. Here is a small hack that does the job.
>
> fexists = false
> while !fexists
>     try
>         fl = open(fname)
>         try
>             while !fexists
>                 conninfo = readline(fl)
>                 hostname, port = parse_connection_info(conninfo)
>                 fexists = (hostname != "")
>             end
>         finally
>             close(fl)
>         end
>     catch
>         # output file not there yet; wait and retry
>         print(".")
>         sleep(0.5)
>     end
> end
>
> After these modifications,
>
> addprocs_sge()
>
> works on an HP cluster running x86_64 GNU/Linux. Some feedback from other
> SGE users would be useful, and perhaps this hack can be merged into Julia
> base.
>
> Ben
>
> On Sunday, June 16, 2013 11:34:57 PM UTC+2, Theodore Papamarkou wrote:
>>
>> Thanks for trying this out, Kevin. I tried the same after you and got
>> the same error, although the job was queued:
>>
>> julia> addprocs_sge(2)
>> ERROR: assertion failed: ?
>>  in error at error.jl:22
>>  in assert at error.jl:43
>>  in success at process.jl:394
>>  in all at reduce.jl:175
>>  in success at process.jl:401
>>  in start_sge_workers at multi.jl:941
>>  in addprocs_sge at multi.jl:976
>>
>> $ qstat -u "ucaktpa"
>> job-ID  prior    name   user     state  submit/start at      queue  slots  ja-task-ID
>> -------------------------------------------------------------------------------------
>> 9696992 0.50290  JULIA  ucaktpa  qw     06/16/2013 22:16:14         1      1,2
>>
>> I checked the line in multi.jl you mentioned, and was thinking that I
>> pass several other options to qsub, e.g. to allocate memory or set
>> runtime thresholds (-l h_vmem=8G,vf=8G -l h_rt=0:3:0). It may be good to
>> pass these as extra arguments to start_sge_workers(); alternatively, we
>> could pass a single argument holding a configuration, similar to the
>> MATLAB sample code below:
>>
>> sched = findResource('scheduler', 'configuration', configuration);
>>
>> pjob = createParallelJob(sched);
>>
>> set(pjob, 'MinimumNumberOfWorkers', minNumWorkers);
>> set(pjob, 'MaximumNumberOfWorkers', maxNumWorkers);
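>>
>> To make the first idea concrete, here is a hypothetical sketch of what
>> the qsub invocation could look like with extra resource flags threaded
>> through (EXTRA_QSUB_OPTS and SGEDIR are illustrative names, not existing
>> parameters of start_sge_workers or multi.jl):

```shell
# Hypothetical extension of the qsub call issued by start_sge_workers;
# EXTRA_QSUB_OPTS and SGEDIR are illustrative names, not current multi.jl code.
EXTRA_QSUB_OPTS="-l h_vmem=8G,vf=8G -l h_rt=0:3:0"
echo "$HOME/julia-release-basic --worker" \
    | qsub -N JULIA -V -terse -cwd -j y -o "$SGEDIR" -t 1:4 $EXTRA_QSUB_OPTS
```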
>>
>> I will try to trace the addprocs_sge() error message...
>>
>>
>> On Sunday, June 16, 2013 10:05:05 PM UTC+1, Kevin Squire wrote:
>>>
>>> The relevant sge line in $JULIA_HOME/base/multi.jl has
>>>
>>> qsub_cmd = `echo $home/julia-release-basic --worker` | `qsub -N JULIA
>>> -terse -cwd -j y -o $sgedir -t 1:$n`
>>>
>>> So addprocs_sge() will do the qsub for you. When I just tried it, the
>>> workers started okay, but I received an error:
>>>
>>> julia> addprocs_sge(2)
>>> ERROR: assertion failed: ?
>>> in error at error.jl:22
>>> in assert at error.jl:43
>>> in success at process.jl:392
>>> in map at abstractarray.jl:1478
>>> in success at process.jl:394
>>> in start_sge_workers at multi.jl:1009
>>> in addprocs_sge at multi.jl:1044
>>>
>>> $ qstat -u "kmsquire"
>>> job-ID prior    name  user     state submit/start at     queue                          slots ja-task-ID
>>> ------------------------------------------------------------------------------------------------------
>>> 358164 10.50000 JULIA kmsquire r     06/16/2013 14:01:52 [email protected] 1     1
>>> 358164 10.50000 JULIA kmsquire r     06/16/2013 14:01:52 [email protected] 1     2
>>>
>>>
>>> Kevin
>>>
>>>
>>> On Sunday, June 16, 2013 1:31:28 PM UTC-7, Theodore Papamarkou wrote:
>>>>
>>>> The "--machinefile" option and the blog post on distributed numerical
>>>> optimization are potentially excellent sources of help, thanks a lot. I
>>>> will try to make use of them and will post here once I make some
>>>> progress.
>>>>
>>>> On Sunday, June 16, 2013 9:12:48 PM UTC+1, [email protected] wrote:
>>>>>
>>>>> I haven't tried to do what you are describing yet, but I know a
>>>>> little. In SGE there should be a file named "machinefile" somewhere. It
>>>>> might be "$TMP/machinefile", but don't quote me. If you have this file,
>>>>> which contains the hostnames of the nodes, you should be able to pass
>>>>> it to julia on startup with the "--machinefile" option. An example of
>>>>> this is on the Julia blog:
>>>>> http://julialang.org/blog/2013/04/distributed-numerical-optimization/
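>>>>>
>>>>> On SGE specifically, the granted hosts are usually listed in
>>>>> $PE_HOSTFILE (one line per host: hostname, slot count, queue,
>>>>> processor range), so a machinefile can be derived from it. A sketch,
>>>>> with sample data standing in for a live job's $PE_HOSTFILE:

```shell
# Format of SGE's $PE_HOSTFILE: "<hostname> <slots> <queue> <processors>".
# Sample data used here; inside a real job, read "$PE_HOSTFILE" instead.
cat > pe_hostfile.sample <<'EOF'
node01 2 all.q@node01 UNDEFINED
node02 1 all.q@node02 UNDEFINED
EOF

# Repeat each hostname once per granted slot, one hostname per line:
awk '{ for (i = 0; i < $2; i++) print $1 }' pe_hostfile.sample > machinefile
cat machinefile
# prints node01, node01, node02 (one per line)
```

>>>>> A job script could then start julia with "--machinefile machinefile".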
>>>>>
>>>>> I hope that helps a little.
>>>>>
>>>>> On Sunday, June 16, 2013 5:06:03 AM UTC-5, Theodore Papamarkou wrote:
>>>>>>
>>>>>> I want to run a population MCMC simulation using power density
>>>>>> estimators on 50 nodes of the departmental cluster, which uses SGE.
>>>>>> Each of the 50 nodes realizes a separate MCMC chain. The question
>>>>>> generalizes to any parallel job which needs to reserve several nodes.
>>>>>> I have found two relevant posts, namely
>>>>>>
>>>>>>
>>>>>> https://groups.google.com/forum/?fromgroups=#!searchin/julia-users/cluster/julia-users/IATlfsu4VJU/yw1y7N_dPg0J
>>>>>>
>>>>>> and
>>>>>>
>>>>>>
>>>>>> https://groups.google.com/forum/?fromgroups=#!searchin/julia-users/cluster/julia-users/IlPuQSwtTSQ/vpGCPA27uMYJ
>>>>>>
>>>>>> but I haven't found a finalized set of instructions yet to achieve
>>>>>> the required result.
>>>>>>
>>>>>>
>>>>>> 1. I guess the first step would be to instruct qsub directly to
>>>>>> reserve the required number of nodes. Typically, resource
>>>>>> requirements are specified with the "-l" option. Is "qsub -l nodes=50"
>>>>>> the appropriate invocation?
>>>>>> 2. The next question is how the number of processes nprocs() relates
>>>>>> to the number of reserved nodes (workers). Should I call
>>>>>> addprocs(50)? I think this is not good enough, because Julia would
>>>>>> not know which nodes to select. addprocs({"hostname1", "hostname2",
>>>>>> ..., "hostname50"}) seems to be the right command; nevertheless, it
>>>>>> is not handy to fetch the hostnames of the reserved nodes. So the
>>>>>> natural question becomes whether addprocs_sge(50) does what I have in
>>>>>> mind; does this command require step 1, or does it act as a scheduler?
>>>>>> 3. Does a command such as remotecall(i, simulateMCMC, MCMCargs) run
>>>>>> one of the chains on the i-th node while I am on the head node, i.e.
>>>>>> on the node with myid() equal to 1? I mean, do I regulate
>>>>>> communication in an MPI fashion by being on a so-called head node; is
>>>>>> this possible, or do I have to change the implementation?
>>>>>>
>>>>>> Thank you for any help and feedback - if I manage to make the popMCMC
>>>>>> simulation run on the cluster, I will share the code and "howto".
>>>>>>
>>>>>>