Hi Ben,
I'm getting exactly that error message on my cluster. I'm wondering where I
am supposed to place the sleep(0.5) call - it seems the start_sge_worker
function doesn't exist anymore (not in base Julia, anyway). Would you know
of any other way to make sure the environment variables are loaded? I
always thought putting source ~/.bashrc into my submit script would do the
job.
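
For reference, here is a minimal illustration of why I suspect that sourcing
~/.bashrc from the submit script may have no effect: distro-default .bashrc
files typically return early when the shell is non-interactive, and batch
jobs run in non-interactive shells (the variable name below is made up for
the demo):

```shell
# Many distro-default ~/.bashrc files begin with a guard like this, so
# sourcing them from a batch job (a non-interactive shell) does nothing:
cat > bashrc.sample <<'EOF'
case $- in
    *i*) ;;       # interactive shell: keep reading the file
    *) return ;;  # non-interactive shell (batch job): stop here
esac
export MY_CLUSTER_VAR=loaded
EOF

# A non-interactive shell returns before the export is reached:
sh -c '. ./bashrc.sample; echo "MY_CLUSTER_VAR=[$MY_CLUSTER_VAR]"'
# prints MY_CLUSTER_VAR=[]
```

If that is what is happening here, the -V flag to qsub (which exports the
submitting shell's environment into the job) would sidestep it entirely.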
best
florian
On Friday, 28 June 2013 17:04:29 UTC+1, Ben Lauwens wrote:
>
> Hello
>
> I did some debugging, and it seems that in my case the environment
> variables are not set:
>
> /home/blauwens/julia/usr/bin/julia-release-basic: /usr/lib64/libstdc++.so.6:
> version `GLIBCXX_3.4.15' not found (required by
> /home/blauwens/julia/usr/bin/../lib/libjulia-release.so)
> Adding the -V argument to qsub,
>
> qsub_cmd = `echo $home/julia-release-basic --worker` |> `qsub -N JULIA -V
> -terse -cwd -j y -o $sgedir -t 1:$n`
>
> and putting
>
> sleep(0.5)
>
> before the success check solves this problem, but I get another one. The
> output stream file reads:
> bash: module: line 1: syntax error: unexpected end of file
> bash: error importing function definition for `module'
> julia_worker:9009#192.168.1.226
> The connection info is on the last line, but it will never be read by
> start_sge_workers. Here is a small hack that does the job.
>
> fexists = false
> while !fexists
>     try
>         fl = open(fname)
>         try
>             while !fexists
>                 conninfo = readline(fl)
>                 hostname, port = parse_connection_info(conninfo)
>                 fexists = (hostname != "")
>             end
>         finally
>             close(fl)
>         end
>     catch
>         # output file not there yet; wait and retry
>         print(".")
>         sleep(0.5)
>     end
> end
>
> After these modifications,
>
> addprocs_sge()
>
> works on an HP cluster running x86_64 GNU/Linux. Some feedback from other
> SGE users would be useful, and perhaps this hack can be merged into Julia
> base.
>
> Ben
>
> On Sunday, June 16, 2013 11:34:57 PM UTC+2, Theodore Papamarkou wrote:
>>
>> Thanks for trying this out, Kevin. I tried the same after you and got
>> the same error, although the job was queued:
>>
>> julia> addprocs_sge(2)
>> ERROR: assertion failed: ?
>>  in error at error.jl:22
>>  in assert at error.jl:43
>>  in success at process.jl:394
>>  in all at reduce.jl:175
>>  in success at process.jl:401
>>  in start_sge_workers at multi.jl:941
>>  in addprocs_sge at multi.jl:976
>>
>> $ qstat -u "ucaktpa"
>> job-ID  prior    name   user     state  submit/start at      queue  slots  ja-task-ID
>> -------------------------------------------------------------------------------------
>> 9696992 0.50290  JULIA  ucaktpa  qw     06/16/2013 22:16:14         1      1,2
>>
>> I checked the line in multi.jl you mentioned, and was thinking that I
>> pass several other options to qsub, e.g. to allocate memory or set
>> runtime thresholds (-l h_vmem=8G,vf=8G -l h_rt=0:3:0). It may be good to
>> pass these as extra arguments to start_sge_workers(); alternatively, we
>> could pass a single argument holding a configuration, similar to the
>> MATLAB sample code below:
>>
>> sched = findResource('scheduler', 'configuration', configuration);
>>
>> pjob = createParallelJob(sched);
>>
>> set(pjob, 'MinimumNumberOfWorkers', minNumWorkers);
>> set(pjob, 'MaximumNumberOfWorkers', maxNumWorkers);
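>>
>> To make the first idea concrete, here is a hypothetical sketch of what
>> the qsub invocation could look like with extra resource flags threaded
>> through (EXTRA_QSUB_OPTS and SGEDIR are illustrative names, not existing
>> parameters of start_sge_workers or multi.jl):

```shell
# Hypothetical extension of the qsub call issued by start_sge_workers;
# EXTRA_QSUB_OPTS and SGEDIR are illustrative names, not current multi.jl code.
EXTRA_QSUB_OPTS="-l h_vmem=8G,vf=8G -l h_rt=0:3:0"
echo "$HOME/julia-release-basic --worker" \
    | qsub -N JULIA -V -terse -cwd -j y -o "$SGEDIR" -t 1:4 $EXTRA_QSUB_OPTS
```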
>>
>> I will try to trace the addprocs_sge() error message...
>>
>>
>> On Sunday, June 16, 2013 10:05:05 PM UTC+1, Kevin Squire wrote:
>>>
>>> The relevant sge line in $JULIA_HOME/base/multi.jl has
>>>
>>> qsub_cmd = `echo $home/julia-release-basic --worker` | `qsub -N JULIA
>>> -terse -cwd -j y -o $sgedir -t 1:$n`
>>>
>>> So addprocs_sge() will do the qsub for you. When I just tried it, the
>>> workers started okay, but I received an error:
>>>
>>> julia> addprocs_sge(2)
>>> ERROR: assertion failed: ?
>>> in error at error.jl:22
>>> in assert at error.jl:43
>>> in success at process.jl:392
>>> in map at abstractarray.jl:1478
>>> in success at process.jl:394
>>> in start_sge_workers at multi.jl:1009
>>> in addprocs_sge at multi.jl:1044
>>>
>>> $ qstat -u "kmsquire"
>>> job-ID prior    name  user     state submit/start at     queue                          slots ja-task-ID
>>> ------------------------------------------------------------------------------------------------------
>>> 358164 10.50000 JULIA kmsquire r     06/16/2013 14:01:52 [email protected] 1     1
>>> 358164 10.50000 JULIA kmsquire r     06/16/2013 14:01:52 [email protected] 1     2
>>>
>>>
>>> Kevin
>>>
>>>
>>> On Sunday, June 16, 2013 1:31:28 PM UTC-7, Theodore Papamarkou wrote:
>>>>
>>>> The "--machinefile" option and the blog post on distributed numerical
>>>> optimization are potentially excellent sources of help, thanks a lot. I
>>>> will try to make use of them and will post here once I make some
>>>> progress.
>>>>
>>>> On Sunday, June 16, 2013 9:12:48 PM UTC+1, [email protected] wrote:
>>>>>
>>>>> I haven't tried to do what you are describing yet, but I know a
>>>>> little. In SGE there should be a file named "machinefile" somewhere. It
>>>>> might be "$TMP/machinefile", but don't quote me. If you have this file,
>>>>> which contains the hostnames of the nodes, you should be able to pass
>>>>> it to julia on startup with the "--machinefile" option. An example of
>>>>> this is on the Julia blog:
>>>>> http://julialang.org/blog/2013/04/distributed-numerical-optimization/
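>>>>>
>>>>> On SGE specifically, the granted hosts are usually listed in
>>>>> $PE_HOSTFILE (one line per host: hostname, slot count, queue,
>>>>> processor range), so a machinefile can be derived from it. A sketch,
>>>>> with sample data standing in for a live job's $PE_HOSTFILE:

```shell
# Format of SGE's $PE_HOSTFILE: "<hostname> <slots> <queue> <processors>".
# Sample data used here; inside a real job, read "$PE_HOSTFILE" instead.
cat > pe_hostfile.sample <<'EOF'
node01 2 all.q@node01 UNDEFINED
node02 1 all.q@node02 UNDEFINED
EOF

# Repeat each hostname once per granted slot, one hostname per line:
awk '{ for (i = 0; i < $2; i++) print $1 }' pe_hostfile.sample > machinefile
cat machinefile
# prints node01, node01, node02 (one per line)
```

>>>>> A job script could then start julia with "--machinefile machinefile".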
>>>>>
>>>>> I hope that helps a little.
>>>>>
>>>>> On Sunday, June 16, 2013 5:06:03 AM UTC-5, Theodore Papamarkou wrote:
>>>>>>
>>>>>> I want to run a population MCMC simulation using power density
>>>>>> estimators on 50 nodes of the departmental cluster, which uses SGE.
>>>>>> Each of the 50 nodes realizes a separate MCMC chain. The question
>>>>>> generalizes to any parallel job which needs to reserve several nodes.
>>>>>> I have found two relevant posts, namely
>>>>>>
>>>>>>
>>>>>> https://groups.google.com/forum/?fromgroups=#!searchin/julia-users/cluster/julia-users/IATlfsu4VJU/yw1y7N_dPg0J
>>>>>>
>>>>>> and
>>>>>>
>>>>>>
>>>>>> https://groups.google.com/forum/?fromgroups=#!searchin/julia-users/cluster/julia-users/IlPuQSwtTSQ/vpGCPA27uMYJ
>>>>>>
>>>>>> but I haven't found a finalized set of instructions yet to achieve
>>>>>> the required result.
>>>>>>
>>>>>>
>>>>>> 1. I guess the first step would be to instruct qsub directly to
>>>>>> reserve the required number of nodes. Typically, resource
>>>>>> requirements are specified with the "-l" option. Is "qsub -l nodes=50"
>>>>>> the appropriate invocation?
>>>>>> 2. The next question is how the number of processes nprocs() relates
>>>>>> to the number of reserved nodes (workers). Should I call
>>>>>> addprocs(50)? I think this is not good enough, because Julia would
>>>>>> not know which nodes to select. addprocs({"hostname1", "hostname2",
>>>>>> ..., "hostname50"}) seems to be the right command; nevertheless, it
>>>>>> is not handy to fetch the hostnames of the reserved nodes. So the
>>>>>> natural question becomes whether addprocs_sge(50) does what I have in
>>>>>> mind; does this command require step 1, or does it act as a scheduler?
>>>>>> 3. Does a command such as remotecall(i, simulateMCMC, MCMCargs) run
>>>>>> one of the chains on the i-th node while I am on the head node, i.e.
>>>>>> on the node with myid() equal to 1? I mean, do I regulate
>>>>>> communication in an MPI fashion by being on a so-called head node; is
>>>>>> this possible, or do I have to change the implementation?
>>>>>>
>>>>>> Thank you for any help and feedback - if I manage to make the popMCMC
>>>>>> simulation run on the cluster, I will share the code and "howto".
>>>>>>
>>>>>>