Dear all, is there a working example of Julia with SGE and MPI?
thank you,
t.
On Friday, 25 October 2013 12:58:39 UTC+1, Theodore Papamarkou wrote:
>
> It took me a while to start using Julia on the departmental cluster, but I
> got there. I am aware that there are already some packages for cluster
> management, such as PTools, ClusterManagers and MATLABCluster. My preferred
> way of dealing with the matter is via the ${PE_HOSTFILE} SGE environment
> variable. ${PE_HOSTFILE} is created at qsub's runtime and holds the full
> path to the machine file. So, I wrote my own awk script to process this
> machine file, whose format may differ depending on the cluster-specific
> configuration, so as to get a Julia readable machine file, consisting of
> one column with the names of the nodes in it. Then I simply start julia via
> "julia --machinefile mymachinefile.txt" from the shell wrapper that I
> submit via qsub. I will provide an example on github in case someone likes
> this approach.
>
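A minimal sketch of the hostfile conversion described above, assuming the common SGE layout in which each line of ${PE_HOSTFILE} reads "hostname slots queue processor-range". As noted, the format can differ between clusters, so the column positions are an assumption. The function name pe_hostfile_to_machinefile is hypothetical, and repeating a hostname once per slot assumes that Julia starts one worker per line of the machine file.

```shell
# Convert an SGE PE hostfile ($1) into a one-hostname-per-line machine file
# written to stdout.
# Assumed input format (site-specific): "hostname slots queue processor-range".
# Each hostname is printed once per slot; lines whose second field is not a
# number are ignored.
pe_hostfile_to_machinefile() {
    awk 'NF >= 2 && $2 ~ /^[0-9]+$/ { for (i = 0; i < $2; i++) print $1 }' "$1"
}
```

In the shell wrapper submitted via qsub, this might be called as pe_hostfile_to_machinefile "$PE_HOSTFILE" > mymachinefile.txt, followed by julia --machinefile mymachinefile.txt.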
> On Saturday, 29 June 2013 08:36:06 UTC+1, Viral Shah wrote:
>>
>> Also do see this issue, which talks about abstracting various cluster
>> schedulers into a separate package. This should make it easier to support
>> more schedulers, and also make it easier to patch these as we go along.
>>
>> https://github.com/JuliaLang/julia/issues/3549
>>
>> -viral
>>
>> On Friday, June 28, 2013 9:40:08 PM UTC+5:30, Theodore Papamarkou wrote:
>>>
>>> Many thanks Ben, I noted your hack down to try it out and will get back
>>> to you as soon as I do so (which will be rather soon). Anyone else's input
>>> is always welcome.
>>>
>>> On Friday, June 28, 2013 5:04:29 PM UTC+1, Ben Lauwens wrote:
>>>>
>>>> Hello
>>>>
>>>> I did some debugging and it seems in my case that the environment
>>>> variables are not set
>>>> /home/blauwens/julia/usr/bin/julia-release-basic: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.15' not found (required by /home/blauwens/julia/usr/bin/../lib/libjulia-release.so)
>>>> Adding the -V argument to qsub
>>>> qsub_cmd = `echo $home/julia-release-basic --worker` |> `qsub -N JULIA -V -terse -cwd -j y -o $sgedir -t 1:$n`
>>>> and putting
>>>> sleep(0.5)
>>>> before the success check solves this problem but I get another one. The
>>>> output stream file reads
>>>> bash: module: line 1: syntax error: unexpected end of file
>>>> bash: error importing function definition for `module'
>>>> julia_worker:9009#192.168.1.226
>>>> The connection info is the last line but will never be read by the
>>>> function start_sge_workers. Here is a small hack that does the job.
>>>>
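>>>> # Retry until the output file exists and yields the connection info.
>>>> while !fexists
>>>>     try
>>>>         fl = open(fname)
>>>>         try
>>>>             # Skip lines (e.g. shell warnings) that are not connection info.
>>>>             while !fexists
>>>>                 conninfo = readline(fl)
>>>>                 hostname, port = parse_connection_info(conninfo)
>>>>                 fexists = (hostname != "")
>>>>             end
>>>>         finally
>>>>             close(fl)
>>>>         end
>>>>     catch
>>>>         # The file could not be opened yet: wait and try again.
>>>>         print(".");
>>>>         sleep(0.5)
>>>>     end
>>>> end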
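
The same idea, sketched in shell for checking a worker's output file by hand: skip any noise lines and pick out the first line of the form julia_worker:PORT#HOST, as in the output shown above. The function names find_conninfo and wait_for_conninfo and the default of 120 attempts are hypothetical, and fractional sleep is assumed to be supported (it is on GNU and BSD systems, but POSIX does not require it).

```shell
# Print "host port" for the first line of the form julia_worker:PORT#HOST
# in file $1. Lines that do not match (e.g. shell warnings) are skipped.
# Returns non-zero if the file is missing or has no such line yet.
find_conninfo() {
    sed -n 's/^julia_worker:\([0-9][0-9]*\)#\(.*\)$/\2 \1/p' "$1" 2>/dev/null |
        head -n 1 | grep .
}

# Poll file $1 every half second, for at most $2 attempts (default 120).
wait_for_conninfo() {
    attempts=${2:-120}
    while [ "$attempts" -gt 0 ]; do
        if find_conninfo "$1"; then
            return 0
        fi
        sleep 0.5   # assumption: sleep accepts fractional seconds
        attempts=$((attempts - 1))
    done
    return 1
}
```

Unlike the hack above, this sketch gives up after a bounded number of attempts, which avoids waiting forever on a job that never starts.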
>>>>
>>>> After these modifications,
>>>>
>>>> addprocs_sge()
>>>> works on an HP cluster running x86_64 GNU/Linux.
>>>> Some feedback from other SGE users would be useful, and perhaps this
>>>> hack can be merged into Julia Base.
>>>>
>>>> Ben
>>>>
>>>> On Sunday, June 16, 2013 11:34:57 PM UTC+2, Theodore Papamarkou wrote:
>>>>>
>>>>> Thanks for trying this out Kevin. I tried the same after you and got
>>>>> the same error, although the job was queued:
>>>>>
>>>>> julia> addprocs_sge(2)
>>>>> ERROR: assertion failed: ?
>>>>> in error at error.jl:22
>>>>> in assert at error.jl:43
>>>>> in success at process.jl:394
>>>>> in all at reduce.jl:175
>>>>> in success at process.jl:401
>>>>> in start_sge_workers at multi.jl:941
>>>>> in addprocs_sge at multi.jl:976
>>>>>
>>>>> $ qstat -u "ucaktpa"
>>>>> job-ID  prior   name  user    state submit/start at     queue slots ja-task-ID
>>>>> ------------------------------------------------------------------------------
>>>>> 9696992 0.50290 JULIA ucaktpa qw    06/16/2013 22:16:14       1     1,2
>>>>>
>>>>> I checked the line in multi.jl you mentioned, and was thinking that I
>>>>> pass several other options to qsub, e.g. to allocate memory or set
>>>>> runtime thresholds (-l h_vmem=8G,vf=8G -l h_rt=0:3:0). It may be good to
>>>>> pass them as extra arguments to start_sge_workers(); alternatively, we
>>>>> could pass a single argument, which could be a configuration file,
>>>>> similar to the MATLAB sample code below:
>>>>>
>>>>> sched = findResource('scheduler', 'configuration', configuration);
>>>>>
>>>>> pjob = createParallelJob(sched);
>>>>>
>>>>> set(pjob, 'MaximumNumberOfWorkers', maxNumWorkers);
>>>>> set(pjob, 'MinimumNumberOfWorkers', minNumWorkers);
>>>>>
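
A rough sketch of the "extra arguments" idea at the shell level, assuming the qsub flags quoted from multi.jl elsewhere in this thread. The function name sge_worker_cmd is hypothetical, and the function only prints the command line instead of submitting anything, so the result can be inspected first.

```shell
# Print the qsub invocation for $1 workers, logging to directory $2.
# Any remaining arguments are appended unchanged, e.g. resource requests
# such as -l h_rt=0:3:0. Nothing is submitted; the command is only printed.
sge_worker_cmd() {
    n=$1
    sgedir=$2
    shift 2
    echo qsub -N JULIA -V -terse -cwd -j y -o "$sgedir" -t "1:$n" "$@"
}
```

For example, sge_worker_cmd 2 /tmp/sge -l h_vmem=8G,vf=8G -l h_rt=0:3:0 prints the base command followed by the two resource requests. The worker command would then be piped into it as before.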
>>>>> I will try to trace the addprocs_sge() error message...
>>>>>
>>>>>
>>>>> On Sunday, June 16, 2013 10:05:05 PM UTC+1, Kevin Squire wrote:
>>>>>>
>>>>>> The relevant sge line in $JULIA_HOME/base/multi.jl has
>>>>>>
>>>>>> qsub_cmd = `echo $home/julia-release-basic --worker` | `qsub -N JULIA -terse -cwd -j y -o $sgedir -t 1:$n`
>>>>>>
>>>>>> So addprocs_sge() will do the qsub for you. When I just tried it,
>>>>>> the workers started okay, but I received an error:
>>>>>>
>>>>>> julia> addprocs_sge(2)
>>>>>> ERROR: assertion failed: ?
>>>>>> in error at error.jl:22
>>>>>> in assert at error.jl:43
>>>>>> in success at process.jl:392
>>>>>> in map at abstractarray.jl:1478
>>>>>> in success at process.jl:394
>>>>>> in start_sge_workers at multi.jl:1009
>>>>>> in addprocs_sge at multi.jl:1044
>>>>>>
>>>>>> $ qstat -u "kmsquire"
>>>>>> job-ID prior    name  user     state submit/start at     queue             slots ja-task-ID
>>>>>> ---------------------------------------------------------------------------------------------
>>>>>> 358164 10.50000 JULIA kmsquire r     06/16/2013 14:01:52 [email protected] 1     1
>>>>>> 358164 10.50000 JULIA kmsquire r     06/16/2013 14:01:52 [email protected] 1     2
>>>>>>
>>>>>>
>>>>>> Kevin
>>>>>>
>>>>>>
>>>>>> On Sunday, June 16, 2013 1:31:28 PM UTC-7, Theodore Papamarkou wrote:
>>>>>>>
>>>>>>> The "--machinefile" option and the blogpost on distributed numerical
>>>>>>> optimization are potentially excellent sources to help me, thanks a
>>>>>>> lot. I will try to make use of them and will post here once I make
>>>>>>> some progress.
>>>>>>>
>>>>>>> On Sunday, June 16, 2013 9:12:48 PM UTC+1, [email protected] wrote:
>>>>>>>>
>>>>>>>> I haven't tried to do what you are describing yet, but I know a
>>>>>>>> little. In SGE there should be a file named "machinefile" somewhere.
>>>>>>>> It might be "$TMP/machinefile", but don't quote me. If you have this
>>>>>>>> file, which contains the hostnames of the nodes, you should be able
>>>>>>>> to pass it to julia on startup with the "--machinefile" option. An
>>>>>>>> example of this is on the Julia blog
>>>>>>>> http://julialang.org/blog/2013/04/distributed-numerical-optimization/
>>>>>>>>
>>>>>>>> I hope that helps a little.
>>>>>>>>
>>>>>>>> On Sunday, June 16, 2013 5:06:03 AM UTC-5, Theodore Papamarkou wrote:
>>>>>>>>>
>>>>>>>>> I want to run a population MCMC simulation using power density
>>>>>>>>> estimators on 50 nodes of the departmental cluster, which uses SGE.
>>>>>>>>> Each of the 50 nodes realizes a separate MCMC chain. The question
>>>>>>>>> generalizes to any parallel job which needs to reserve several
>>>>>>>>> nodes. I have found two relevant posts, namely
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> https://groups.google.com/forum/?fromgroups=#!searchin/julia-users/cluster/julia-users/IATlfsu4VJU/yw1y7N_dPg0J
>>>>>>>>>
>>>>>>>>> and
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> https://groups.google.com/forum/?fromgroups=#!searchin/julia-users/cluster/julia-users/IlPuQSwtTSQ/vpGCPA27uMYJ
>>>>>>>>>
>>>>>>>>> but I haven't found a finalized set of instructions yet to achieve
>>>>>>>>> the required result.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 1. I guess the first step would be to instruct qsub directly to
>>>>>>>>>    reserve the required number of nodes. Typically, resource
>>>>>>>>>    requirements are specified with the "-l" option. Is
>>>>>>>>>    "qsub -l nodes=50" the appropriate invocation?
>>>>>>>>> 2. The next question is how the number of processes nprocs()
>>>>>>>>>    relates to the number of reserved nodes (workers). Should I call
>>>>>>>>>    addprocs(50)? I think this is not good enough because Julia
>>>>>>>>>    would not know which nodes to select. addprocs({"hostname1",
>>>>>>>>>    "hostname2", ..., "hostname50"}) seems to be the right command.
>>>>>>>>>    Nevertheless, it is not convenient to fetch the hostnames of the
>>>>>>>>>    reserved nodes. So the natural question becomes whether
>>>>>>>>>    addprocs_sge(50) does what I have in mind; does this command
>>>>>>>>>    require step 1, or does it act as a scheduler?
>>>>>>>>> 3. Does a command such as remotecall(i, simulateMCMC, MCMCargs)
>>>>>>>>>    run one of the chains on the i-th node while I am on the head
>>>>>>>>>    node, i.e. on the node with myid() equal to 1? I mean, do I
>>>>>>>>>    regulate communication in an MPI fashion by being on a so-called
>>>>>>>>>    head node? Is this possible, or do I have to change the
>>>>>>>>>    implementation?
>>>>>>>>>
>>>>>>>>> Thank you for any help and feedback - if I manage to make the
>>>>>>>>> popMCMC simulation run on the cluster, I will share the code and
>>>>>>>>> "howto".
>>>>>>>>>
>>>>>>>>>