Thank you very much for this answer. Actually, I was able to compile Julia on the economics cluster and run a parallel command using ClusterManagers.addprocs_sge. This is already great, but now I want to be able to submit jobs to the main scheduler. I am going to look into extracting the machine file given by the scheduler and start from there. Concerning UCL, I am leaving soon, but I think it would be nice for them to get it up and running on Legion. Thank you for your response, and I will update this thread if I am able to write up a good submission script.

very best, t.

On Thursday, 24 April 2014 09:41:57 UTC+1, Theodore Papamarkou wrote:

Hi Thibaut,

It is possible to run jobs on a cluster, but the integration between Julia and SGE (and job schedulers in general) has not yet reached the desired state. There are a couple of packages around for this purpose, yet more work is needed in this direction. In theory, it should be possible to integrate MPI with Julia, since Julia's parallel computing paradigm has the necessary components for calling MPI (a volunteer must find the time to write a Julia wrapper for the MPI library). In the meantime, Julia's parallel environment is excellent if you want to use it as a stand-alone entity (and if you know the machine hostnames, in case you can avoid using SGE, or in case you have good IT support with SGE).

You may have a look at this post too:

https://groups.google.com/forum/#!searchin/julia-dev/papamarkou/julia-dev/fuATZYYAYK8/8_2rsvdyFakJ

I noticed you are based at UCL, is this correct? Unfortunately, my experience with the IT support provided by the computer science department of UCL has not been ideal with regards to the use of Julia (at least with regards to their cluster and the specific support team I had to deal with). I had opened a ticket, chased the IT team for 3 months by sending weekly emails, they kept me on the waiting list, constantly postponing the matter, and in the end they simply closed the ticket without bothering to look at it!
I would say don't waste your time trying to run a Julia job submitted via SGE on their clusters; the support is hopeless (furthermore, there is not much you can do by yourself, since you wouldn't have the admin privileges to access the SGE configuration, of course). You may try the Legion cluster at UCL though; you may have better support and luck with that, if you are willing to give it a shot.

Hope this helps,
Theo

On Tuesday, 22 April 2014 01:09:07 UTC+1, Thibaut Lamadon wrote:

Dear all, is there a working example of Julia with SGE and MPI?

thank you,

t.

On Friday, 25 October 2013 12:58:39 UTC+1, Theodore Papamarkou wrote:

It took me a while to start using Julia on the departmental cluster, but I got there. I am aware that there are already some packages for cluster management, such as PTools, ClusterManagers and MATLABCluster. My preferred way of dealing with the matter is via the ${PE_HOSTFILE} SGE environment variable. ${PE_HOSTFILE} is set at qsub's runtime and holds the full path to the machine file. So I wrote my own awk script to process this machine file, whose format may differ depending on the cluster-specific configuration, so as to get a Julia-readable machine file consisting of one column with the names of the nodes in it. Then I simply start Julia via "julia --machinefile mymachinefile.txt" from the shell wrapper that I submit via qsub. I will provide an example on GitHub in case someone likes this approach.

On Saturday, 29 June 2013 08:36:06 UTC+1, Viral Shah wrote:

Also see this issue, which talks about abstracting the various cluster schedulers into a separate package. This should make it easier to support more schedulers, and also make it easier to patch these as we go along.
https://github.com/JuliaLang/julia/issues/3549

-viral

On Friday, June 28, 2013 9:40:08 PM UTC+5:30, Theodore Papamarkou wrote:

Many thanks Ben, I noted your hack down to try it out and will get back to you as soon as I do so (which will be rather soon). Anyone else's input is always welcome.

On Friday, June 28, 2013 5:04:29 PM UTC+1, Ben Lauwens wrote:

Hello

I did some debugging and it seems in my case that the environment variables are not set:

    /home/blauwens/julia/usr/bin/julia-release-basic: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.15' not found (required by /home/blauwens/julia/usr/bin/../lib/libjulia-release.so)

Adding the -V argument to qsub,

    qsub_cmd = `echo $home/julia-release-basic --worker` |> `qsub -N JULIA -V -terse -cwd -j y -o $sgedir -t 1:$n`

and putting

    sleep(0.5)

before the success check solves this problem, but I get another one. The output stream file reads:

    bash: module: line 1: syntax error: unexpected end of file
    bash: error importing function definition for `module'
    julia_worker:9009#192.168.1.226

The connection info is in the last line, but it will never be read by the function start_sge_workers. Here is a small hack that does the job:

    while !fexists
        try
            fl = open(fname)
            try
                while !fexists
                    conninfo = readline(fl)
                    hostname, port = parse_connection_info(conninfo)
                    fexists = (hostname != "")
                end
            finally
                close(fl)
            end
        catch
            print(".")
            sleep(0.5)
        end
    end

After these modifications,

    addprocs_sge()

works on an HP cluster running x86_64 GNU/Linux. Some feedback from other SGE users would be useful, and perhaps this hack can be merged into Julia base.
Ben

On Sunday, June 16, 2013 11:34:57 PM UTC+2, Theodore Papamarkou wrote:

Thanks for trying this out Kevin. I tried the same after you and got the same error, although the job was queued:

    julia> addprocs_sge(2)
    ERROR: assertion failed: ?
     in error at error.jl:22
     in assert at error.jl:43
     in success at process.jl:394
     in all at reduce.jl:175
     in success at process.jl:401
     in start_sge_workers at multi.jl:941
     in addprocs_sge at multi.jl:976

    $ qstat -u "ucaktpa"
    job-ID   prior    name   user     state  submit/start at      queue  slots  ja-task-ID
    --------------------------------------------------------------------------------------
    9696992  0.50290  JULIA  ucaktpa  qw     06/16/2013 22:16:14         1      1,2

I checked the line in multi.jl you mentioned, and was thinking that I pass several other options to qsub, e.g. in order to allocate memory or set runtime thresholds (-l h_vmem=8G,vf=8G -l h_rt=0:3:0). It may be good to pass them as extra arguments to start_sge_workers(); alternatively, we could pass a single argument, which could be a configuration file, similar to the MATLAB sample code below:

    sched = findResource('scheduler', 'configuration', configuration);
    pjob = createParallelJob(sched);
    set(pjob, 'MinimumNumberOfWorkers', minNumWorkers);
    set(pjob, 'MaximumNumberOfWorkers', maxNumWorkers);

I will try to trace the addprocs_sge() error message...
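Until such an interface exists, the extra resource flags mentioned above can simply be attached to a hand-written qsub submission. A minimal sketch (run_julia.sh is a hypothetical wrapper script name; the QSUB variable is an assumption added here so the command can be dry-run with QSUB=echo on a machine without SGE):

```shell
#!/bin/sh
# Sketch: submit a Julia wrapper with the resource limits discussed above
# (8 GB virtual memory, 3-minute runtime cap). run_julia.sh is a placeholder;
# override QSUB (e.g. QSUB=echo) to inspect the command without a live SGE.
${QSUB:-qsub} -V -terse -cwd -j y \
    -l h_vmem=8G,vf=8G -l h_rt=0:3:0 \
    run_julia.sh
```

The -V flag exports the caller's environment to the job, which is what fixed the GLIBCXX error in Ben's hack above.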
On Sunday, June 16, 2013 10:05:05 PM UTC+1, Kevin Squire wrote:

The relevant SGE line in $JULIA_HOME/base/multi.jl has

    qsub_cmd = `echo $home/julia-release-basic --worker` | `qsub -N JULIA -terse -cwd -j y -o $sgedir -t 1:$n`

So addprocs_sge() will do the qsub for you. When I just tried it, the workers started okay, but I received an error:

    julia> addprocs_sge(2)
    ERROR: assertion failed: ?
     in error at error.jl:22
     in assert at error.jl:43
     in success at process.jl:392
     in map at abstractarray.jl:1478
     in success at process.jl:394
     in start_sge_workers at multi.jl:1009
     in addprocs_sge at multi.jl:1044

    $ qstat -u "kmsquire"
    job-ID  prior     name   user      state  submit/start at      queue                slots  ja-task-ID
    -----------------------------------------------------------------------------------------------------
    358164  10.50000  JULIA  kmsquire  r      06/16/2013 14:01:52  [email protected]     1      1
    358164  10.50000  JULIA  kmsquire  r      06/16/2013 14:01:52  [email protected]     1      2

Kevin

On Sunday, June 16, 2013 1:31:28 PM UTC-7, Theodore Papamarkou wrote:

The "--machinefile" option and the blog post on distributed numerical optimization are potentially excellent sources to help me, thanks a lot. I will try to make use of them and will post here once I make some progress.

On Sunday, June 16, 2013 9:12:48 PM UTC+1, [email protected] wrote:

I haven't tried to do what you are describing yet, but I know a little. In SGE there should be a file named "machinefile" somewhere. It might be "$TMP/machinefile", but don't quote me.
If you have this file, which contains the hostnames of the nodes, you should be able to pass it to julia on startup with the "--machinefile" option. An example of this is on the Julia blog:

http://julialang.org/blog/2013/04/distributed-numerical-optimization/

I hope that helps a little.

On Sunday, June 16, 2013 5:06:03 AM UTC-5, Theodore Papamarkou wrote:

I want to run a population MCMC simulation using power density estimators on 50 nodes of the departmental cluster, which uses SGE. Each of the 50 nodes realizes a separate MCMC chain. The question generalizes to any parallel job which needs to reserve several nodes. I have found two relevant posts, namely

https://groups.google.com/forum/?fromgroups=#!searchin/julia-users/cluster/julia-users/IATlfsu4VJU/yw1y7N_dPg0J

and

https://groups.google.com/forum/?fromgroups=#!searchin/julia-users/cluster/julia-users/IlPuQSwtTSQ/vpGCPA27uMYJ

but I haven't found a finalized set of instructions yet to achieve the required result.

1. I guess the first step would be to instruct qsub directly to reserve the required number of nodes. Typically, resource requirements are specified with the "-l" option. Is "qsub -l nodes=50" the appropriate invocation?
2. The next question is how the number of processes nprocs() relates to the number of reserved nodes (workers). Should I call the command addprocs(50)? I think this is not good enough, because Julia would not know which nodes to select.
addprocs({"hostname1", "hostname2", ..., "hostname50"}) seems to be the right command. Nevertheless, it is not handy to fetch the hostnames of the reserved nodes. So the natural question becomes whether addprocs_sge(50) does what I have in mind; does this command require step 1, or does it act as a scheduler?
3. Does a command such as remotecall(i, simulateMCMC, MCMCargs) run one of the chains on the i-th node while I am on the head node, i.e. on the node with myid() equal to 1? I mean, do I regulate communication in an MPI fashion by being on a so-called head node? Is this possible, or do I have to change the implementation?

Thank you for any help and feedback - if I manage to make the popMCMC simulation run on the cluster, I will share the code and "howto".
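Pulling the thread together, a submission wrapper along the lines Theo describes with ${PE_HOSTFILE} might look as follows. This is a sketch, not a tested recipe: under SGE, multi-slot reservations normally go through a parallel environment (qsub -pe <pe_name> 50) rather than "-l nodes=50"; the PE name "mpi", the driver script mcmc.jl, and the assumed PE_HOSTFILE field layout ("hostname slots queue processors") are all placeholders that vary per cluster.

```shell
#!/bin/sh
# Hypothetical SGE wrapper, submitted with something like:
#     qsub -pe mpi 50 run_julia.sh
# ("mpi" is a placeholder PE name; ask your admins which PEs exist.)
#$ -N julia_popmcmc
#$ -cwd
#$ -j y

# PE_HOSTFILE points at the scheduler-generated machine file. Assuming the
# common "hostname slots queue processors" layout, repeat each hostname once
# per granted slot so Julia starts one worker per slot.
awk '{ for (i = 0; i < $2; i++) print $1 }' "$PE_HOSTFILE" > machinefile.txt

# Launch Julia with the machine file; mcmc.jl is a placeholder driver script.
julia --machinefile machinefile.txt mcmc.jl
```

From the driver on process 1, a call such as remotecall(i, simulateMCMC, MCMCargs) would then dispatch one chain to worker i, as in question 3.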