[elasticluster] Shorten configuration time
Hi,

ElastiCluster spent nearly two hours configuring a cluster with 37 nodes. Considering that I am going to use a 1000-node cluster, this means a lot of time, and hence money, just for configuration. Is there a way to speed up the configuration? Or is it possible to skip some installations to save time?

Regards,
Orxan

-- You received this message because you are subscribed to the Google Groups "elasticluster" group. To unsubscribe from this group and stop receiving emails from it, send an email to elasticluster+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
Re: [elasticluster] Shorten configuration time
> Start your large cluster from node snapshots

I already use a custom image, but I don't differentiate between frontend and compute nodes; they both use the same custom image (or snapshot, assuming they are basically the same thing).

> Use larger nodes

Unfortunately, multi-core nodes aren't useful for me because I am testing the scalability of my program, so each node should spend the same amount of time on communication. Intra-node communication would spoil the results, since it is much faster than inter-node communication.

> Do you have any deadlines for your 1000-node cluster?

I am at the end of my PhD work, so I should finish the simulations ASAP. I am happy to hear that you are going to improve configuration time, but even if you pull configuration time from 20 min down to 10 min per 10 nodes, for 1000 nodes this still means many hours, which is not acceptable if the simulation itself takes 1 hour to complete. I am just pointing out that cloud HPC is not cost-efficient in the development and testing stage, when frequent (parallel) debugging is needed and the cluster cannot be kept running but must be shut down immediately after use to save money. But validated codes would benefit a lot from an improvement in configuration time.

On Tue, May 22, 2018 at 12:47 PM, Riccardo Murri wrote:

> Hi Orxan, all,
>
> > Elasticluster spent nearly two hours for configuration of a cluster with
> > 37 nodes.
>
> Yes, this is definitely a pain point with ElastiCluster/Ansible ATM.
> I'll try to summarize the issue and give some suggestions here.
>
> My rule of thumb for the time it takes to set up a basic SLURM cluster
> with ElastiCluster is ~20 minutes every 10 nodes; that can quickly
> become ~25 per 10 nodes if you are installing add-on software (e.g.,
> Ganglia) or if you have very bad SSH connection latency. I'd say your
> experience of 2 hrs per ~40 nodes is in that ballpark.
>
> > Considering that I am going to use a 1000-node cluster this means a
> > lot of time hence money for just configuration.
> > Is there a way to speed up the
> > configuration time?
>
> Yes: give me part of the money to work on scalability features :-)
>
> Seriously, what you can do *now* to cut down setup time (in decreasing
> order of effectiveness):
>
> * Start your large cluster from node snapshots:
>
>   1. Create a cluster like the one you are about to start, but much
>      smaller (1 frontend + 1 compute node is enough).
>   2. Make snapshots of the frontend and the compute node (and any
>      other node type you are using, e.g., GlusterFS data servers).
>   3. Modify the large cluster configuration to use these snapshots
>      instead of the base OS images:
>
>      [cluster/my-large-cluster]
>      # ... usual config ...
>
>      [cluster/my-large-cluster/frontend]
>      image_id = id-of-frontend-snapshot
>
>      [cluster/my-large-cluster/compute]
>      image_id = id-of-compute-snapshot
>
>   This allows Ansible to "fast forward" over many time-consuming tasks
>   (e.g., installation of packages).
>
> * Use larger nodes -- setup time scales linearly with the number of
>   *nodes*, so you can get a cluster with the same number of cores but
>   fewer nodes (hence quicker to set up) by using larger nodes.
>
> * Set the environment variable ANSIBLE_FORKS to a higher value:
>   ElastiCluster defaults to ANSIBLE_FORKS=10, but you should be able to
>   set this to 4x or 6x the number of cores in your ElastiCluster VM
>   safely. This allows more nodes to be set up at the same time.
>
> Lastly, I can make more stuff optional (e.g., the "HPC standard" stuff)
> -- there was some discussion on this mailing list quite some time ago,
> where people basically suggested that the basic install be kept as
> minimal as possible. I have not given this task much priority up to
> now, but it can be done relatively quickly. Do you have any deadlines
> for your 1000-node cluster?
>
> More details and current plans for overcoming the issue at:
> https://github.com/gc3-uzh-ch/elasticluster/issues/365
>
> I'd be glad for any suggestions and a more in-depth discussion.
> Ciao,
> R
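The snapshot-plus-forks recipe above can be sketched as a shell session. This is a hypothetical walk-through, not a tested recipe: the cluster names and the forks value are placeholders, and the snapshot step itself happens with your cloud provider's own tools.

```shell
# 1. Build a minimal pilot cluster (1 frontend + 1 compute) and
#    snapshot its nodes in the cloud console/CLI.
elasticluster start my-pilot-cluster

# 2. Point image_id in the [cluster/my-large-cluster/frontend] and
#    [cluster/my-large-cluster/compute] config sections at the
#    snapshot IDs, as shown in the message above.

# 3. Raise Ansible's parallelism before starting the large cluster
#    (ElastiCluster defaults to 10 forks).
export ANSIBLE_FORKS=40   # e.g. 4-6x the cores of the control machine
elasticluster start my-large-cluster
```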
Re: [elasticluster] SLURM is not installed after cluster setup
Initially the permissions were like this:

drwxrwxr-x 2 orhan orhan 4096 Şub  3 21:24 /home/orhan/.ansible
drwxrwxr-x 3 orhan orhan 4096 Şub  4 16:15 /home/orhan/.elasticluster
drwx------ 2 orhan orhan 4096 Oca 29 19:57 /home/orhan/.ssh

After the commands it became:

drwxrwxrwx 2 orhan orhan 4096 Şub  3 21:24 /home/orhan/.ansible
drwxrwxrwx 3 orhan orhan 4096 Şub  4 16:15 /home/orhan/.elasticluster
drwx---rwx 2 orhan orhan 4096 Oca 29 19:57 /home/orhan/.ssh

However, that Errno 13 is still there. The error message is as follows:

'import sitecustomize' failed; use -v for traceback
Traceback (most recent call last):
  File "/usr/local/bin/ansible-playbook", line 43, in <module>
    import ansible.constants as C
  File "/usr/local/lib/python2.7/site-packages/ansible/constants.py", line 202, in <module>
    DEFAULT_LOCAL_TMP = get_config(p, DEFAULTS, 'local_tmp', 'ANSIBLE_LOCAL_TEMP', '~/.ansible/tmp', value_type='tmppath')
  File "/usr/local/lib/python2.7/site-packages/ansible/constants.py", line 109, in get_config
    makedirs_safe(value, 0o700)
  File "/usr/local/lib/python2.7/site-packages/ansible/utils/path.py", line 71, in makedirs_safe
    raise AnsibleError("Unable to create local directories(%s): %s" % (to_native(rpath), to_native(e)))
ansible.errors.AnsibleError: Unable to create local directories(/home/.ansible/tmp): [Errno 13] Permission denied: '/home/.ansible'

2018-02-04 15:56:38 cfeda8a7b8b3 gc3.elasticluster[1] ERROR Command `ansible-playbook /home/elasticluster/share/playbooks/site.yml --inventory=/home/orhan/.elasticluster/storage/slurm-on-gce.inventory --become --become-user=root -vv` failed with exit code 1.
2018-02-04 15:56:38 cfeda8a7b8b3 gc3.elasticluster[1] ERROR Check the output lines above for additional information on this error.
2018-02-04 15:56:38 cfeda8a7b8b3 gc3.elasticluster[1] ERROR The cluster has likely *not* been configured correctly. You may need to re-run `elasticluster setup` or fix the playbooks.
2018-02-04 15:56:38 cfeda8a7b8b3 gc3.elasticluster[1] WARNING Cluster `slurm-on-gce` not yet configured. Please, re-run `elasticluster setup slurm-on-gce` and/or check your configuration

Orhan

On Sun, Feb 4, 2018 at 3:36 PM, Riccardo Murri wrote:

> Dear Orxan,
>
> the following subdirectories of your home directory should be owned
> and writable by your Linux account (which is `rmurri` in my case):
>
>     $ ls -ld $HOME/.ansible $HOME/.elasticluster $HOME/.ssh
>     drwxrwxr-x 5 rmurri rmurri 4096 feb  2  2015 /home/rmurri/.ansible
>     drwxrwxr-x 3 rmurri rmurri 4096 feb  3 21:15 /home/rmurri/.elasticluster
>     drwxr-xr-x 3 rmurri rmurri 4096 gen 19 16:29 /home/rmurri/.ssh
>
> If they aren't, try running the following commands to fix the permissions:
>
>     sudo chown -v -R $(whoami) $HOME/.ansible $HOME/.elasticluster $HOME/.ssh
>     sudo chmod -v o+rwX $HOME/.ansible $HOME/.elasticluster $HOME/.ssh
>
> If it still doesn't work, please post the output of the above two
> commands along with the error message produced by ElastiCluster.
>
> Ciao,
> R
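Note that the traceback above is trying to create `/home/.ansible/tmp`, i.e. `$HOME` resolves to `/home` rather than the real home directory (the commands run inside a container here). Since the same traceback shows Ansible reading the `ANSIBLE_LOCAL_TEMP` environment variable for this path, one hedged workaround (an untested sketch, not part of the original thread) is to point the temp directory somewhere writable before running ElastiCluster:

```shell
# Workaround sketch: override Ansible's local temp dir (the env var
# name comes straight from the traceback above) so it no longer
# depends on a mis-resolved $HOME.
export ANSIBLE_LOCAL_TEMP=/tmp/ansible-local-tmp
mkdir -p "$ANSIBLE_LOCAL_TEMP"
```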
Re: [elasticluster] SLURM is not installed after cluster setup
The `sudo` issue is solved but [Errno 13] is still there. Output is attached.

Orhan

On Sun, Feb 4, 2018 at 2:31 PM, Riccardo Murri <riccardo.mu...@gmail.com> wrote:

> 2018-02-04 12:15 GMT+01:00 Orxan Shibliyev <orxan.shi...@gmail.com>:
> > The second command gave:
> >
> > orhan@orhan-MS-7850:~$ ./elasticluster.sh -vvv start slurm-on-gce
> > docker: Got permission denied while trying to connect to the Docker daemon
> > socket at unix:///var/run/docker.sock: Post
> > http://%2Fvar%2Frun%2Fdocker.sock/v1.31/containers/create: dial unix
> > /var/run/docker.sock: connect: permission denied.
>
> Then you probably need to add yourself to the `docker` group:
>
>     sudo gpasswd -a $(whoami) docker
>
> Note: replace `docker` above with whatever group owns the socket
> `/var/run/docker.sock`.
>
> You might need to log out and back in for the group change to be
> picked up; or run `newgrp docker` to get a shell with the correct
> permissions.
>
> Please let me know if it works, so I can automate this in the
> `elasticluster.sh` script.
>
> Ciao,
> R

orhan@orhan-MS-7850:~$ ./elasticluster.sh -vvv start slurm-on-gce
2018-02-04 14:54:34 41e0a6cea578 gc3.elasticluster[1] DEBUG Checking section `cluster/slurm-on-gce` ...
2018-02-04 14:54:34 41e0a6cea578 gc3.elasticluster[1] DEBUG Checking section `cluster/gridengine-on-gce` ...
2018-02-04 14:54:34 41e0a6cea578 gc3.elasticluster[1] DEBUG Checking section `login/google` ...
2018-02-04 14:54:34 41e0a6cea578 gc3.elasticluster[1] DEBUG Checking section `setup/gridengine` ...
2018-02-04 14:54:34 41e0a6cea578 gc3.elasticluster[1] DEBUG Checking section `setup/slurm` ...
2018-02-04 14:54:34 41e0a6cea578 gc3.elasticluster[1] DEBUG Checking section `setup/pbs` ...
2018-02-04 14:54:34 41e0a6cea578 gc3.elasticluster[1] DEBUG Checking section `cloud/google` ...
2018-02-04 14:54:35 41e0a6cea578 gc3.elasticluster[1] DEBUG Using class from module to instanciate provider 'google'
2018-02-04 14:54:35 41e0a6cea578 gc3.elasticluster[1] DEBUG Using class from module to instanciate provider 'ansible'
2018-02-04 14:54:35 41e0a6cea578 gc3.elasticluster[1] DEBUG setting variable multiuser_cluster=yes for node kind compute
2018-02-04 14:54:35 41e0a6cea578 gc3.elasticluster[1] DEBUG setting variable multiuser_cluster=yes for node kind frontend
2018-02-04 14:54:35 41e0a6cea578 gc3.elasticluster[1] DEBUG setting variable multiuser_cluster=yes for node kind submit
Starting cluster `slurm-on-gce` with:
* 1 frontend nodes.
* 2 compute nodes.
(This may take a while...)
2018-02-04 14:54:35 41e0a6cea578 gc3.elasticluster[1] INFO Starting cluster nodes ...
2018-02-04 14:54:35 41e0a6cea578 gc3.elasticluster[1] DEBUG Note: starting 3 nodes concurrently.
2018-02-04 14:54:35 41e0a6cea578 gc3.elasticluster[1] DEBUG _start_node: working on node `frontend001`
2018-02-04 14:54:35 41e0a6cea578 gc3.elasticluster[1] INFO Starting node `frontend001` from image `ubuntu-1604-xenial-v20180126` with flavor n1-standard-1 ...
2018-02-04 14:54:35 41e0a6cea578 gc3.elasticluster[1] DEBUG _start_node: working on node `compute002`
2018-02-04 14:54:35 41e0a6cea578 gc3.elasticluster[1] DEBUG _start_node: working on node `compute001`
2018-02-04 14:54:35 41e0a6cea578 gc3.elasticluster[1] INFO Starting node `compute002` from image `ubuntu-1604-xenial-v20180126` with flavor n1-standard-1 ...
2018-02-04 14:54:35 41e0a6cea578 gc3.elasticluster[1] INFO Starting node `compute001` from image `ubuntu-1604-xenial-v20180126` with flavor n1-standard-1 ...
2018-02-04 14:54:47 41e0a6cea578 gc3.elasticluster[1] DEBUG Node `compute002` has instance ID `slurm-on-gce-compute002`
2018-02-04 14:54:47 41e0a6cea578 gc3.elasticluster[1] INFO Node `compute002` has been started.
2018-02-04 14:55:16 41e0a6cea578 gc3.elasticluster[1] DEBUG Node `frontend001` has instance ID `slurm-on-gce-frontend001`
2018-02-04 14:55:16 41e0a6cea578 gc3.elasticluster[1] INFO Node `frontend001` has been started.
2018-02-04 14:55:20 41e0a6cea578 gc3.elasticluster[1] DEBUG Node `compute001` has instance ID `slurm-on-gce-compute001`
2018-02-04 14:55:20 41e0a6cea578 gc3.elasticluster[1] INFO Node `compute001` has been started.
2018-02-04 14:55:20 41e0a6cea578 gc3.elasticluster[1] DEBUG Getting information for instance slurm-on-gce-compute002
2018-02-04 14:55:20 41e0a6cea578 gc3.elasticluster[1] DEBUG node `compute002` (instance id slurm-on-gce-compute002) is up.
2018-02-04 14:55:21 41e0a6cea578 gc3.elasticluster[1] DEBUG Getting information for instance slurm-on-gce-frontend001
2018-02-04 14:55:21 41e0a6cea578
[elasticluster] sinfo gives wrong number of nodes after resize
Hi,

Initially I made one frontend and two compute nodes. On the frontend, `sinfo` reported the number of nodes as two. Then I added five more compute nodes with `./elasticluster.sh resize -a 5:compute slurm-on-gce`. As expected, I got the compute nodes; however, on the frontend, `sinfo` still gives the same information, that is, two nodes.

Orhan
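No answer is recorded in this thread, but a plausible explanation (an assumption on my part, not confirmed here) is that `resize` only creates the VMs, while SLURM's node list lives in `slurm.conf`, which is rendered during `setup`. A hedged sketch of the usual remedy:

```shell
# Re-run the configuration step so the SLURM config is re-rendered
# with the five new compute nodes included.
./elasticluster.sh setup slurm-on-gce

# Then, on the frontend, make the controller re-read its config
# if sinfo is still stale (standard SLURM command).
sudo scontrol reconfigure
```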
Re: [elasticluster] SLURM sbatch error
Your test does not work for me, and restarting SLURM does not help. The base OS is Debian GNU/Linux 9.4 (stretch). I get errors related to Lmod:

TASK [lmod : Is installation directory writable?] **
fatal: [compute003]: FAILED! => {"changed": true, "cmd": ["test", "-w", "/opt/lmod/7.0/"], "delta": "0:00:00.010908", "end": "2018-04-19 14:05:07.669722", "failed": true, "rc": 1, "start": "2018-04-19 14:05:07.658814", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring
fatal: [compute002]: FAILED! => {"changed": true, "cmd": ["test", "-w", "/opt/lmod/7.0/"], "delta": "0:00:00.035474", "end": "2018-04-19 14:05:08.090735", "failed": true, "rc": 1, "start": "2018-04-19 14:05:08.055261", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring

and other errors such as these:

compute001  : ok=7    changed=1   unreachable=0   failed=1
compute002  : ok=121  changed=79  unreachable=0   failed=0
compute003  : ok=121  changed=79  unreachable=0   failed=0
frontend001 : ok=124  changed=87  unreachable=0   failed=0

Command `ansible-playbook --private-key=/home/orhan/.ssh/google_compute_engine /home/elasticluster/share/playbooks/site.yml --inventory=/home/orhan/.elasticluster/storage/slurm-on-gce.inventory --become --become-user=root -e elasticluster_output_dir=/tmp/elasticluster.2WFV9u.d` failed with exit code 2.

I think in my previous tries only the Lmod-related errors existed. For some reason I treated them as warnings instead of errors.
Config:

[cloud/google]
noauth_local_webserver=yes
provider=google
gce_client_id=<>
gce_client_secret=<>
gce_project_id=tailor-193612

[login/google]
image_user=orxan.shibli
image_sudo=yes
user_key_name=elasticluster
user_key_private=~/.ssh/google_compute_engine
user_key_public=~/.ssh/google_compute_engine.pub

[setup/slurm]
frontend_groups=slurm_master
compute_groups=slurm_worker
submit_groups=slurm_submit,glusterfs_client
global_var_multiuser_cluster=yes

[cluster/slurm-on-gce]
setup=slurm
frontend_nodes=1
compute_nodes=3
ssh_to=frontend
cloud=google
login=google
flavor=n1-standard-1
security_group=default
image_id=https://www.googleapis.com/compute/v1/projects/tailor-193612/global/images/image-23

On Thu, Apr 19, 2018 at 3:17 PM, Riccardo Murri wrote:

> Hello Orxan,
>
> I cannot reproduce this error; with a freshly-started Ubuntu 16.04
> cluster, I get:
>
>     ubuntu@frontend001:~$ cat test.sh
>     #! /bin/sh
>
>     echo hello
>
>     ubuntu@frontend001:~$ sbatch test.sh
>     Submitted batch job 2
>
>     ubuntu@frontend001:~$ cat slurm-2.out
>     hello
>
> One caveat: right after building the cluster, the SLURM controller
> daemon was not running -- I had to restart it with "sudo service
> slurmctld restart".
>
> Did you get any errors while building the cluster? What base OS are
> you using? What config?
>
> Ciao,
> R
[elasticluster] SLURM sbatch error
Hi,

The very same `sbatch` script gave an error after `sbatch submit.sh`:

Error message:

sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

submit.sh:

#!/bin/bash
#SBATCH --nodes=3-3
#SBATCH --ntasks=3
#SBATCH -t 10:00:0
#SBATCH --output=slurm-%j.out
#SBATCH --error=slurm-%j.err
#SBATCH --mem 3000

mpirun ./out 3

It also does not matter if I add an account and partition:

#SBATCH -A orxan_shibli
#SBATCH -p main

On the frontend, I tried things like:

sudo service slurmctld restart
sudo service slurmd restart

but nothing helped. I recently re-installed ElastiCluster. I did not get this error with the same script in the previous installation. Has something changed about the SLURM configuration recently?

Regards,
Orxan
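The "Invalid account or account/partition combination" message usually comes from SLURM's accounting database rather than from the batch script itself. A hedged diagnostic sketch, assuming accounting is enabled on the cluster (these are standard `sacctmgr`/`sinfo` commands, not ElastiCluster-specific, and would be run on the frontend):

```shell
# Which account/partition combinations is the submitting user
# actually associated with?
sacctmgr show associations format=Account,User,Partition

# Which partitions exist, are they available, and how many nodes
# does each have?
sinfo -o "%P %a %D"
```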
[elasticluster] ERR_CONNECTION_REFUSED
I have been using ElastiCluster with great pleasure. After a dist-upgrade I wanted to install ElastiCluster again. I have an issue which I somehow solved in the first installation, but this time I don't remember how I did it. I think this is related to Google, but after some research I am still clueless. Basically, I get the following:

Your browser has been opened to visit:

If your browser is on a different machine then exit and re-run this application with the command-line parameter --noauth_local_webserver

When I open the link and allow access, I get:

This site can't be reached
localhost refused to connect.
Search Google for localhost 8080
ERR_CONNECTION_REFUSED
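ElastiCluster's Google cloud section accepts a `noauth_local_webserver` option (it appears in a config excerpt elsewhere in this archive), which matches the hint in the quoted message: skip the `localhost:8080` OAuth redirect and paste the verification code into the terminal instead. A hedged config sketch; the `gce_*` values are placeholders for your own settings:

```ini
[cloud/google]
# Skip the localhost:8080 OAuth redirect entirely and paste the
# verification code into the terminal instead.
noauth_local_webserver=yes
provider=google
# ... your gce_client_id / gce_client_secret / gce_project_id ...
```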
[elasticluster] Error: Ensure the APT package cache is updated
Until now ElastiCluster was working perfectly. I have changed nothing, but today I got the following error from "./elasticluster.sh start slurm-on-gce". What is the problem?

...
TASK [common : Ensure the APT package cache is updated]
fatal: [frontend001]: FAILED! => {"changed": false, "failed": true, "msg": "Failed to update apt cache."}
fatal: [compute005]: FAILED! => {"changed": false, "failed": true, "msg": "Failed to update apt cache."}
fatal: [compute006]: FAILED! => {"changed": false, "failed": true, "msg": "Failed to update apt cache."}
fatal: [compute007]: FAILED! => {"changed": false, "failed": true, "msg": "Failed to update apt cache."}
fatal: [compute003]: FAILED! => {"changed": false, "failed": true, "msg": "Failed to update apt cache."}
fatal: [compute001]: FAILED! => {"changed": false, "failed": true, "msg": "Failed to update apt cache."}
fatal: [compute002]: FAILED! => {"changed": false, "failed": true, "msg": "Failed to update apt cache."}
fatal: [compute009]: FAILED! => {"changed": false, "failed": true, "msg": "Failed to update apt cache."}
fatal: [compute004]: FAILED! => {"changed": false, "failed": true, "msg": "Failed to update apt cache."}
fatal: [compute008]: FAILED! => {"changed": false, "failed": true, "msg": "Failed to update apt cache."}
fatal: [compute010]: FAILED! => {"changed": false, "failed": true, "msg": "Failed to update apt cache."}
[WARNING]: Could not create retry file '/home/elasticluster/share/playbooks/site.retry'.
[Errno 13] Permission denied: u'/home/elasticluster/share/playbooks/site.retry'

PLAY RECAP *
compute001  : ok=4  changed=0  unreachable=0  failed=1
compute002  : ok=4  changed=0  unreachable=0  failed=1
compute003  : ok=4  changed=0  unreachable=0  failed=1
compute004  : ok=4  changed=0  unreachable=0  failed=1
compute005  : ok=4  changed=0  unreachable=0  failed=1
compute006  : ok=4  changed=0  unreachable=0  failed=1
compute007  : ok=4  changed=0  unreachable=0  failed=1
compute008  : ok=4  changed=0  unreachable=0  failed=1
compute009  : ok=4  changed=0  unreachable=0  failed=1
compute010  : ok=4  changed=0  unreachable=0  failed=1
frontend001 : ok=4  changed=0  unreachable=0  failed=1

2018-04-02 13:08:43 fe971db10edf gc3.elasticluster[1] ERROR Command `ansible-playbook /home/elasticluster/share/playbooks/site.yml --inventory=/home/orhan/.elasticluster/storage/slurm-on-gce.inventory --become --become-user=root` failed with exit code 2.
2018-04-02 13:08:43 fe971db10edf gc3.elasticluster[1] ERROR Check the output lines above for additional information on this error.
2018-04-02 13:08:43 fe971db10edf gc3.elasticluster[1] ERROR The cluster has likely *not* been configured correctly. You may need to re-run `elasticluster setup` or fix the playbooks.
2018-04-02 13:08:43 fe971db10edf gc3.elasticluster[1] WARNING Cluster `slurm-on-gce` not yet configured. Please, re-run `elasticluster setup slurm-on-gce` and/or check your configuration
...
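The Ansible error hides apt's actual output. One hedged way to see it (an untested sketch; `elasticluster ssh` opens a shell on the cluster's `ssh_to` node):

```shell
# Log in to the cluster and run the update by hand to see the real
# apt error (dead mirror, no outbound network, stale release file, ...).
./elasticluster.sh ssh slurm-on-gce

# then, on the node:
sudo apt-get update
```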
[elasticluster] Different flavors for frontend and compute nodes
Is it possible to use different flavors for the frontend and compute nodes? I want a high-memory machine for the frontend and lower memory for the computes. I use GCE machines.
Re: [elasticluster] Different flavors for frontend and compute nodes
Thanks for the reply. I just realized that submitting a 1-node job with SLURM launches compute001 even though I did not specify the node. It seems like ElastiCluster does not intend to use the frontend for computations. I don't know if I should ask this question in a separate post, but is it possible to run the high-memory job on the frontend instead of a compute node? If that is not possible, how can I make two sections for compute nodes instead of sectioning compute and frontend?

On Tue, Dec 18, 2018 at 1:13 PM Manuele Simi wrote:

> Yes, it is possible.
>
> You need to define `cluster/<name>/<group>` sections that override the
> `cluster/<name>` section.
>
> In the following example I create specific configurations (with their own
> flavor) for the nodes in the compute and frontend groups of a cluster named
> "gridengine".
>
> # Cluster section
> [cluster/gridengine]
> ...
> frontend_nodes=1
> compute_nodes=3
>
> # Compute node section
> [cluster/gridengine/compute]
> flavor=n1-highcpu-2
> ...
>
> # Frontend node section
> [cluster/gridengine/frontend]
> flavor=n1-standard-64
> ...
>
> On Tue, Dec 18, 2018 at 8:46 AM Orxan Shibliyev wrote:
>
>> Is it possible to take different flavors for frontend and compute nodes?
>> I want high memory machine for frontend and lower memory for computes. I
>> use GCE machines.
[elasticluster] Re: SLURM: Unable to contact slurm controller
Please disregard my previous post. I hadn't even constructed a cluster, just a single instance. Sorry for taking your time.

On Thu, Dec 20, 2018 at 1:44 PM Orxan Shibliyev wrote:

> For some reason I get "Unable to contact slurm controller (connect
> failure)" for any SLURM command. I constructed the cluster as usual but this
> time it gives the mentioned error. What could be the reason?
[elasticluster] SLURM: Unable to contact slurm controller
For some reason I get "Unable to contact slurm controller (connect failure)" for any SLURM command. I constructed the cluster as usual, but this time it gives the mentioned error. What could be the reason?
[elasticluster] Elasticluster copies files before job submission
When I run a job setting the number of nodes to 1 and the number of tasks to 1 as well, naturally only compute001 runs the job. Then I run an 8-node job and I see that the output of the first job, which ran on compute001, is also available on the other nodes. Does ElastiCluster copy files among compute nodes before job submission?
Re: [elasticluster] Elasticluster copies files before job submission
So when a node produces a file, the file will be copied to all other nodes, right? What if nodes produce files with the same name but different content? Which file will be read by a node?

On Fri, Dec 21, 2018 at 3:57 PM Riccardo Murri wrote:

> Hello Orhan,
>
> > When I run a job by setting number of nodes to 1 and number of tasks to
> > 1 as well, naturally, only compute001 runs the job. Then I run 8-node job
> > and I see that output of first job which ran on compute001 are also
> > available on other nodes. Does elasticluster copy files among compute nodes
> > before job submission?
>
> Home directories in the cluster are shared -- jobs see the same files
> on all compute nodes across the cluster. (This makes it easier to read
> and exchange data, as you do not need to stage it prior to running the
> job.)
>
> Does this answer your question?
>
> Ciao,
> R
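Riccardo's point about shared home directories also answers the follow-up question: nothing is copied between nodes; there is a single file per path, so two jobs writing the same filename overwrite each other and every node sees only the last writer's content. A minimal local sketch of that behaviour (plain shell, with a temp directory standing in for the NFS-shared home):

```shell
# Two "jobs" writing the same path on a shared filesystem: the second
# write replaces the first, and any subsequent reader sees only the
# last writer's content.
SHARED=$(mktemp -d)   # stands in for the shared home directory

echo "output of job on compute001" > "$SHARED/result.txt"
echo "output of job on compute002" > "$SHARED/result.txt"

cat "$SHARED/result.txt"   # prints: output of job on compute002
```

In practice this is why job scripts embed a unique tag such as the job ID in output filenames, as in the `--output=slurm-%j.out` line of the submit.sh earlier in this archive.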