[elasticluster] Shorten configuration time
Hi,

ElastiCluster spent nearly two hours configuring a cluster with 37 nodes. Considering that I am going to use a 1000-node cluster, this means a lot of time, and hence money, just for configuration. Is there a way to speed up the configuration? Or is it possible to skip some installations to save time?

Regards,
Orxan

-- You received this message because you are subscribed to the Google Groups "elasticluster" group. To unsubscribe from this group and stop receiving emails from it, send an email to elasticluster+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
Re: [elasticluster] Shorten configuration time
> Start your large cluster from node snapshots

I already use a custom image, but I don't differentiate between frontend and compute nodes; they both use the same custom image (or snapshot, assuming they are basically the same thing).

> Use larger nodes

Unfortunately, multi-core nodes aren't useful for me because I am testing the scalability of my program, so each node should spend the same amount of time on communication. Intra-node communication would spoil the results, since it is much faster than inter-node communication.

> Do you have any deadlines for your 1000-node cluster?

I am at the end of my PhD work, so I should finish the simulations ASAP. I am happy to hear that you are going to improve configuration time, but even if you pull configuration time from 20 min down to 10 min per 10 nodes, for 1000 nodes this still means many hours, which is not acceptable if the simulation itself takes 1 hour to complete. I am just pointing out that cloud HPC is not cost-efficient in the development and testing stage, when frequent (parallel) debugging is needed and the cluster cannot be kept running but must be shut down immediately after use to save money. But validated codes would benefit a lot from an improvement in configuration time.

On Tue, May 22, 2018 at 12:47 PM, Riccardo Murri wrote:

> Hi Orxan, all,
>
> > Elasticluster spent nearly two hours for configuration of a cluster with
> > 37 nodes.
>
> Yes, this is definitely a pain point with ElastiCluster/Ansible ATM.
> I'll try to summarize the issue and give some suggestions here.
>
> My rule of thumb for the time it takes to set up a basic SLURM cluster
> with ElastiCluster is ~20 minutes every 10 nodes; that can quickly
> become ~25 per 10 nodes if you are installing add-on software (e.g.,
> Ganglia) or if you have very bad SSH connection latency. I'd say your
> experience of 2 hrs per ~40 nodes is in that ballpark.
>
> > Considering that I am going to use a 1000-node cluster this means a
> > lot of time hence money for just configuration.
> > Is there a way to speed up the
> > configuration time?
>
> Yes: give me part of the money to work on scalability features :-)
>
> Seriously, what you can do *now* to cut down setup time (in decreasing
> order of effectiveness):
>
> * Start your large cluster from node snapshots:
>
>   1. Create a cluster like the one you are about to start, but much
>      smaller (1 frontend + 1 compute node is enough).
>   2. Make snapshots of the frontend and the compute node (and any
>      other node type you are using, e.g., GlusterFS data servers).
>   3. Modify the large cluster configuration to use these snapshots
>      instead of the base OS images:
>
>      [cluster/my-large-cluster]
>      # ... usual config ...
>
>      [cluster/my-large-cluster/frontend]
>      image_id = id-of-frontend-snapshot
>
>      [cluster/my-large-cluster/compute]
>      image_id = id-of-compute-snapshot
>
>   This allows Ansible to "fast forward" over many time-consuming tasks
>   (e.g., installation of packages).
>
> * Use larger nodes -- setup time scales linearly with the number of
>   *nodes*, so you can get a cluster with the same number of cores but
>   fewer nodes (hence quicker to set up) by using larger nodes.
>
> * Set the environment variable ANSIBLE_FORKS to a higher value:
>   ElastiCluster defaults to ANSIBLE_FORKS=10, but you should be able to
>   set this to 4x or 6x the number of cores in your ElastiCluster VM
>   safely. This allows more nodes to be set up at the same time.
>
> Lastly, I can make more stuff optional (e.g., the "HPC standard" stuff)
> -- there was some discussion on this mailing list quite some time ago,
> where people basically suggested that the basic install be kept as
> minimal as possible. I have not given this task much priority up to
> now, but it can be done relatively quickly. Do you have any deadlines
> for your 1000-node cluster?
>
> More details and current plans for overcoming the issue at:
> https://github.com/gc3-uzh-ch/elasticluster/issues/365
>
> I'd be glad for any suggestions and a more in-depth discussion.
> Ciao,
> R
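The snapshot-plus-forks recipe above can be sketched as a shell session. This is a hypothetical walk-through, not a tested recipe: the cluster names and the forks value are placeholders, and the snapshot step itself happens with your cloud provider's own tools.

```shell
# 1. Build a minimal pilot cluster (1 frontend + 1 compute) and
#    snapshot its nodes in the cloud console/CLI.
elasticluster start my-pilot-cluster

# 2. Point image_id in the [cluster/my-large-cluster/frontend] and
#    [cluster/my-large-cluster/compute] config sections at the
#    snapshot IDs, as shown in the message above.

# 3. Raise Ansible's parallelism before starting the large cluster
#    (ElastiCluster defaults to 10 forks).
export ANSIBLE_FORKS=40   # e.g. 4-6x the cores of the control machine
elasticluster start my-large-cluster
```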
Re: [elasticluster] SLURM is not installed after cluster setup
Initially the permissions were like this:

drwxrwxr-x 2 orhan orhan 4096 Şub  3 21:24 /home/orhan/.ansible
drwxrwxr-x 3 orhan orhan 4096 Şub  4 16:15 /home/orhan/.elasticluster
drwx------ 2 orhan orhan 4096 Oca 29 19:57 /home/orhan/.ssh

After the commands it became:

drwxrwxrwx 2 orhan orhan 4096 Şub  3 21:24 /home/orhan/.ansible
drwxrwxrwx 3 orhan orhan 4096 Şub  4 16:15 /home/orhan/.elasticluster
drwx---rwx 2 orhan orhan 4096 Oca 29 19:57 /home/orhan/.ssh

However, that Errno 13 is still there. The error message is as follows:

'import sitecustomize' failed; use -v for traceback
Traceback (most recent call last):
  File "/usr/local/bin/ansible-playbook", line 43, in <module>
    import ansible.constants as C
  File "/usr/local/lib/python2.7/site-packages/ansible/constants.py", line 202, in <module>
    DEFAULT_LOCAL_TMP = get_config(p, DEFAULTS, 'local_tmp', 'ANSIBLE_LOCAL_TEMP', '~/.ansible/tmp', value_type='tmppath')
  File "/usr/local/lib/python2.7/site-packages/ansible/constants.py", line 109, in get_config
    makedirs_safe(value, 0o700)
  File "/usr/local/lib/python2.7/site-packages/ansible/utils/path.py", line 71, in makedirs_safe
    raise AnsibleError("Unable to create local directories(%s): %s" % (to_native(rpath), to_native(e)))
ansible.errors.AnsibleError: Unable to create local directories(/home/.ansible/tmp): [Errno 13] Permission denied: '/home/.ansible'

2018-02-04 15:56:38 cfeda8a7b8b3 gc3.elasticluster[1] ERROR Command `ansible-playbook /home/elasticluster/share/playbooks/site.yml --inventory=/home/orhan/.elasticluster/storage/slurm-on-gce.inventory --become --become-user=root -vv` failed with exit code 1.
2018-02-04 15:56:38 cfeda8a7b8b3 gc3.elasticluster[1] ERROR Check the output lines above for additional information on this error.
2018-02-04 15:56:38 cfeda8a7b8b3 gc3.elasticluster[1] ERROR The cluster has likely *not* been configured correctly. You may need to re-run `elasticluster setup` or fix the playbooks.
2018-02-04 15:56:38 cfeda8a7b8b3 gc3.elasticluster[1] WARNING Cluster `slurm-on-gce` not yet configured. Please, re-run `elasticluster setup slurm-on-gce` and/or check your configuration

Orhan

On Sun, Feb 4, 2018 at 3:36 PM, Riccardo Murri wrote:

> Dear Orxan,
>
> the following subdirectories of your home directory should be owned
> and writable by your Linux account (which is `rmurri` in my case):
>
>     $ ls -ld $HOME/.ansible $HOME/.elasticluster $HOME/.ssh
>     drwxrwxr-x 5 rmurri rmurri 4096 feb  2  2015 /home/rmurri/.ansible
>     drwxrwxr-x 3 rmurri rmurri 4096 feb  3 21:15 /home/rmurri/.elasticluster
>     drwxr-xr-x 3 rmurri rmurri 4096 gen 19 16:29 /home/rmurri/.ssh
>
> If they aren't, try running the following commands to fix the permissions:
>
>     sudo chown -v -R $(whoami) $HOME/.ansible $HOME/.elasticluster $HOME/.ssh
>     sudo chmod -v o+rwX $HOME/.ansible $HOME/.elasticluster $HOME/.ssh
>
> If it still doesn't work, please post the output of the above two
> commands along with the error message produced by ElastiCluster.
>
> Ciao,
> R
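Note that the traceback above is trying to create `/home/.ansible/tmp`, i.e. `$HOME` resolves to `/home` rather than the real home directory (the commands run inside a container here). Since the same traceback shows Ansible reading the `ANSIBLE_LOCAL_TEMP` environment variable for this path, one hedged workaround (an untested sketch, not part of the original thread) is to point the temp directory somewhere writable before running ElastiCluster:

```shell
# Workaround sketch: override Ansible's local temp dir (the env var
# name comes straight from the traceback above) so it no longer
# depends on a mis-resolved $HOME.
export ANSIBLE_LOCAL_TEMP=/tmp/ansible-local-tmp
mkdir -p "$ANSIBLE_LOCAL_TEMP"
```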
Re: [elasticluster] SLURM is not installed after cluster setup
The `sudo` issue is solved but [Errno 13] is still there. Output is attached.

Orhan

On Sun, Feb 4, 2018 at 2:31 PM, Riccardo Murri <riccardo.mu...@gmail.com> wrote:

> 2018-02-04 12:15 GMT+01:00 Orxan Shibliyev <orxan.shi...@gmail.com>:
> > The second command gave:
> >
> > orhan@orhan-MS-7850:~$ ./elasticluster.sh -vvv start slurm-on-gce
> > docker: Got permission denied while trying to connect to the Docker daemon
> > socket at unix:///var/run/docker.sock: Post
> > http://%2Fvar%2Frun%2Fdocker.sock/v1.31/containers/create: dial unix
> > /var/run/docker.sock: connect: permission denied.
>
> Then you probably need to add yourself to the `docker` group:
>
>     sudo gpasswd -a $(whoami) docker
>
> Note: replace `docker` above with whatever group owns the socket
> `/var/run/docker.sock`.
>
> You might need to log out and back in for the group change to be
> picked up; or run `newgrp docker` to get a shell with the correct
> permissions.
>
> Please let me know if it works, so I can automate this in the
> `elasticluster.sh` script.
>
> Ciao,
> R

orhan@orhan-MS-7850:~$ ./elasticluster.sh -vvv start slurm-on-gce
2018-02-04 14:54:34 41e0a6cea578 gc3.elasticluster[1] DEBUG Checking section `cluster/slurm-on-gce` ...
2018-02-04 14:54:34 41e0a6cea578 gc3.elasticluster[1] DEBUG Checking section `cluster/gridengine-on-gce` ...
2018-02-04 14:54:34 41e0a6cea578 gc3.elasticluster[1] DEBUG Checking section `login/google` ...
2018-02-04 14:54:34 41e0a6cea578 gc3.elasticluster[1] DEBUG Checking section `setup/gridengine` ...
2018-02-04 14:54:34 41e0a6cea578 gc3.elasticluster[1] DEBUG Checking section `setup/slurm` ...
2018-02-04 14:54:34 41e0a6cea578 gc3.elasticluster[1] DEBUG Checking section `setup/pbs` ...
2018-02-04 14:54:34 41e0a6cea578 gc3.elasticluster[1] DEBUG Checking section `cloud/google` ...
2018-02-04 14:54:35 41e0a6cea578 gc3.elasticluster[1] DEBUG Using class from module to instanciate provider 'google'
2018-02-04 14:54:35 41e0a6cea578 gc3.elasticluster[1] DEBUG Using class from module to instanciate provider 'ansible'
2018-02-04 14:54:35 41e0a6cea578 gc3.elasticluster[1] DEBUG setting variable multiuser_cluster=yes for node kind compute
2018-02-04 14:54:35 41e0a6cea578 gc3.elasticluster[1] DEBUG setting variable multiuser_cluster=yes for node kind frontend
2018-02-04 14:54:35 41e0a6cea578 gc3.elasticluster[1] DEBUG setting variable multiuser_cluster=yes for node kind submit
Starting cluster `slurm-on-gce` with:
* 1 frontend nodes.
* 2 compute nodes.
(This may take a while...)
2018-02-04 14:54:35 41e0a6cea578 gc3.elasticluster[1] INFO Starting cluster nodes ...
2018-02-04 14:54:35 41e0a6cea578 gc3.elasticluster[1] DEBUG Note: starting 3 nodes concurrently.
2018-02-04 14:54:35 41e0a6cea578 gc3.elasticluster[1] DEBUG _start_node: working on node `frontend001`
2018-02-04 14:54:35 41e0a6cea578 gc3.elasticluster[1] INFO Starting node `frontend001` from image `ubuntu-1604-xenial-v20180126` with flavor n1-standard-1 ...
2018-02-04 14:54:35 41e0a6cea578 gc3.elasticluster[1] DEBUG _start_node: working on node `compute002`
2018-02-04 14:54:35 41e0a6cea578 gc3.elasticluster[1] DEBUG _start_node: working on node `compute001`
2018-02-04 14:54:35 41e0a6cea578 gc3.elasticluster[1] INFO Starting node `compute002` from image `ubuntu-1604-xenial-v20180126` with flavor n1-standard-1 ...
2018-02-04 14:54:35 41e0a6cea578 gc3.elasticluster[1] INFO Starting node `compute001` from image `ubuntu-1604-xenial-v20180126` with flavor n1-standard-1 ...
2018-02-04 14:54:47 41e0a6cea578 gc3.elasticluster[1] DEBUG Node `compute002` has instance ID `slurm-on-gce-compute002`
2018-02-04 14:54:47 41e0a6cea578 gc3.elasticluster[1] INFO Node `compute002` has been started.
2018-02-04 14:55:16 41e0a6cea578 gc3.elasticluster[1] DEBUG Node `frontend001` has instance ID `slurm-on-gce-frontend001`
2018-02-04 14:55:16 41e0a6cea578 gc3.elasticluster[1] INFO Node `frontend001` has been started.
2018-02-04 14:55:20 41e0a6cea578 gc3.elasticluster[1] DEBUG Node `compute001` has instance ID `slurm-on-gce-compute001`
2018-02-04 14:55:20 41e0a6cea578 gc3.elasticluster[1] INFO Node `compute001` has been started.
2018-02-04 14:55:20 41e0a6cea578 gc3.elasticluster[1] DEBUG Getting information for instance slurm-on-gce-compute002
2018-02-04 14:55:20 41e0a6cea578 gc3.elasticluster[1] DEBUG node `compute002` (instance id slurm-on-gce-compute002) is up.
2018-02-04 14:55:21 41e0a6cea578 gc3.elasticluster[1] DEBUG Getting information for instance slurm-on-gce-frontend001
2018-02-04 14:55:21 41e0a6cea578
[elasticluster] sinfo gives wrong number of nodes after resize
Hi,

Initially I made one frontend and two compute nodes. On the frontend, `sinfo` reported the number of nodes as two. Then I added five more compute nodes with `./elasticluster.sh resize -a 5:compute slurm-on-gce`. As expected, I got the compute nodes; however, on the frontend, `sinfo` still gives the same information, that is, two nodes.

Orhan
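No answer is recorded in this thread, but a plausible explanation (an assumption on my part, not confirmed here) is that `resize` only creates the VMs, while SLURM's node list lives in `slurm.conf`, which is rendered during `setup`. A hedged sketch of the usual remedy:

```shell
# Re-run the configuration step so the SLURM config is re-rendered
# with the five new compute nodes included.
./elasticluster.sh setup slurm-on-gce

# Then, on the frontend, make the controller re-read its config
# if sinfo is still stale (standard SLURM command).
sudo scontrol reconfigure
```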
Re: [elasticluster] SLURM sbatch error
Your test does not work for me, and restarting SLURM does not help. The base OS is Debian GNU/Linux 9.4 (stretch). I get errors related to Lmod:

TASK [lmod : Is installation directory writable?] **
fatal: [compute003]: FAILED! => {"changed": true, "cmd": ["test", "-w", "/opt/lmod/7.0/"], "delta": "0:00:00.010908", "end": "2018-04-19 14:05:07.669722", "failed": true, "rc": 1, "start": "2018-04-19 14:05:07.658814", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring
fatal: [compute002]: FAILED! => {"changed": true, "cmd": ["test", "-w", "/opt/lmod/7.0/"], "delta": "0:00:00.035474", "end": "2018-04-19 14:05:08.090735", "failed": true, "rc": 1, "start": "2018-04-19 14:05:08.055261", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring

and other errors such as these:

compute001  : ok=7    changed=1   unreachable=0   failed=1
compute002  : ok=121  changed=79  unreachable=0   failed=0
compute003  : ok=121  changed=79  unreachable=0   failed=0
frontend001 : ok=124  changed=87  unreachable=0   failed=0

Command `ansible-playbook --private-key=/home/orhan/.ssh/google_compute_engine /home/elasticluster/share/playbooks/site.yml --inventory=/home/orhan/.elasticluster/storage/slurm-on-gce.inventory --become --become-user=root -e elasticluster_output_dir=/tmp/elasticluster.2WFV9u.d` failed with exit code 2.

I think in my previous tries only the Lmod-related errors existed. For some reason I treated them as warnings instead of errors.
Config:

[cloud/google]
noauth_local_webserver=yes
provider=google
gce_client_id=<>
gce_client_secret=<>
gce_project_id=tailor-193612

[login/google]
image_user=orxan.shibli
image_sudo=yes
user_key_name=elasticluster
user_key_private=~/.ssh/google_compute_engine
user_key_public=~/.ssh/google_compute_engine.pub

[setup/slurm]
frontend_groups=slurm_master
compute_groups=slurm_worker
submit_groups=slurm_submit,glusterfs_client
global_var_multiuser_cluster=yes

[cluster/slurm-on-gce]
setup=slurm
frontend_nodes=1
compute_nodes=3
ssh_to=frontend
cloud=google
login=google
flavor=n1-standard-1
security_group=default
image_id=https://www.googleapis.com/compute/v1/projects/tailor-193612/global/images/image-23

On Thu, Apr 19, 2018 at 3:17 PM, Riccardo Murri wrote:

> Hello Orxan,
>
> I cannot reproduce this error; with a freshly-started Ubuntu 16.04
> cluster, I get:
>
>     ubuntu@frontend001:~$ cat test.sh
>     #! /bin/sh
>
>     echo hello
>
>     ubuntu@frontend001:~$ sbatch test.sh
>     Submitted batch job 2
>
>     ubuntu@frontend001:~$ cat slurm-2.out
>     hello
>
> One caveat: right after building the cluster, the SLURM controller
> daemon was not running -- I had to restart it with "sudo service
> slurmctld restart".
>
> Did you get any errors while building the cluster? What base OS are
> you using? What config?
>
> Ciao,
> R
[elasticluster] SLURM sbatch error
Hi,

The very same `sbatch` script gave an error after `sbatch submit.sh`:

Error message:

sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

submit.sh:

#!/bin/bash
#SBATCH --nodes=3-3
#SBATCH --ntasks=3
#SBATCH -t 10:00:0
#SBATCH --output=slurm-%j.out
#SBATCH --error=slurm-%j.err
#SBATCH --mem 3000

mpirun ./out 3

It also does not matter if I add an account and partition:

#SBATCH -A orxan_shibli
#SBATCH -p main

On the frontend, I tried things like:

sudo service slurmctld restart
sudo service slurmd restart

but nothing helped. I recently re-installed ElastiCluster. I did not get this error with the same script in the previous installation. Has something changed about the SLURM configuration recently?

Regards,
Orxan
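The "Invalid account or account/partition combination" message usually comes from SLURM's accounting database rather than from the batch script itself. A hedged diagnostic sketch, assuming accounting is enabled on the cluster (these are standard `sacctmgr`/`sinfo` commands, not ElastiCluster-specific, and would be run on the frontend):

```shell
# Which account/partition combinations is the submitting user
# actually associated with?
sacctmgr show associations format=Account,User,Partition

# Which partitions exist, are they available, and how many nodes
# does each have?
sinfo -o "%P %a %D"
```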
[elasticluster] ERR_CONNECTION_REFUSED
I have been using ElastiCluster with great pleasure. After a dist-upgrade I wanted to install ElastiCluster again. I have an issue which I somehow solved in the first installation, but this time I don't remember how I did it. I think this is related to Google, but after some research I am still clueless. Basically, I get the following:

Your browser has been opened to visit:

If your browser is on a different machine then exit and re-run this application with the command-line parameter --noauth_local_webserver

When I open the link and allow access, I get:

This site can't be reached
localhost refused to connect.
Search Google for localhost 8080
ERR_CONNECTION_REFUSED
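ElastiCluster's Google cloud section accepts a `noauth_local_webserver` option (it appears in a config excerpt elsewhere in this archive), which matches the hint in the quoted message: skip the `localhost:8080` OAuth redirect and paste the verification code into the terminal instead. A hedged config sketch; the `gce_*` values are placeholders for your own settings:

```ini
[cloud/google]
# Skip the localhost:8080 OAuth redirect entirely and paste the
# verification code into the terminal instead.
noauth_local_webserver=yes
provider=google
# ... your gce_client_id / gce_client_secret / gce_project_id ...
```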
[elasticluster] Error: Ensure the APT package cache is updated
Until now ElastiCluster was working perfectly. I have changed nothing, but today I got the following error from "./elasticluster.sh start slurm-on-gce". What is the problem?

...
TASK [common : Ensure the APT package cache is updated]
fatal: [frontend001]: FAILED! => {"changed": false, "failed": true, "msg": "Failed to update apt cache."}
fatal: [compute005]: FAILED! => {"changed": false, "failed": true, "msg": "Failed to update apt cache."}
fatal: [compute006]: FAILED! => {"changed": false, "failed": true, "msg": "Failed to update apt cache."}
fatal: [compute007]: FAILED! => {"changed": false, "failed": true, "msg": "Failed to update apt cache."}
fatal: [compute003]: FAILED! => {"changed": false, "failed": true, "msg": "Failed to update apt cache."}
fatal: [compute001]: FAILED! => {"changed": false, "failed": true, "msg": "Failed to update apt cache."}
fatal: [compute002]: FAILED! => {"changed": false, "failed": true, "msg": "Failed to update apt cache."}
fatal: [compute009]: FAILED! => {"changed": false, "failed": true, "msg": "Failed to update apt cache."}
fatal: [compute004]: FAILED! => {"changed": false, "failed": true, "msg": "Failed to update apt cache."}
fatal: [compute008]: FAILED! => {"changed": false, "failed": true, "msg": "Failed to update apt cache."}
fatal: [compute010]: FAILED! => {"changed": false, "failed": true, "msg": "Failed to update apt cache."}
[WARNING]: Could not create retry file '/home/elasticluster/share/playbooks/site.retry'.
[Errno 13] Permission denied: u'/home/elasticluster/share/playbooks/site.retry'

PLAY RECAP *
compute001  : ok=4  changed=0  unreachable=0  failed=1
compute002  : ok=4  changed=0  unreachable=0  failed=1
compute003  : ok=4  changed=0  unreachable=0  failed=1
compute004  : ok=4  changed=0  unreachable=0  failed=1
compute005  : ok=4  changed=0  unreachable=0  failed=1
compute006  : ok=4  changed=0  unreachable=0  failed=1
compute007  : ok=4  changed=0  unreachable=0  failed=1
compute008  : ok=4  changed=0  unreachable=0  failed=1
compute009  : ok=4  changed=0  unreachable=0  failed=1
compute010  : ok=4  changed=0  unreachable=0  failed=1
frontend001 : ok=4  changed=0  unreachable=0  failed=1

2018-04-02 13:08:43 fe971db10edf gc3.elasticluster[1] ERROR Command `ansible-playbook /home/elasticluster/share/playbooks/site.yml --inventory=/home/orhan/.elasticluster/storage/slurm-on-gce.inventory --become --become-user=root` failed with exit code 2.
2018-04-02 13:08:43 fe971db10edf gc3.elasticluster[1] ERROR Check the output lines above for additional information on this error.
2018-04-02 13:08:43 fe971db10edf gc3.elasticluster[1] ERROR The cluster has likely *not* been configured correctly. You may need to re-run `elasticluster setup` or fix the playbooks.
2018-04-02 13:08:43 fe971db10edf gc3.elasticluster[1] WARNING Cluster `slurm-on-gce` not yet configured. Please, re-run `elasticluster setup slurm-on-gce` and/or check your configuration
...
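The Ansible error hides apt's actual output. One hedged way to see it (an untested sketch; `elasticluster ssh` opens a shell on the cluster's `ssh_to` node):

```shell
# Log in to the cluster and run the update by hand to see the real
# apt error (dead mirror, no outbound network, stale release file, ...).
./elasticluster.sh ssh slurm-on-gce

# then, on the node:
sudo apt-get update
```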
[elasticluster] Different flavors for frontend and compute nodes
Is it possible to use different flavors for the frontend and compute nodes? I want a high-memory machine for the frontend and lower memory for the computes. I use GCE machines.
Re: [elasticluster] Different flavors for frontend and compute nodes
Thanks for the reply. I just realized that submitting a 1-node job with SLURM launches compute001 even though I did not specify the node. It seems like ElastiCluster does not intend to use the frontend for computations. I don't know if I should ask this question in a separate post, but is it possible to run the high-memory job on the frontend instead of a compute node? If that is not possible, how can I make two sections for compute nodes instead of sectioning compute and frontend?

On Tue, Dec 18, 2018 at 1:13 PM Manuele Simi wrote:

> Yes, it is possible.
>
> You need to define `cluster/<name>/<group>` sections that override the
> `cluster/<name>` section.
>
> In the following example I create specific configurations (with their own
> flavor) for the nodes in the compute and frontend groups of a cluster named
> "gridengine".
>
> # Cluster section
> [cluster/gridengine]
> ...
> frontend_nodes=1
> compute_nodes=3
>
> # Compute node section
> [cluster/gridengine/compute]
> flavor=n1-highcpu-2
> ...
>
> # Frontend node section
> [cluster/gridengine/frontend]
> flavor=n1-standard-64
> ...
>
> On Tue, Dec 18, 2018 at 8:46 AM Orxan Shibliyev wrote:
>
>> Is it possible to take different flavors for frontend and compute nodes?
>> I want high memory machine for frontend and lower memory for computes. I
>> use GCE machines.
[elasticluster] Re: SLURM: Unable to contact slurm controller
Please disregard my previous post. I hadn't even constructed a cluster, just a single instance. Sorry for taking your time.

On Thu, Dec 20, 2018 at 1:44 PM Orxan Shibliyev wrote:

> For some reason I get "Unable to contact slurm controller (connect
> failure)" for any SLURM command. I constructed the cluster as usual but this
> time it gives the mentioned error. What could be the reason?
[elasticluster] SLURM: Unable to contact slurm controller
For some reason I get "Unable to contact slurm controller (connect failure)" for any SLURM command. I constructed the cluster as usual, but this time it gives the mentioned error. What could be the reason?
[elasticluster] Elasticluster copies files before job submission
When I run a job setting the number of nodes to 1 and the number of tasks to 1 as well, naturally only compute001 runs the job. Then I run an 8-node job and I see that the output of the first job, which ran on compute001, is also available on the other nodes. Does ElastiCluster copy files among compute nodes before job submission?
Re: [elasticluster] Elasticluster copies files before job submission
So when a node produces a file, the file will be copied to all other nodes, right? What if nodes produce files with the same name but different content? Which file will be read by a node?

On Fri, Dec 21, 2018 at 3:57 PM Riccardo Murri wrote:

> Hello Orhan,
>
> > When I run a job by setting number of nodes to 1 and number of tasks to
> > 1 as well, naturally, only compute001 runs the job. Then I run 8-node job
> > and I see that output of first job which ran on compute001 are also
> > available on other nodes. Does elasticluster copy files among compute nodes
> > before job submission?
>
> Home directories in the cluster are shared -- jobs see the same files
> on all compute nodes across the cluster. (This makes it easier to read
> and exchange data, as you do not need to stage it prior to running the
> job.)
>
> Does this answer your question?
>
> Ciao,
> R
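Riccardo's point about shared home directories also answers the follow-up question: nothing is copied between nodes; there is a single file per path, so two jobs writing the same filename overwrite each other and every node sees only the last writer's content. A minimal local sketch of that behaviour (plain shell, with a temp directory standing in for the NFS-shared home):

```shell
# Two "jobs" writing the same path on a shared filesystem: the second
# write replaces the first, and any subsequent reader sees only the
# last writer's content.
SHARED=$(mktemp -d)   # stands in for the shared home directory

echo "output of job on compute001" > "$SHARED/result.txt"
echo "output of job on compute002" > "$SHARED/result.txt"

cat "$SHARED/result.txt"   # prints: output of job on compute002
```

In practice this is why job scripts embed a unique tag such as the job ID in output filenames, as in the `--output=slurm-%j.out` line of the submit.sh earlier in this archive.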