Michael,

> I'd like to ask if it might be possible for someone to provide more
> clarity as to how to actually configure a cluster to use HPX, and how to
> run an HPX application (or at least the specific considerations for
> setting up a cluster to run an application). As the above is general and
> more than a little vague I'd like to clarify a bit:
>
> - Although it is understood that a system will necessarily use some form
> of job management system (as described in the docs under Getting Started,
> where examples are provided for PBS and SLURM), there are still aspects of
> the configuration which are not obvious (at least to a fairly lay user
> such as myself). In particular, there are many configuration options
> described in the manual
> (http://stellar-group.github.io/hpx/docs/html/hpx/manual/init/commandline.html
> for example):
>
>     --hpx:worker
>     --hpx:console
>     --hpx:connect
>     --hpx:run-agas-server
>     --hpx:run-hpx-main
>     --hpx:hpx arg
>     --hpx:agas arg
>     --hpx:run-agas-server-only
>     etc...
All of those shouldn't be necessary if you run your application through the
batch scheduler. There is a Stack Overflow question answering most of this:
http://stackoverflow.com/questions/35367816/hpx-minimal-two-node-example-set-up,
which also explains what those options are for.

Especially if you use MPI, running the application should be as simple as:

    mpirun <possibly options to MPI> your_app your_app_options

If run from a batch system you usually shouldn't need to specify any options
to mpirun.

> I admit that I haven't really tried to read through and understand the
> code at all dealing with this part of the HPX runtime system. However, I
> think that the documentation should better clarify (a) what, specifically,
> these options cause to happen and considerations about when to use
> particular options;

See above.

> (b) which of these options are generally (or always) set by the job
> management system (by setting environment variables, for example), and
> which need to be set by the user, in general.

This is a good point. When using TCP/IP, all HPX localities extract the
necessary information from the batch environment, which is mainly:

- the nodes the localities run on (implicitly this gives us the number of
  localities)
- the number of localities run per node
- the number of cores to utilize per locality

Those settings are extracted from different environment variables specific to
the batch scheduler. From looking at
https://github.com/STEllAR-GROUP/hpx/blob/master/src/util/batch_environments/pbs_environment.cpp
you can see, for instance, that in the case of PBS we look at the following
environment variables: PBS_NODENUM, PBS_NUM_PPN, PBS_NODEFILE (support for
other batch schedulers is implemented in files available from the same
directory). HPX uses this information to determine the TCP/IP addresses/ports
to use to let the localities communicate.
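To make this concrete, here is a minimal sketch of a PBS job script that
launches an HPX application through mpirun. The application name (my_hpx_app)
and the resource numbers are hypothetical; the point is that no --hpx:*
options are passed, since mpirun and HPX pick everything up from the batch
environment variables described above:

```shell
#!/bin/bash
#PBS -l nodes=2:ppn=16
#PBS -l walltime=00:10:00

# Run from the directory the job was submitted from.
cd "$PBS_O_WORKDIR"

# mpirun reads the node list from the PBS environment (PBS_NODEFILE),
# and HPX itself inspects PBS_NODENUM/PBS_NUM_PPN as described above,
# so no --hpx:* options are needed here.
mpirun ./my_hpx_app
```

A SLURM script would look the same in spirit, just with #SBATCH directives
instead of the #PBS lines.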
For MPI, all of the batch-system information is extracted by mpirun as well,
which is setting up (running) the executables (localities) on the different
nodes. HPX additionally looks at the environment variables as necessary. If
you care for really detailed information on how MPI is set up inside each of
the localities, here is the place to look:
https://github.com/STEllAR-GROUP/hpx/blob/master/plugins/parcelport/mpi/mpi_environment.cpp

> Although I cannot find the message off-hand right now, I distinctly
> remember some messages sent through the mailing list in which someone
> provided a particular set of command line arguments to a questioner to
> help diagnose some problems running their code. In that case, I seem to
> recall that one machine was set to only run the AGAS (or at least, a
> specific machine in the cluster was identified as hosting the AGAS), and
> some other arguments, which I cannot recall right now. Should there
> always be a specific machine used for the AGAS? Are there instances where
> it is required, instances where it is recommended, and/or instances where
> it is not at all necessary?

AGAS is a distributed system (across all localities), but during startup HPX
relies on an initial AGAS instance being available on locality zero (MPI
rank 0).

> - Perhaps this is a bit more of a dumb question, but I'd rather understand
> things well than not ask... Under the configuration / configuration
> default settings
> (http://stellar-group.github.io/hpx/docs/html/hpx/manual/init/configuration/config_defaults.html),
> there are obviously a number of options that are set and/or described
> here. For example:
>
>   - under "The [hpx] Configuration Section", there are various options
> here which can be set -- such as hpx.location, hpx.localities,
> hpx.os_threads. It would seem reasonable that many of these are set
> automatically by the software based on how the program was invoked through
> the job management system.
> Others such as stack size would not come from the job management system
> (unless it is specifically passed as a command-line argument, where
> possible).

Those are not meant for you to set, but rather for you to query at runtime
if needed.

> - under the [hpx.parcel] configuration section, again there are a number
> of options such as the parcel address, port, etc. Which of these are
> automatically set, and which would need to be explicitly set?

Same here, most of this is meant for introspection, no need to set anything
in the default case.

> - As I understand from the description of the property
> hpx.parcel.mpi.enable, this is automatically detected at startup, as long
> as the application itself was started within a parallel MPI environment.
> I know this is somewhat outside the scope of the documentation itself,
> although having a more comprehensive set of examples using PBS and SLURM
> for different cases would be greatly appreciated. Although that should
> primarily be addressed in the SLURM documentation, some additional
> examples of running programs would be extremely helpful to anyone who
> does not have the benefit of extensive experience.

We have started to put together some information for various machines here:
https://github.com/STEllAR-GROUP/hpx/wiki. Feel free to ask if you need more
details.

> - looking at the [hpx.agas] section, it isn't clear to me how this should
> best be configured. This was partially noted above in the first half of
> the question dealing with command-line arguments, but not completely.
> Obviously, the default address 127.0.0.1 would only work for a program
> running on one single locality. There are a number of things that are not
> obvious to me here: (a) is the preprocessor constant applicable when HPX
> is compiled, or when an application using HPX is compiled?

Preprocessor constants are always applied at compile time.
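Since most of these [hpx] and [hpx.parcel] entries are meant for
introspection rather than for setting, one way to check what the runtime
actually derived from the environment is to dump the effective
configuration. A small sketch (my_hpx_app is again a hypothetical
application name; --hpx:dump-config is the option documented on the
command-line page linked above):

```shell
# Print the configuration HPX ended up with after merging the compiled-in
# defaults, ini files, the batch environment, and the command line:
./my_hpx_app --hpx:dump-config
```

Comparing that output between a plain interactive run and a run inside your
batch system shows exactly which entries the scheduler environment filled in
for you.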
> (b) Should there generally be a single locality/cluster node which is set
> aside to be the AGAS server?

Locality zero is used by default for this.

> (c) how is the configuration option hpx.agas.service_mode supposed to be
> used? Perhaps it seems to be the case that, in general, a single AGAS
> server should be selected for a cluster in advance, and that the
> system-wide ini config files should set this to be "bootstrap" on that
> particular machine and "hosted" on other machines. Is this the case or
> have I misunderstood?

This is meant for introspection as well. Overall, I think we need to add
more information to those settings explaining that they are (usually) not
meant to be directly set by the user.

> I recognize that the documentation is, of course, a work in progress, and
> although it is quite impressive and clear on a lot of points, there are
> some other points which I find to be rather unclear. Perhaps this is
> related to the fact that most of the users of HPX have the benefit of
> peers/colleagues/sysadmins who can provide this information informally.
> Unfortunately, right now, I do not personally have that benefit, and I'm
> sure that there are others in my situation presently, and there will
> undoubtedly be more and more of us as use of HPX becomes increasingly
> widespread.

We fully appreciate your need for more information. Writing documentation is
tough, however, so any input on what information you'd like to see is much
appreciated.

> I would be willing, in general, to assist with improving documentation to
> the best of my abilities. If someone can help me to better understand
> these types of issues (or other areas that have already been identified as
> requiring better documentation), I would be willing to write up the
> documentation in a more articulate form that can be published online.

Perfect! I'd very much welcome your help with the documentation!
HTH
Regards Hartmut
---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu
_______________________________________________
hpx-users mailing list
[email protected]
https://mail.cct.lsu.edu/mailman/listinfo/hpx-users
