Hi All,
I'd like to ask whether someone could provide more clarity on how to
actually configure a cluster to use HPX and how to run an HPX
application (or at least the specific considerations for setting up a
cluster to run one). Since that is general and more than a little
vague, let me be more specific:
- Although it is understood that a system will necessarily use some form
of job management system (as described in the docs under Getting
Started, where examples are provided for PBS and SLURM), there are still
aspects of the configuration which are not obvious (at least to a fairly
lay user such as myself). In particular, there are many configuration
options described in the manual
(http://stellar-group.github.io/hpx/docs/html/hpx/manual/init/commandline.html
for example):
--hpx:worker
--hpx:console
--hpx:connect
--hpx:run-agas-server
--hpx:run-hpx-main
--hpx:hpx arg
--hpx:agas arg
--hpx:run-agas-server-only
etc...
I admit that I haven't really tried to read through the code dealing
with this part of the HPX runtime system. However, I think the
documentation should better clarify (a) what, specifically, each of
these options does, and when it is appropriate to use it; and (b) which
of these options are generally (or always) set by the job management
system (for example via environment variables), and which need to be
set by the user.
Although I cannot find the message off-hand right now, I distinctly
remember messages sent through the mailing list in which someone
provided a particular set of command-line arguments to help a
questioner diagnose problems running their code. In that case, I seem
to recall that one machine was set to run only the AGAS service (or at
least, a specific machine in the cluster was identified as hosting
AGAS), along with some other arguments which I cannot recall right now.
Should there always be a specific machine used for AGAS? Are there
instances where it is required, instances where it is recommended,
and/or instances where it is not necessary at all?
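To make the question more concrete, here is roughly the kind of
invocation I imagine for a manual two-node run with an explicitly
designated AGAS host. The application name, hostnames, and port are
made up, and I have not verified that this combination of flags is
correct; it is only my guess based on the option descriptions:

    # on node0 (the machine I imagine hosting AGAS and the console)
    ./my_hpx_app --hpx:console --hpx:run-agas-server \
                 --hpx:hpx node0:7910 --hpx:agas node0:7910 \
                 --hpx:localities 2

    # on node1 (an additional worker locality)
    ./my_hpx_app --hpx:worker \
                 --hpx:hpx node1:7910 --hpx:agas node0:7910 \
                 --hpx:localities 2

If something along these lines is indeed how a manual (non-batch-system)
launch is supposed to look, then documenting one such complete example
would answer a large part of my question.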
- Perhaps this is a bit of a dumb question, but I'd rather understand
things well than not ask... Under Configuration / Configuration Default
Settings
(http://stellar-group.github.io/hpx/docs/html/hpx/manual/init/configuration/config_defaults.html),
a number of options are set and/or described. For example:
- under " /*The|[hpx|] Configuration Section", */there are various
options here which can be set -- such as hpx.location, hpx.localities,
hpx.os_threads. It would seem reasonable that many of these are set
automatically by the software based on how the program was invoked
through the job management system. Others such as stack size would not
come from the job management system (unless it is specifically passed as
a command-line argument, where possible).
- Under the [hpx.parcel] configuration section, again there are a
number of options such as the parcel address, port, etc. Which of these
are automatically set, and which would need to be set explicitly?
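Again, just to illustrate what I mean (the values are invented; I am
assuming hpx.parcel.address and hpx.parcel.port are the relevant keys,
based on the defaults page):

    [hpx.parcel]
    address = 192.168.1.10
    port = 7910

In particular, I would guess the address has to differ per node, which
makes me wonder how it could sensibly be set in a system-wide file.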
- As I understand from the description of the property
hpx.parcel.mpi.enable, MPI support is automatically detected at
startup, as long as the application itself was started within a
parallel MPI environment. I know this is somewhat outside the scope of
the HPX documentation itself, but a more comprehensive set of examples
using PBS and SLURM for different cases would be greatly appreciated.
Although much of that should primarily be addressed in the SLURM
documentation, some additional examples of running programs would be
extremely helpful to anyone who does not have the benefit of extensive
experience.
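By "additional examples" I mean something as simple as a complete batch
script. For instance, my guess at a minimal SLURM script would look
roughly like this (the application name and resource numbers are made
up, and I am assuming that the ranks started by srun become the HPX
localities when the MPI parcelport is enabled):

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=1
    #SBATCH --cpus-per-task=8

    srun ./my_hpx_app --hpx:threads=8

Even a handful of such scripts (TCP vs. MPI parcelport, one vs. several
localities per node) would go a long way.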
- Looking at the [hpx.agas] section, it isn't clear to me how this
should best be configured. This was partially touched on above in the
first half of the question dealing with command-line arguments, but not
completely. Obviously, the default address 127.0.0.1 would only work
for a program running on a single locality. There are a number of
things that are not obvious to me here: (a) Is the preprocessor
constant applicable when HPX itself is compiled, or when an application
using HPX is compiled? (b) Should there generally be a single
locality/cluster node set aside to be the AGAS server? (c) How is the
configuration option hpx.agas.service_mode supposed to be used? My
guess is that, in general, a single AGAS server should be selected for
a cluster in advance, and that the system-wide ini config files should
set this to "bootstrap" on that particular machine and "hosted" on the
others. Is this the case, or have I misunderstood?
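In other words, is the following (purely hypothetical) arrangement what
is intended? The hostnames and port are invented, and I may well have
the meaning of the address key wrong.

On the machine chosen to host AGAS:

    [hpx.agas]
    service_mode = bootstrap
    address = node0.cluster.local
    port = 7910

On all other machines:

    [hpx.agas]
    service_mode = hosted
    address = node0.cluster.local
    port = 7910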
I recognize that the documentation is, of course, a work in progress,
and although it is quite impressive and clear on many points, there are
other points that I find rather unclear. Perhaps this is because most
HPX users have the benefit of peers/colleagues/sysadmins who can
provide this information informally. Unfortunately, right now, I do not
personally have that benefit, and I'm sure there are others in my
situation at present, and there will undoubtedly be more and more of us
as the use of HPX becomes increasingly widespread.
I would be willing, in general, to assist with improving the
documentation to the best of my abilities. If someone can help me
better understand these kinds of issues (or other areas that have
already been identified as needing better documentation), I would be
happy to write them up in a more articulate form that can be published
online.
Thanks and regards,
Shmuel