Hi Dan,

On 26/08/2016 08:53, Dan Ibanez wrote:
> I apologize in advance if this is the wrong forum
> for this question, but I recently read that PyFR
> is a Gordon Bell finalist for having run on the Titan machine.
> I also recall a year or so ago hearing about significant
> difficulties in running Python at large scale on supercomputers.
> The specific issue I recall has to do with Python's
> module import doing filesystem searches from all
> compute nodes at once, stressing the filesystem.
> Have issues like this been resolved in a general sense,
> or via workarounds?
> What is the current outlook for others to develop Python code
> that can scale on leadership-class supercomputers?
> Anyway, congratulations and thank you in advance for
> any clarifications you can provide.


The problems often encountered when attempting to use Python on large
computing clusters stem from the fact that parallel file systems are
ill-equipped to deal with the large number of metadata operations that
are generated by both the Python interpreter and the dynamic loader.
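
One can get a feel for the scale of this on an ordinary workstation.
As an illustration (assuming strace is available, and using numpy
purely as an example of a package with compiled extensions):

    # Tally the file-related system calls made when importing numpy;
    # on a parallel file system each stat/open is a metadata operation
    strace -cfe trace=open,openat,stat,fstat python -c 'import numpy'

Importing a handful of scientific packages can easily generate
thousands of such calls per process; multiply this by the number of
MPI ranks and the load placed on the metadata server becomes apparent.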

These issues are perennial and are unlikely to be resolved in the near
term; handling millions of metadata IOPS is an extremely challenging
problem.

Nevertheless, there are a range of means by which these issues can be
mitigated.  The first is to ensure that as much of the stack (Python +
modules + application) as possible is pushed out to the nodes
themselves.  One means of accomplishing this is to see that the
relevant dependencies are made available as environment modules.
Further, some systems provide means through which a user can arrange to
have application data pushed directly onto the local disks of the
compute nodes.  Doing this ensures that much of the start-up related
I/O is purely local, and hence scalable.
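
As a minimal sketch, the start of a job script on such a system might
read as follows (the module names are purely illustrative and will
differ from site to site):

    # Pull Python and the heavyweight dependencies from the centrally
    # managed module tree rather than from user space
    module load python/3.5 numpy mpi4py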

For portions of the stack where this is either not possible or
impractical (say, because the application code is undergoing constant
revision and hence the overheads associated with getting it built as a
module are substantial) the next best thing is to pack all remaining
portions into a compressed archive.  The job script is then instructed
to copy this archive from the parallel file system onto a region of
scratch space on each compute node.  This can be /tmp or something
fancier.  The archive can then be extracted and the application run
from there.  Here, the majority of the I/O operations are reads of a
single reasonably sized file, something that is much better suited to
the strengths of a parallel file system.  With this approach, care must
be taken if there are to be multiple jobs per node; however, there are
certainly ways of managing this.
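
As a rough sketch, assuming a SLURM-like scheduler and an illustrative
$SCRATCH directory (on Cray systems aprun plays the role of srun):

    # On the login node, done once: pack the application and any
    # remaining dependencies into a single archive
    tar -czf $SCRATCH/app.tar.gz myapp/ venv/

    # In the job script: stage and extract the archive once per node,
    # then launch the application from node-local storage
    srun --ntasks-per-node=1 cp $SCRATCH/app.tar.gz /tmp
    srun --ntasks-per-node=1 tar -xzf /tmp/app.tar.gz -C /tmp
    srun /tmp/venv/bin/python /tmp/myapp/run.py

The names myapp, venv, and $SCRATCH are placeholders; the important
point is that the parallel file system sees one large streaming read
per node rather than thousands of small metadata operations.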

A combination of these two approaches allows a Python application to
scale nearly indefinitely without overloading the metadata server.
Crucially, neither requires LD_PRELOAD-type hacks or any kind of
special software; they can be deployed directly on the majority of
clusters that we have encountered.

Regards, Freddie.
