[gentoo-python] The future of parallel phases?

Michał Górny Sat, 22 Nov 2014 14:07:43 -0800

Hello, Python team and other nice people.

I'd like to discuss the topic of parallel runs again. While it sounded
like a good idea at first, I have my doubts now. I'll briefly describe
the implementation, then go over its advantages and disadvantages.



Implementation
--------------

Right now, parallel runs are a feature of multibuild.eclass, which in
turn uses multiprocessing.eclass. It's a somewhat hacky implementation
in bash, but it works. However, calling 'die' inside such an
implementation is illegal per PMS, and the Council is not interested in
changing that even though we're doing that a lot :).

The parallel support is implemented in python-r1 through the
python_parallel_foreach_impl function. This function in turn is used to
implement parallel running of sub-phases in distutils-r1.
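
To make the general idea concrete, here is a minimal sketch of running
one command per implementation in background subshells. It is not the
code from multibuild.eclass or python-r1; parallel_foreach and
fake_phase are hypothetical names used only for illustration. It also
shows why failures are awkward in this model: an 'exit' inside
a background subshell ends only that subshell, so the parent has to
collect the exit statuses itself.

```bash
# Hypothetical sketch, NOT the actual eclass implementation.
# Usage: parallel_foreach <command> <impl>...
parallel_foreach() {
    local cmd pids pid impl ret
    cmd=$1
    shift
    pids=
    for impl in "$@"; do
        # Each call runs in its own background subshell.
        ( "$cmd" "$impl" ) &
        pids="$pids $!"
    done
    ret=0
    for pid in $pids; do
        # wait returns the exit status of the given child.
        wait "$pid" || ret=1
    done
    return "$ret"
}

# Stand-in for a sub-phase; fails for one (assumed) implementation name.
fake_phase() {
    if [ "$1" = "pypy" ]; then
        echo "fake_phase: $1 failed" >&2
        exit 1   # ends only the subshell, not the calling shell
    fi
    echo "fake_phase: $1 done"
}

if parallel_foreach fake_phase python2_7 python3_4 pypy; then
    status="all sub-phases succeeded"
else
    status="at least one sub-phase failed"
fi
echo "$status"
```

The order of the "done" lines may differ between runs, since the
subshells are not synchronized with each other.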


Adv. and disadv. of parallel phases now
---------------------------------------

Advantages:

- speedup of non-parallel build tasks -- compiling Python modules,
  extensions (before Python 3.4 [?]), running 2to3. The latter tends to
  take a lot of CPU time while utilizing only one core of a modern CPU.
  Running it in parallel for a few impls makes it possible to utilize
  the full power of the CPU.

- speedup of PyPy phase runs -- PyPy and PyPy3 take quite long to
  start. By spawning their phases first and in parallel to CPython
  runs, we can speed the build up a bit. The idea is that
  implementations that usually take longer to build are spawned first
  so that all cores are kept busy for as long as possible.

- uncovering silly assumptions in build systems -- we have a lot of
  build systems that write to random locations and expect files not to
  be touched by anything else.
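
As an illustration of the last point, the sketch below contrasts
a build step that writes to one fixed path with one that writes to
a per-implementation directory. The function names and the directory
layout are assumptions made up for this example, not anything the
eclasses or a particular build system provide.

```bash
# Hypothetical example; names and paths are illustrative only.
workdir=$(mktemp -d)

# Unsafe when run in parallel: every implementation writes the same
# file, so the final content depends on which job finishes last.
build_shared() {
    echo "$1" > "$workdir/shared.txt"
}

# Safe when run in parallel: each implementation gets its own directory.
build_unique() {
    mkdir -p "$workdir/$1" && echo "$1" > "$workdir/$1/out.txt"
}

for impl in python2_7 python3_4 pypy; do
    ( build_unique "$impl" ) &
done
wait

# Each output file holds exactly the name of its own implementation.
py27_out=$(cat "$workdir/python2_7/out.txt")
pypy_out=$(cat "$workdir/pypy/out.txt")
echo "$py27_out $pypy_out"

rm -rf "$workdir"
```

Swapping build_unique for build_shared in the loop would leave
a single shared.txt whose content is whichever name was written last.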

Disadvantages:

- conflict with parallel parts of build -- I think Python 3.4's
  distutils is capable of building extensions in parallel [can we
  backport that?]. The same goes for nosetests and possibly some other
  stuff.

- possibility of high resource usage -- this especially applies to
  tests that weren't written with the assumption that someone would be
  running, say, 4 instances of them in parallel.

- necessity of fighting build system bugs -- it's rather common that
  tests and builds write to files in sourcedir or tempdir without
  proper unique naming. Long story short, we need to work around that
  stuff a lot to keep the tests from failing randomly and to get the
  build to install the correct files (and e.g. not mix
  implementations).

- some developers are surprised that variables set inside sub-phases
  are not preserved in the global scope (because sub-phases run in
  a subshell).
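
The subshell effect from the last point can be reproduced in a few
lines of plain shell. This is only a demonstration of the shell
semantics; sub_phase is a hypothetical stand-in and not an eclass
function.

```bash
# Hypothetical stand-in for a sub-phase that sets a global variable.
sub_phase() {
    result="set by $1"
}

# Sequential call: runs in the current shell, the assignment persists.
result="unset"
sub_phase python3_4
seq_result=$result

# Parallel-style call: runs in a background subshell, so the
# assignment is made in a copy of the environment and is lost.
result="unset"
( sub_phase python3_4 ) &
wait
par_result=$result

echo "sequential: $seq_result"   # prints: sequential: set by python3_4
echo "parallel:   $par_result"   # prints: parallel:   unset
```

The same applies to 'cd' and other state changes made inside the
subshell.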


What if we disabled it?
-----------------------

Advantages:

- the eclass becomes a little simpler and loses the dependency on
  multiprocessing (well, it will still be inherited implicitly but not
  used).

- developers no longer have to fix all the upstream build system
  failures.

- resource-consuming and parallel parts of build no longer have to be
  hacked to avoid issues with multiprocessing.

- we comply with PMS again.

Disadvantages:

- 2to3 and pure Python module build/install steps will be noticeably
  slower and less efficient (esp. noticeable for PyPy and PyPy3).

- some ebuilds may have to be modified because developers assumed that
  changes (global vars, working directories) made within a sub-phase
  would not affect subsequent phases.


What are your thoughts?

-- 
Best regards,
Michał Górny
