Hi,

I'm Johan and I'm working on a Python-based ERP for the Brazilian
market[1]. As part of our software development process we run
pyflakes/pep8/pylint/unittests/coverage validation _before_ each commit
can be integrated into the mainline.

The main bottleneck of this validation turns out to be pylint, so I started
to investigate how multiprocessing could be used to speed it up.

Attached is a first take on just that. It creates a configurable number
of processes and uses them to split up the work on a per-filename basis.
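To make the idea concrete, here is a minimal sketch of the fan-out, not the actual patch: the names (check_file, check_parallel) are hypothetical, and the per-file worker only does a syntax check with compile() so the example stays self-contained instead of depending on pylint internals.

```python
import multiprocessing

def check_file(filename):
    """Check one file; returns (filename, list of problem strings).

    Stands in for pylint's per-file work; to keep this sketch
    self-contained it only does a syntax check via compile().
    """
    try:
        with open(filename) as f:
            compile(f.read(), filename, "exec")
    except SyntaxError as e:
        return (filename, ["%s:%s: %s" % (filename, e.lineno, e.msg)])
    return (filename, [])

def check_parallel(filenames, n_procs):
    # Fan the file list out over a pool of worker processes; each
    # worker handles whole files independently, so no state is shared.
    pool = multiprocessing.Pool(processes=n_procs)
    try:
        return pool.map(check_file, filenames)
    finally:
        pool.close()
        pool.join()
```

Because the unit of work is a whole file, the workers never need to coordinate, which is what makes the split trivial.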

I've been working on 0.26.0 of Pylint, which is the version included with
Ubuntu 13.04. It seems that the part modified by this patch hasn't changed
significantly in trunk, but I can port over my work to a newer version if
there's interest.

The fact that the different processes do not share state with each other
may mean that certain checks are not going to work 100%. I don't fully
understand how pylint works internally, so perhaps someone else could
chip in. Anyway, some numbers:

i3-3220 (1 socket, 2 cores, 4 hyperthreads)

Sample set 1: (37 kloc, 92 files)
- unmodified   18.35s    0%
- n_procs = 1  18.50s   -1%
- n_procs = 2  11.29s  +38%
- n_procs = 3  10.96s  +40%
- n_procs = 4  10.85s  +41%

Sample set 2: (156 kloc, 762 files)
- unmodified   77.62s
- n_procs = 1  82.33s   -6%
- n_procs = 2  47.68s  +39%
- n_procs = 3  46.04s  +41%
- n_procs = 4  45.19s  +42%

Xeon(R) CPU E5410 (2 sockets, 8 cores, no hyperthreading)

Sample set 2: (156 kloc, 762 files)
- unmodified  140.996s
- n_procs = 4  48.675s  +65%
- n_procs = 8  36.323s  +74%

The numbers seem fairly promising, and simple errors introduced into the
code base are properly caught; I didn't test any errors that require deep
knowledge of other modules. The Xeon CPU is pretty old, so each core is
quite a bit slower than the i3's.

Running with n_procs = 1 seems to be a little slower, especially when
there's a lot of source code. I suspect this is due to importing the
multiprocessing module and perhaps the overhead of creating a process.
That could be mitigated by keeping the old code path and only using
multiprocessing when n_procs >= 2.
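That mitigation could look something like the following sketch (again with hypothetical names, and a stubbed-out check_file standing in for the real per-file check): dispatch to the sequential path unless parallelism was actually requested, and import multiprocessing lazily so the single-process case never pays for it.

```python
def check_file(filename):
    # Hypothetical stand-in for the real per-file check;
    # returns (filename, list of problems).
    return (filename, [])

def run_checks(filenames, n_procs=1):
    if n_procs < 2:
        # Original sequential path: no process creation and no
        # multiprocessing import, so no extra overhead.
        return [check_file(f) for f in filenames]
    # Only pay the multiprocessing cost when it can actually help.
    import multiprocessing
    pool = multiprocessing.Pool(processes=n_procs)
    try:
        return pool.map(check_file, filenames)
    finally:
        pool.close()
        pool.join()
```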

Overall I think this is pretty positive for my use case: the total time of
running the validation steps went down from about 7.5 minutes to 5.5
minutes. The other parts of the validation could also be split out over
different processes, but that's for another time.

[1]: http://www.stoq.com.br/

-- 
Johan Dahlin

Attachment: pylint-multiprocessing.diff

_______________________________________________
code-quality mailing list
code-quality@python.org
https://mail.python.org/mailman/listinfo/code-quality
