Hi, first I'd like to qualify myself: I am a student of virtual machine implementations, not (yet) working on PyPy itself, and aware of some HPC issues at a basic level. Still, I'd like to help focus the discussion.
On Mon, Jan 12, 2009 at 12:10, Guillem Borrell i Nogueras <[email protected]> wrote:

> Hi again
>
> Let's discuss the details. I'll try to explain why I've thought about PyPy
> when planning the conference sessions.
>
> My work as a Computational Fluid Dynamics researcher is intimately related
> to supercomputing for obvious reasons. Most of the applications we work on
> are fine-tuned codes with pieces that are more than twenty years old. They
> are rock solid, implemented in Fortran, and run for hours on clusters of
> thousands of computing nodes.
>
> In the last couple of years, computer architectures have become more and
> more complex. I've been playing with the Cell processor lately and that
> little bastard is causing me real pain. While programming gets easier every
> day, supercomputing gets harder and harder. Think about an architecture like
> the Roadrunner: AMD Opteron PPU with PowerPC SPU... Two assemblers in one
> chip!
>
> Talking with Stanley Ahalt (Ohio Supercomputing Center) about a year ago,
> he called that the "software gap": in computing, as time goes by, low
> performance gets easier but high performance gets harder, and that gap gets
> wider. Platform SDKs are helpful but they are not a huge leap.
>
> I've always thought that virtual machines could help supercomputing the way
> they have helped grid and cloud computing. This is the point where I need
> someone to prove whether I am right or wrong. PyPy is the most versatile,
> albeit complex, dynamic language implementation. I've been following the
> project for the last year and a half or so and I am impressed. I thought
> you could offer a vision of how interpreted languages and virtual machines
> can help manage this complexity.
>
> In addition, most postprocessing tools are written in Matlab, an
> interpreted language. Running not-so-high-performance tasks efficiently on
> a workstation is sometimes as important as running a simulation on a
> 12000-node supercomputer. It would be nice if someone would remind the
> audience that Matlab is not the only suitable (or best) tool for that job.

This is IMHO an issue with the expressivity of Matlab and Python. There are Python libraries for that, but Python does not have a domain-specific syntax, I guess - you'd sometimes need to spell out method names instead of using / and ./. Slicing does already exist, though. However, JIT-compiled Python would have the huge advantage of also making plain loops efficient, instead of forcing the user to rewrite all loops in terms of parallel matrix operations. That was the biggest slowdown in Matlab development I experienced, and I remember fixing it in somebody else's program, going from 12 hours to 1 minute of runtime - a 720x speedup. Interpreters can be much faster than that; even CPython is already faster in that area. The advantage of vectorized loops is that they easily use optimized BLAS routines, maybe SSE- or Cell-based ones. (I sketch the two styles in a small example further below.)

> I'm very interested in your comments.

Now, the first question is: do you need Python to be faster than Matlab, or faster than Fortran? VMs can be faster than C++ (for instance, static C++ can't inline virtual methods), and that makes a huge difference. I'm unaware of why one can't use one's existing Fortran source on the Cell. OK, I can wildly guess that, with two specialized microprocessors on one chip, one might want an automatic parallelizer to split some operations onto one and some onto the other? Can you give a better example?
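To make the loops-versus-vectorization point above concrete, here is a minimal sketch. It is my own toy example, assuming NumPy is available; the function names and array size are made up, and the 720x figure above was a Matlab case, not a measurement of this code.

import numpy as np

def dot_loop(a, b):
    # Explicit element-by-element loop: the style a JIT would need to make
    # fast. On CPython every iteration pays interpreter and boxing overhead.
    total = 0.0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total

def dot_vectorized(a, b):
    # Vectorized form: one call that dispatches to an optimized
    # (typically BLAS-backed) routine.
    return float(np.dot(a, b))

if __name__ == "__main__":
    a = np.random.rand(1000000)
    b = np.random.rand(1000000)
    print(dot_loop(a, b))
    print(dot_vectorized(a, b))

On CPython the explicit loop is typically orders of magnitude slower than the np.dot call; the point above is that a good JIT could narrow that gap without forcing the user to rewrite the loop.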
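And a tiny sketch of the virtual-method point, with hypothetical classes and not tied to any particular VM: a static compiler generally cannot know which area() implementation the loop below will call, while a VM that observes only one receiver type at run time can, in principle, specialize and inline that single implementation into the hot loop.

import math

class Circle(object):
    def __init__(self, r):
        self.r = r
    def area(self):
        return math.pi * self.r * self.r

class Square(object):
    def __init__(self, side):
        self.side = side
    def area(self):
        return self.side * self.side

def total_area(shapes):
    # Hot loop with a dynamically dispatched call: the target of
    # shape.area() is only known at run time.
    total = 0.0
    for shape in shapes:
        total += shape.area()
    return total

if __name__ == "__main__":
    # In this run the call site only ever sees Circle instances, so an
    # adaptive optimizer could inline Circle.area() into the loop.
    print(total_area([Circle(1.0) for _ in range(1000)]))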
Coming back to the Cell question: the obvious candidates are automatic vectorization and automatic parallelization, but I'm not aware of any VM-specific research on those topics, and automatic parallelization is a quite difficult topic anyway, as far as I know. In other words: is there any special advantage of adaptive optimization (even profile-based optimization) that static optimizations (like the ones done by ICC) cannot match? None is obvious for vectorization; automatic tuning of sizes comes to mind for auto-parallelization.

In general it is well known that the higher-level a language is, the more information the compiler has to optimize it, but also the more fancy features it has that are not trivial to optimize. Actually, Fortran is better than C exactly because it is more high-level. Most C code assumes, for instance, the GCC option -fno-strict-aliasing, which is contrary to the language semantics and forbids many interesting optimizations. See this website for more information (note: I found the link somewhere else, but be careful - according to Google and Firefox the server is under the control of hackers and is spreading viruses; I run Linux and I'm safe, YMMV):
http://www.cellperformance.com/mike_acton/2006/06/understanding_strict_aliasing.html

Having said that, the point is to understand which optimizations you currently perform manually that could be performed automatically by a VM. Note that your Fortran compiler probably has a far better static code optimizer. If you write plain Fortran-style code in Python, it's going to be much slower (as if no optimization were present) until dataflow analysis, register allocation, instruction scheduling, etc. are implemented in PyPy, after all the rest is finished. It's just a matter of implementation cost, but that cost is huge.

Where VMs shine is when they can do optimizations unavailable to static compilers, like adaptive optimizations. Inlining of virtual methods is one example; automatic prefetching from memory into cache (by adding SSE prefetch instructions) is another on which research has been done, just to give an example.

Regards
--
Paolo Giarrusso
