Folks,

If it’s out of line in some way for me to make this comment on this list, let 
me know and I’ll stop! But I do feel strongly about one issue and think it’s 
worth mentioning, so here goes.

I read the “A better story for multi-core Python” thread with great interest 
because the GIL has actually been a major hindrance to me. I know that for many 
uses it’s a non-issue, but it was for me.

My situation was that I had a huge (technically mutable, but unchanging) data 
structure which needed a lot of analysis. CPU time was a major factor — things 
took days to run. But even so, my time as a programmer was much more important 
than CPU time. I needed to prototype different algorithms very quickly. Even 
Cython would have slowed me down too much. Also, I wanted to use the many great 
statistical functions in SciPy, so Python was an excellent choice for me in 
that respect.

So, even though pure Python might not be the right choice for this program in a 
production environment, it was the right choice for me at the time. And, if I 
could have accessed as many cores as I wanted, it might have been good enough in 
production too. But my work was hampered by one thing:

There was a huge data structure that all the analysis needed to access. Using a 
database would have slowed things down too much. Ideally, I needed to access 
this same structure from many cores at once. On a Power8 system, for example, 
with its larger number of cores, performance may well have been good enough for 
production. In any case, my experimentation and prototyping would have gone 
more quickly with more cores.

But this data structure was simply too big. Replicating it in different 
processes used memory far too quickly and was the limiting factor on the number 
of cores I could use. (I could fork with the big data structure already in 
memory, but copy-on-write issues due to reference counting caused multiple 
copies to exist anyway.)
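
To make the problem concrete, here is a rough sketch (my own illustration, with 
a made-up structure and worker count) of the fork-based pattern I was using, 
and of why it still ran out of memory:

    import os

    # A big structure of many small objects, built once before forking
    # (Unix only).  Stand-in for my real data, which was much larger.
    big = [("key%d" % i, float(i)) for i in range(5000000)]

    children = []
    for _ in range(4):
        pid = os.fork()
        if pid == 0:
            # Child: a read-only analysis pass over the shared structure.
            # Even though nothing is modified, every access updates each
            # object's reference count.  Those writes dirty the copy-on-write
            # pages, so the kernel duplicates them per child -- watch
            # Private_Dirty in /proc/<pid>/smaps (or RES in top) grow.
            total = sum(v for _, v in big)
            os._exit(0)
        children.append(pid)

    for pid in children:
        os.waitpid(pid, 0)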

So, one thing I am hoping comes out of any effort in the “A better story” 
direction is a way to share large data structures between processes. Two 
possible solutions come to mind:

1) Move the reference counts away from the data structures, so copy-on-write isn’t 
an issue. That sounds like a lot of work — I have no idea whether it’s 
practical. It has been mentioned in the “A better story” discussion, but I 
wanted to bring it up again in the context of my specific use-case. Also, it 
seems worth reiterating that even though copy-on-write forking is a Unix thing, 
the midipix project appears to bring it to Windows as well. (http://midipix.org)

2) Have a mode where a particular data structure is not reference counted or 
garbage collected. The programmer would be entirely responsible for manually 
calling del on the structure if he wants to free that memory. I would imagine 
this would be controversial because Python is currently designed in a very 
different way. However, I see no actual risk if one were to use an 
@manual_memory_management decorator or some technique like that to make it very 
clear that the programmer is taking responsibility. I.e., in general, 
information sharing between subinterpreters would occur through message 
passing. But there would be the option of the programmer taking responsibility 
for the memory management of a particular structure. In my case, the amount of work 
required for this would have been approximately zero — once the structure was 
created, it was needed for the lifetime of the process. 
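
To show what I mean, here is a purely hypothetical sketch of how it might look. 
Nothing below exists today: manual_memory_management is the decorator I’m 
imagining, written here as a no-op stub only so the intended usage is concrete, 
and the builder just fakes my real data.

    def manual_memory_management(func):
        # Hypothetical marker: in the proposal, the object returned by func
        # would be exempt from reference counting and garbage collection,
        # and freeing it would become the programmer's job.  As written
        # here it is only a no-op stub.
        return func

    @manual_memory_management
    def build_big_structure():
        # Stand-in for loading my real (much larger) data.
        return {i: (str(i), float(i)) for i in range(1000000)}

    big = build_big_structure()

    # Workers / subinterpreters would read `big` directly: with no reference
    # count writes there is nothing to dirty copy-on-write pages and nothing
    # to race on.  Everything else would still be shared by message passing.
    #
    # The programmer now owns the lifetime.  In my case nothing further would
    # ever be needed, since the structure lives until the process exits;
    # otherwise it would be freed explicitly with "del big".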

Under this second solution, there would be little need to actually remove the 
reference counts from the data structures — they just wouldn’t be accessed. 
Maybe it’s not a practical solution, if only because of the overhead of Python 
needing to check whether a given structure is manually managed or not. In that 
case, the first solution makes more sense.

In any case, I thought this was worth mentioning because it has been a real 
problem for me, and I assume it has been a real problem for other people as 
well. If a solution is both possible and practical, that would be great.

Thank you for listening,
Gary


-- 

Gary Robinson
gary...@me.com
http://www.garyrobinson.net
