On 04/11/18 08:06, Steven D'Aprano wrote:
On Wed, Apr 11, 2018 at 02:21:17PM +1000, Chris Angelico wrote:
[...]
Yes, it will double the number of files. Actually quadruple it, if the
annotations and line numbers are in separate files too. But if most of
those extra files never need to be opened, then there's no cost to them.
And whatever extra cost there is, is amortized over the lifetime of the
interpreter.
Yes, if they are actually not needed. My question was about whether
that is truly valid.
We're never really going to know the effect on performance without
implementing and benchmarking the code. It might turn out that, to our
surprise, three quarters of the std lib relies on loading docstrings
during startup. But I doubt it.
Consider a very common use-case: an OS-provided
Python interpreter whose files are all owned by 'root'. Those will be
distributed with .pyc files for performance, but you don't want to
deprive the users of help() and anything else that needs docstrings
etc. So... are the docstrings lazily loaded or eagerly loaded?
What relevance is that they're owned by root?
If eagerly, you've doubled the number of file-open calls to initialize
the interpreter.
I do not understand why you think this is even an option. Has Serhiy
said something that I missed that makes this seem to be on the table?
That's not a rhetorical question -- I may have missed something. But I'm
sure he understands that doubling or quadrupling the number of file
operations during startup is not an optimization.
(Or quadrupled, if you need annotations and line
numbers and they're all separate.) If lazily, things are a lot more
complicated than the original description suggested, and there'd need
to be some semantic changes here.
What semantic change do you expect?
There's an implementation change, of course, but that's Serhiy's problem
to deal with and I'm sure that he has considered that. There should be
no semantic change. When you access obj.__doc__, then and only then are
the compiled docstrings for that module read from the disk.
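
To make "then and only then" concrete, here is a minimal sketch of the
behaviour I have in mind. It is not Serhiy's implementation: LazyDocTable,
Function and read_docstrings are hypothetical names, and a plain property
called `doc` stands in for the real __doc__ slot.

```python
class LazyDocTable:
    """Hypothetical per-module table of docstrings, filled on first read."""

    def __init__(self, loader):
        self._loader = loader   # callable standing in for "read from disk"
        self._table = None      # nothing loaded yet
        self.load_count = 0     # how many times the loader has run

    def get(self, qualname):
        if self._table is None:
            # First access to any docstring in the module triggers the load.
            self._table = self._loader()
            self.load_count += 1
        return self._table.get(qualname)


class Function:
    """Stand-in for a function object; `doc` plays the role of __doc__."""

    def __init__(self, qualname, table):
        self.qualname = qualname
        self._table = table

    @property
    def doc(self):
        return self._table.get(self.qualname)


def read_docstrings():
    # Pretend this unmarshals the module's docstring section.
    return {"spam": "Return spam.", "eggs": "Return eggs."}


table = LazyDocTable(read_docstrings)
spam = Function("spam", table)
eggs = Function("eggs", table)
```

Creating spam and eggs costs no load at all; the first read of either
`doc` runs the loader once, and later reads reuse the table.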
I don't know the current implementation of .pyc files, but I like
Antoine's suggestion of laying it out in four separate areas (plus
header), each one marshalled:
code
docstrings
annotations
line numbers
Aside from code, which is mandatory, the three other sections could be
None to represent "not available", as is the case when you pass -OO to
the interpreter, or they could be some other sentinel that means "load
lazily from the appropriate file", or they could be the marshalled data
directly in place to support byte-code only libraries.
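
A rough sketch of that layout, using only marshal and struct. This is not
the real .pyc format: the MAGIC header, the length prefixes and the
("lazy", filename) sentinel are all assumptions made up for illustration.

```python
import marshal
import struct

MAGIC = b"DEMO"  # hypothetical header, NOT the real .pyc magic number
SECTIONS = ("code", "docstrings", "annotations", "linenos")


def dump_sections(sections):
    """Serialise the four sections, each marshalled and length-prefixed."""
    out = [MAGIC]
    for name in SECTIONS:
        payload = marshal.dumps(sections[name])
        out.append(struct.pack("<I", len(payload)))
        out.append(payload)
    return b"".join(out)


def load_section(blob, wanted):
    """Unmarshal one section, skipping over the others without decoding them."""
    offset = len(MAGIC)
    if blob[:offset] != MAGIC:
        raise ValueError("bad header")
    for name in SECTIONS:
        (size,) = struct.unpack_from("<I", blob, offset)
        offset += 4
        if name == wanted:
            return marshal.loads(blob[offset:offset + size])
        offset += size
    raise KeyError(wanted)


code = compile("def spam():\n    return 42\n", "<demo>", "exec")
blob = dump_sections({
    "code": code,                                 # mandatory
    "docstrings": {"spam": "Return the answer."}, # data stored in place
    "annotations": None,                          # "not available"
    "linenos": ("lazy", "demo.linenos"),          # made-up "load lazily" sentinel
})
```

The length prefixes are what let a reader jump straight to the code
section at startup and leave the other three untouched.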
As for the in-memory data structures of objects themselves, I imagine
something like the __doc__ and __annotations__ slots pointing to a table
of strings, which is not initialised until you attempt to read from the
table. Or something -- don't pay too much attention to my wild guesses.
A __doc__ sentinel could even say something like "bytes 350--420 in the
original .py file, as UTF-8".
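
Purely as an illustration of that last guess, such a sentinel could be as
small as a (path, start, end) triple that is resolved on demand. SourceSpan
and its resolve() method are hypothetical, and the offsets below are
computed from a throwaway temporary file rather than a real module.

```python
import os
import tempfile
from typing import NamedTuple


class SourceSpan(NamedTuple):
    """Hypothetical sentinel: a byte range in the original .py file."""
    path: str
    start: int  # byte offset, not character offset
    end: int

    def resolve(self):
        # Read only the requested bytes and decode them as UTF-8.
        with open(self.path, "rb") as f:
            f.seek(self.start)
            return f.read(self.end - self.start).decode("utf-8")


doc_text = "Return the answer, na\u00efvely."
source = ('def spam():\n    """' + doc_text + '"""\n    return 42\n').encode("utf-8")
start = source.index(doc_text.encode("utf-8"))
end = start + len(doc_text.encode("utf-8"))

with tempfile.NamedTemporaryFile(suffix=".py", delete=False) as f:
    f.write(source)

span = SourceSpan(f.name, start, end)
doc = span.resolve()
os.unlink(f.name)
```

The non-ASCII character is there on purpose: the span is measured in
bytes, so it is one longer than the string's length in characters.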
The bottom line is, is there some reason *aside from performance* to
avoid this? Because if the performance is worse, I'm sure Serhiy will be
the first to dump this idea.