On Fri, Feb 10, 2012 at 1:05 PM, Brett Cannon <br...@python.org> wrote:

>
>
> On Thu, Feb 9, 2012 at 17:00, PJ Eby <p...@telecommunity.com> wrote:
>
>> I did some crude timeit tests on frozenset(listdir()) and trapping failed
>> stat calls.  It looks like, for a Windows directory the size of the 2.7
>> stdlib, you need about four *failed* import attempts to overcome the
>> initial caching cost, or about 8 successful bytecode imports.  (For Linux,
>> you might need to double these numbers; my tests showed a different ratio
>> there, perhaps due to the Linux stdlib I tested having nearly twice as many
>> directory entries as the directory I tested on Windows!)
>>
>
>> However, the numbers are much better for application directories than for
>> the stdlib, since they are located earlier on sys.path.  Every successful
>> stdlib import in an application is equal to one failed import attempt for
>> every preceding directory on sys.path, so as long as the average directory
>> on sys.path isn't vastly larger than the stdlib, and the average
>> application imports at least four modules from the stdlib (on Windows, or 8
>> on Linux), there would be a net performance gain for the application as a
>> whole.  (That is, there'd be an improved per-sys.path entry import time for
>> stdlib modules, even if not for any application modules.)
>>
>
> Does this comment take into account the number of modules required to load
> the interpreter to begin with? That's already like 48 modules loaded by
> Python 3.2 as it is.
>

I didn't count those, no.  So, if they're loaded from disk *after*
importlib is initialized, then they should pay off the cost of caching even
fairly large directories that appear earlier on sys.path than the stdlib.
We still need to know about NFS and other ratios, though...  I still worry
that people with more extreme directory sizes or slow-access situations
will run into even worse trouble than they have now.
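
For anyone who wants to collect those ratios on their own filesystem, here is
a rough sketch of the kind of crude comparison described above. It is not the
test that was actually run. The helper names (measure, break_even) are made up
for illustration, and the default of 5 stat calls per failed import is an
assumption borrowed from the "5 stat calls" figure quoted below.

```python
import os
import timeit


def measure(directory, probes=1000):
    """Return (seconds per listdir snapshot, seconds per failed stat)."""
    # A name that is assumed not to exist in the directory being probed.
    missing = os.path.join(directory, "no_such_module_for_timing.py")

    def snapshot():
        # The up-front caching cost: read every entry name once.
        return frozenset(os.listdir(directory))

    def failed_stat():
        # What an uncached lookup pays for each candidate filename.
        try:
            os.stat(missing)
        except OSError:
            pass

    listdir_cost = timeit.timeit(snapshot, number=probes) / probes
    stat_cost = timeit.timeit(failed_stat, number=probes) / probes
    return listdir_cost, stat_cost


def break_even(listdir_cost, stat_cost, stats_per_failed_import=5):
    """Failed imports needed before one cached listing pays for itself.

    stats_per_failed_import is an assumption, not a measured value.
    """
    return listdir_cost / (stat_cost * stats_per_failed_import)
```

Calling break_even(*measure(some_directory)) on an NFS mount or a very large
directory would give the number of failed lookups needed there, which is the
ratio we are still missing.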



> First is that if this were used on Windows or OS X (i.e. the OSs we
> support that typically have case-insensitive filesystems), then this
> approach would be a massive gain as we already call os.listdir() when
> PYTHONCASEOK isn't defined to check case-sensitivity; take your 5 stat
> calls and add in 5 listdir() calls and that's what you get on Windows and
> OS X right now. Linux doesn't have this check so you would still be
> potentially paying a penalty there.
>

Wow.  That means it'd always be a win for pre-stdlib sys.path entries,
because any successful stdlib import equals a failed pre-stdlib lookup.
(Of course, that's just saving some of the overhead that's been *added* by
importlib, not a new gain, but still...)


> Second is variance in filesystems. Are we guaranteed that the stat of a
> directory is updated before a file change is made?
>

Not quite sure what you mean here.  The directory stat is used to ensure
that new files haven't been added, old ones removed, or existing ones
renamed.  Changes to the files themselves shouldn't factor in, should they?
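
To make that concrete, here is a minimal sketch of the check I have in mind,
assuming the directory's mtime is what gets compared. DirectoryListing is a
hypothetical name, not anything in importlib. Note that only the directory is
stat'ed; the contents of individual files never enter into it.

```python
import os


class DirectoryListing:
    """Cache of a directory's entry names, refreshed when its mtime changes."""

    def __init__(self, path):
        self.path = path
        self._mtime = None          # mtime seen at the last refresh
        self._names = frozenset()   # entry names seen at the last refresh

    def names(self):
        # One stat of the directory per lookup; adding, removing, or
        # renaming an entry is expected to change the directory's mtime.
        mtime = os.stat(self.path).st_mtime_ns
        if mtime != self._mtime:
            self._names = frozenset(os.listdir(self.path))
            self._mtime = mtime
        return self._names

    def __contains__(self, filename):
        return filename in self.names()
```

With this, a lookup costs one directory stat plus a set membership test, no
matter how many filename variants are being probed.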



> Else there is a small race condition there which would suck. We also have
> the issue of granularity; Antoine has already had to add the source file
> size to .pyc files in Python 3.3 to combat crappy mtime granularity when
> generating bytecode. If we get file mod -> import -> file mod -> import,
> are we guaranteed that the second import will know there was a modification
> if the first three steps occur fast enough to fit within the granularity of
> an mtime value?
>

Again, I'm not sure how this relates.  Automatic code reloaders monitor
individual files that have been previously imported, so the directory
timestamps aren't relevant.

Of course, I could be confused here.  Are you saying that if somebody makes
a new .py file and saves it, that it'll be possible to import it before
it's finished being written?  If so, that could happen already, and again
caching the directory doesn't make any difference.

Alternately, you could have a situation where the file is deleted after we
load the listdir(), but in that case the open will fail and we can fall
back...  heck, we can even force resetting the cache in that event.
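
A sketch of that fallback, under the assumption that a stale listing shows up
as a failed open. CachedDirectory and read_source are hypothetical names used
only for illustration.

```python
import os


class CachedDirectory:
    """Directory listing cache that resets itself when it proves stale."""

    def __init__(self, path):
        self.path = path
        self._names = None  # None means "not listed yet"

    def reset(self):
        self._names = None

    def names(self):
        if self._names is None:
            self._names = frozenset(os.listdir(self.path))
        return self._names

    def read_source(self, filename):
        """Return the file's bytes, or None if it cannot be found."""
        if filename not in self.names():
            return None  # cheap miss: no stat and no open
        try:
            with open(os.path.join(self.path, filename), "rb") as f:
                return f.read()
        except FileNotFoundError:
            # The file vanished after the listing was taken, so the
            # cache is stale: drop it and report a miss.
            self.reset()
            return None
```

The next lookup after a reset re-reads the directory, so the worst case is
one wasted open per stale entry.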


> I was going to say something about __pycache__, but it actually doesn't
> affect this. Since you would have to stat the directory anyway, you might
> as well just stat the directory for the file you want, to keep it simple. Only
> if you consider __pycache__ to be immutable except for what the interpreter
> puts in that directory during execution could you optimize that step (in
> which case you can stat the directory once and never care again as the set
> would be just updated by import whenever a new .pyc file was written).
>
> Having said all of this, implementing this idea would be trivial using
> importlib if you don't try to optimize the __pycache__ case. It's just a
> question of whether people are comfortable with the semantic change to
> import. This could also be made into something that was in importlib for
> people to use when desired if we are too worried about semantic changes.
>

Yep.  I was actually thinking this could be backported to 2.x, even without
importlib, as a module to be imported in sitecustomize or via a .pth file.
All it needs is a path hook, after all, and a subclass of the pkgutil
importer to test it.  And if we can get some people with huge NFS libraries
and/or zillions of .egg directories on sys.path to test it, we could find
out whether it's a win, lose, or draw for those scenarios.
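
For illustration only, here is roughly what the finder half of such a path
hook could look like. This sketch is written against the Python 3 import
machinery (find_spec) rather than as the 2.x pkgutil subclass mentioned above,
and ListingFinder is a hypothetical name. It only handles plain .py modules
and ignores packages, extension modules, bytecode-only files, and case
sensitivity, so it should not be installed ahead of the default hook as
written.

```python
import importlib.util
import os


class ListingFinder:
    """Path entry finder that consults a cached listing before touching disk."""

    def __init__(self, path):
        # Path hooks signal "not mine" by raising ImportError.
        if not os.path.isdir(path):
            raise ImportError("not a directory", path=path)
        self.path = path
        self._names = frozenset(os.listdir(path))

    def find_spec(self, fullname, target=None):
        filename = fullname.rpartition(".")[2] + ".py"
        if filename not in self._names:
            return None  # failed lookup answered from the cache alone
        return importlib.util.spec_from_file_location(
            fullname, os.path.join(self.path, filename))

    def invalidate_caches(self):
        # Re-read the directory when asked to drop cached state.
        self._names = frozenset(os.listdir(self.path))
```

Since the class is callable with a single path argument, it has the shape
sys.path_hooks expects, which is what would make a test module of this kind
easy to hand to people with large NFS trees or many .egg directories.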