On Fri, Feb 10, 2012 at 15:07, PJ Eby <p...@telecommunity.com> wrote:

> On Fri, Feb 10, 2012 at 1:05 PM, Brett Cannon <br...@python.org> wrote:
>
>>
>>
>> On Thu, Feb 9, 2012 at 17:00, PJ Eby <p...@telecommunity.com> wrote:
>>
>>> I did some crude timeit tests on frozenset(listdir()) and trapping
>>> failed stat calls.  It looks like, for a Windows directory the size of the
>>> 2.7 stdlib, you need about four *failed* import attempts to overcome the
>>> initial caching cost, or about 8 successful bytecode imports.  (For Linux,
>>> you might need to double these numbers; my tests showed a different ratio
>>> there, perhaps due to the Linux stdlib I tested having nearly twice as many
>>> directory entries as the directory I tested on Windows!)
>>>
>>
>>> However, the numbers are much better for application directories than
>>> for the stdlib, since they are located earlier on sys.path.  Every
>>> successful stdlib import in an application is equal to one failed import
>>> attempt for every preceding directory on sys.path, so as long as the
>>> average directory on sys.path isn't vastly larger than the stdlib, and the
>>> average application imports at least four modules from the stdlib (on
>>> Windows, or 8 on Linux), there would be a net performance gain for the
>>> application as a whole.  (That is, there'd be an improved per-sys.path
>>> entry import time for stdlib modules, even if not for any application
>>> modules.)
>>>
>>
>> Does this comment take into account the number of modules required to
>> load the interpreter to begin with? That's already like 48 modules loaded
>> by Python 3.2 as it is.
>>
>
> I didn't count those, no.  So, if they're loaded from disk *after*
> importlib is initialized, then they should pay off the cost of caching even
> fairly large directories that appear earlier on sys.path than the stdlib.
>  We still need to know about NFS and other ratios, though...  I still worry
> that people with more extreme directory sizes or slow-access situations
> will run into even worse trouble than they have now.
>

It's possible. There's no way to make it work for everyone, which is why I
didn't worry about some crazy perf optimization.
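
For anyone who wants to collect the NFS and other ratios, here is a rough sketch of the kind of measurement being discussed: the cost of building frozenset(listdir()) against the cost of one failed stat. The helper names (make_dir, failed_stat, build_cache, break_even) are made up for illustration, and a real failed import attempt issues several stat calls, so treat the number as a ratio of primitives rather than of imports.

```python
import os
import tempfile
import timeit


def make_dir(n_entries):
    """Create a scratch directory holding n_entries empty .py files."""
    directory = tempfile.mkdtemp()
    for i in range(n_entries):
        open(os.path.join(directory, "mod%d.py" % i), "w").close()
    return directory


def failed_stat(directory, name="no_such_module.py"):
    """Stat a file that is assumed not to exist and trap the failure."""
    try:
        os.stat(os.path.join(directory, name))
    except OSError:
        return False
    return True


def build_cache(directory):
    """The up-front cost under discussion: one listdir() turned into a set."""
    return frozenset(os.listdir(directory))


def break_even(directory, repeat=200):
    """Return how many failed stat calls cost as much as one cache build."""
    cache_cost = timeit.timeit(lambda: build_cache(directory), number=repeat)
    stat_cost = timeit.timeit(lambda: failed_stat(directory), number=repeat)
    return cache_cost / stat_cost
```

Calling break_even(make_dir(2000)) on a local disk and again on a network mount would give the two ratios to compare; the directory size is a parameter precisely because the stdlib sizes differed between the platforms tested.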


>
>
>
>> First is that if this were used on Windows or OS X (i.e. the OSs we
>> support that typically have case-insensitive filesystems), then this
>> approach would be a massive gain as we already call os.listdir() when
>> PYTHONCASEOK isn't defined to check case-sensitivity; take your 5 stat
>> calls and add in 5 listdir() calls and that's what you get on Windows and
>> OS X right now. Linux doesn't have this check so you would still be
>> potentially paying a penalty there.
>>
>
> Wow.  That means it'd always be a win for pre-stdlib sys.path entries,
> because any successful stdlib import equals a failed pre-stdlib lookup.
>  (Of course, that's just saving some of the overhead that's been *added* by
> importlib, not a new gain, but still...)
>

How so? import.c does a listdir() as well (this is not special to
importlib).
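
For concreteness, the case check amounts to something like the following. This is a rough sketch, not the actual import.c or importlib code, and both function names are invented for the example.

```python
import os


def case_ok_uncached(directory, filename):
    """Case check that pays for a fresh listdir() on every lookup."""
    if "PYTHONCASEOK" in os.environ:
        # The user asked for case-insensitive imports; skip the check.
        return True
    # listdir() reports names as stored on disk, so an exact match means
    # the requested case is the real one.
    return filename in os.listdir(directory)


def case_ok_cached(listing, filename):
    """Same check against a listing that was built once and kept around."""
    if "PYTHONCASEOK" in os.environ:
        return True
    return filename in listing
```

With a cached frozenset(os.listdir(directory)) the per-lookup listdir() turns into a set membership test, which is where the saving on case-insensitive filesystems would come from.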


>
>
>> Second is variance in filesystems. Are we guaranteed that the stat of a
>> directory is updated before a file change is made?
>>
>
> Not quite sure what you mean here.  The directory stat is used to ensure
> that new files haven't been added, old ones removed, or existing ones
> renamed.  Changes to the files themselves shouldn't factor in, should they?
>

Changes in any fashion to the directory. Do filesystems atomically update
the mtime of a directory when they commit a change? Otherwise we have a
potential race condition.
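
To make the question concrete, here is a minimal sketch of the cache being discussed, assuming the directory's st_mtime_ns is the only invalidation signal. DirectoryCache is a hypothetical name, not an existing importlib class.

```python
import os


class DirectoryCache:
    """Cache a directory listing and refresh it when the mtime changes."""

    def __init__(self, path):
        self.path = path
        self._mtime_ns = None
        self._names = frozenset()

    def names(self):
        # Stat *before* listing: a change that lands between the two calls
        # is then either in the listing already or shows up as a newer
        # mtime on the next check. This only holds if the filesystem
        # updates the directory mtime together with the change itself.
        mtime_ns = os.stat(self.path).st_mtime_ns
        if mtime_ns != self._mtime_ns:
            self._names = frozenset(os.listdir(self.path))
            self._mtime_ns = mtime_ns
        return self._names

    def __contains__(self, filename):
        return filename in self.names()
```

If a filesystem commits the entry first and bumps the mtime later, names() can record the old mtime alongside a listing that misses nothing yet still be fine, or record it alongside a listing that is already stale; the second ordering is the race.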


>
>
>
>> Else there is a small race condition there which would suck. We also have
>> the issue of granularity; Antoine has already had to add the source file
>> size to .pyc files in Python 3.3 to combat crappy mtime granularity when
>> generating bytecode. If we get file mod -> import -> file mod -> import,
>> are we guaranteed that the second import will know there was a modification
>> if the first three steps occur fast enough to fit within the granularity of
>> an mtime value?
>>
>
> Again, I'm not sure how this relates.  Automatic code reloaders monitor
> individual files that have been previously imported, so the directory
> timestamps aren't relevant.
>

Don't care about automatic reloaders. I'm just asking about the case where
the mtime granularity is coarse enough for a directory change, an import,
and then another directory change to all occur within a single mtime
increment. That would leave the set cache out of date.
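
A sketch of that failure mode, with the coarse clock simulated by putting the directory's mtime back after the second change. The function names are made up, and os.utime() stands in for a filesystem whose timestamp simply did not tick.

```python
import os
import tempfile


def listing_if_changed(path, cached):
    """Return (mtime_ns, names), reusing `cached` when the mtime matches."""
    mtime_ns = os.stat(path).st_mtime_ns
    if cached is not None and cached[0] == mtime_ns:
        return cached
    return (mtime_ns, frozenset(os.listdir(path)))


def demo_stale_cache():
    """Return whether the cache notices a file added within one mtime tick."""
    with tempfile.TemporaryDirectory() as directory:
        # First directory change, followed by the "import" that caches.
        open(os.path.join(directory, "first.py"), "w").close()
        before = os.stat(directory)
        cached = listing_if_changed(directory, None)

        # Second directory change inside the same (simulated) mtime tick.
        open(os.path.join(directory, "second.py"), "w").close()
        os.utime(directory, ns=(before.st_atime_ns, before.st_mtime_ns))

        # The second "import": the mtime matches, so the old set is reused.
        cached = listing_if_changed(directory, cached)
        return "second.py" in cached[1]
```

Here demo_stale_cache() returns False: second.py exists on disk but the cached set never learns about it until something else touches the directory.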


> Of course, I could be confused here.  Are you saying that if somebody
> makes a new .py file and saves it, that it'll be possible to import it
> before it's finished being written?  If so, that could happen already, and
> again caching the directory doesn't make any difference.
>
> Alternately, you could have a situation where the file is deleted after we
> load the listdir(), but in that case the open will fail and we can fall
> back...  heck, we can even force resetting the cache in that event.
>
>
>> I was going to say something about __pycache__, but it actually doesn't
>> affect this. Since you would have to stat the directory anyway, you might
>> as well just stat the directory for the file you want, to keep it simple. Only
>> if you consider __pycache__ to be immutable except for what the interpreter
>> puts in that directory during execution could you optimize that step (in
>> which case you can stat the directory once and never care again as the set
>> would be just updated by import whenever a new .pyc file was written).
>>
>> Having said all of this, implementing this idea would be trivial using
>> importlib if you don't try to optimize the __pycache__ case. It's just a
>> question of whether people are comfortable with the semantic change to
>> import. This could also be made into something that was in importlib for
>> people to use when desired if we are too worried about semantic changes.
>>
>
> Yep.  I was actually thinking this could be backported to 2.x, even
> without importlib, as a module to be imported in sitecustomize or via a
> .pth file.  All it needs is a path hook, after all, and a subclass of the
> pkgutil importer to test it.  And if we can get some people with huge NFS
> libraries and/or zillions of .egg directories on sys.path to test it, we
> could find out whether it's a win, lose, or draw for those scenarios.
>

You can do that if you want; obviously I don't want to bother, since it
won't make it into Python 2.7.
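
If someone does want to run the experiment, it could look roughly like this on Python 3. This is a sketch only: ListingFinder is a made-up name, extension modules are ignored, and a 2.x backport built on the pkgutil importer would look different.

```python
import importlib.machinery as machinery
import os


class ListingFinder:
    """Path entry finder that rejects misses from a cached listing."""

    def __init__(self, path):
        if not os.path.isdir(path):
            # Let the next path hook handle zip files and the like.
            raise ImportError("only directories are supported", path=path)
        self.path = path
        self._mtime_ns = None
        self._names = frozenset()
        # The stock finder does the real work once a candidate exists.
        self._delegate = machinery.FileFinder(
            path,
            (machinery.SourceFileLoader, machinery.SOURCE_SUFFIXES),
            (machinery.SourcelessFileLoader, machinery.BYTECODE_SUFFIXES),
        )

    def _listing(self):
        mtime_ns = os.stat(self.path).st_mtime_ns
        if mtime_ns != self._mtime_ns:
            self._names = frozenset(os.listdir(self.path))
            self._mtime_ns = mtime_ns
        return self._names

    def find_spec(self, fullname, target=None):
        tail = fullname.rpartition(".")[2]
        suffixes = machinery.SOURCE_SUFFIXES + machinery.BYTECODE_SUFFIXES
        # A package directory, or a module file with a known suffix.
        candidates = [tail] + [tail + suffix for suffix in suffixes]
        names = self._listing()
        if not any(candidate in names for candidate in candidates):
            return None  # failed lookup answered without touching the files
        return self._delegate.find_spec(fullname, target)

    def invalidate_caches(self):
        self._mtime_ns = None
        self._delegate.invalidate_caches()
```

Since the class takes a path and raises ImportError for entries it cannot handle, it can be installed with sys.path_hooks.insert(0, ListingFinder) followed by sys.path_importer_cache.clear(), for example from sitecustomize, and then timed against a large sys.path.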

>
>