On Fri, Feb 10, 2012 at 15:07, PJ Eby <p...@telecommunity.com> wrote:
> On Fri, Feb 10, 2012 at 1:05 PM, Brett Cannon <br...@python.org> wrote:
>
>> On Thu, Feb 9, 2012 at 17:00, PJ Eby <p...@telecommunity.com> wrote:
>>
>>> I did some crude timeit tests on frozenset(listdir()) and trapping
>>> failed stat calls. It looks like, for a Windows directory the size of the
>>> 2.7 stdlib, you need about four *failed* import attempts to overcome the
>>> initial caching cost, or about 8 successful bytecode imports. (For Linux,
>>> you might need to double these numbers; my tests showed a different ratio
>>> there, perhaps due to the Linux stdlib I tested having nearly twice as many
>>> directory entries as the directory I tested on Windows!)
>>>
>>> However, the numbers are much better for application directories than
>>> for the stdlib, since they are located earlier on sys.path. Every
>>> successful stdlib import in an application is equal to one failed import
>>> attempt for every preceding directory on sys.path, so as long as the
>>> average directory on sys.path isn't vastly larger than the stdlib, and the
>>> average application imports at least four modules from the stdlib (on
>>> Windows, or 8 on Linux), there would be a net performance gain for the
>>> application as a whole. (That is, there'd be an improved per-sys.path
>>> entry import time for stdlib modules, even if not for any application
>>> modules.)
>>>
>>
>> Does this comment take into account the number of modules required to
>> load the interpreter to begin with? That's already like 48 modules loaded
>> by Python 3.2 as it is.
>>
>
> I didn't count those, no. So, if they're loaded from disk *after*
> importlib is initialized, then they should pay off the cost of caching even
> fairly large directories that appear earlier on sys.path than the stdlib.
> We still need to know about NFS and other ratios, though... I still worry
> that people with more extreme directory sizes or slow-access situations
> will run into even worse trouble than they have now.
>

It's possible. No way to make it work for everyone. This is why I didn't
worry about some crazy perf optimization.

>
>> First is that if this were used on Windows or OS X (i.e. the OSs we
>> support that typically have case-insensitive filesystems), then this
>> approach would be a massive gain as we already call os.listdir() when
>> PYTHONCASEOK isn't defined to check case-sensitivity; take your 5 stat
>> calls and add in 5 listdir() calls and that's what you get on Windows and
>> OS X right now. Linux doesn't have this check so you would still be
>> potentially paying a penalty there.
>>
>
> Wow. That means it'd always be a win for pre-stdlib sys.path entries,
> because any successful stdlib import equals a failed pre-stdlib lookup.
> (Of course, that's just saving some of the overhead that's been *added* by
> importlib, not a new gain, but still...)
>

How so? import.c does a listdir() as well (this is not special to
importlib).

>
>> Second is variance in filesystems. Are we guaranteed that the stat of a
>> directory is updated before a file change is made?
>>
>
> Not quite sure what you mean here. The directory stat is used to ensure
> that new files haven't been added, old ones removed, or existing ones
> renamed. Changes to the files themselves shouldn't factor in, should they?
>

Changes in any fashion to the directory. Do filesystems atomically update
the mtime of a directory when they commit a change? Otherwise we have a
potential race condition.

>
>> Else there is a small race condition there which would suck. We also have
>> the issue of granularity; Antoine has already had to add the source file
>> size to .pyc files in Python 3.3 to combat crappy mtime granularity when
>> generating bytecode. If we get file mod -> import -> file mod -> import,
>> are we guaranteed that the second import will know there was a modification
>> if the first three steps occur fast enough to fit within the granularity of
>> an mtime value?
>>
>
> Again, I'm not sure how this relates. Automatic code reloaders monitor
> individual files that have been previously imported, so the directory
> timestamps aren't relevant.
>

Don't care about automatic reloaders. I'm just asking about the case where
the mtime granularity is coarse enough to allow for a directory change, an
import to execute, and then another directory change to occur all within a
single mtime increment. That would leave the set cache out of date.

> Of course, I could be confused here. Are you saying that if somebody
> makes a new .py file and saves it, that it'll be possible to import it
> before it's finished being written? If so, that could happen already, and
> again caching the directory doesn't make any difference.
>
> Alternately, you could have a situation where the file is deleted after we
> load the listdir(), but in that case the open will fail and we can fall
> back... heck, we can even force resetting the cache in that event.
>
>> I was going to say something about __pycache__, but it actually doesn't
>> affect this. Since you would have to stat the directory anyway, you might
>> as well just stat the directory for the file you want, to keep it simple.
>> Only if you consider __pycache__ to be immutable except for what the
>> interpreter puts in that directory during execution could you optimize
>> that step (in which case you can stat the directory once and never care
>> again as the set would be just updated by import whenever a new .pyc file
>> was written).
>>
>> Having said all of this, implementing this idea would be trivial using
>> importlib if you don't try to optimize the __pycache__ case. It's just a
>> question of whether people are comfortable with the semantic change to
>> import. This could also be made into something that was in importlib for
>> people to use when desired if we are too worried about semantic changes.
>>
>
> Yep.
> I was actually thinking this could be backported to 2.x, even
> without importlib, as a module to be imported in sitecustomize or via a
> .pth file. All it needs is a path hook, after all, and a subclass of the
> pkgutil importer to test it. And if we can get some people with huge NFS
> libraries and/or zillions of .egg directories on sys.path to test it, we
> could find out whether it's a win, lose, or draw for those scenarios.
>

You can do that if you want; obviously I don't want to bother since it
won't make it into Python 2.7.
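
For anyone following along, here is a rough sketch of the kind of cache being discussed: a frozenset(listdir()) per directory, revalidated by comparing the directory's mtime on each lookup. The class name DirectoryCache and its invalidate() method are made up for illustration; this is not importlib's code and not either poster's implementation. It also does nothing about the granularity problem raised above: two directory changes inside one mtime tick can still leave the set stale.

```python
import os


class DirectoryCache:
    """Hypothetical sketch: cache a directory's entry names as a frozenset
    and rebuild it only when the directory's mtime changes."""

    def __init__(self, path):
        self.path = path
        self._mtime = None          # mtime seen when the set was last built
        self._entries = frozenset()

    def _refresh_if_stale(self):
        try:
            # Stat first, then list: if the directory changes in between,
            # the recorded mtime is the older one, so the next lookup
            # notices the newer mtime and rebuilds again.
            mtime = os.stat(self.path).st_mtime
            if mtime != self._mtime:
                self._entries = frozenset(os.listdir(self.path))
                self._mtime = mtime
        except OSError:
            # Directory missing or unreadable: behave as if it were empty.
            self._mtime = None
            self._entries = frozenset()

    def __contains__(self, name):
        self._refresh_if_stale()
        return name in self._entries

    def invalidate(self):
        """Force a rebuild on the next lookup, e.g. after an open() fails
        on a name the cache claimed was present."""
        self._mtime = None
```

A finder would test something like `"spam.py" in cache` before stat-ing or opening the file, and call invalidate() and fall back to the uncached path if the open fails. Each lookup still costs one stat of the directory, so the saving comes only from avoiding the per-candidate stat calls on failed lookups.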
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com