Phillip J. Eby added the comment:

Just a note: the zip files produced by distutils and friends (sdist, 
bdist_dumb, eggs) do not include entries for plain directories.  I would guess 
that this is also true for wheels at the moment, unless something was 
specifically done to work around this property of distutils-generated zip 
files.  So ISTM the right thing to do is to synthesize the directory entries at 
directory read time, when the file entries are being looped over anyway.

Reviewing the patch, there is a performance optimization possible by making a 
slight change to the algorithm.  Currently the patch loops from the start of 
each string to the end, looking for path prefixes.  This means that the cost 
is determined by the length of the strings and especially by the average 
directory depth, since every prefix of every path gets a dictionary lookup.

However, there is a significant shortcut possible: looping from the *end* of 
each string to the beginning, it's possible to break out of the loop as soon as 
a prefix has already been seen.  This saves (depth-1) dictionary lookups in the 
average case, and only looks at the characters in the base filename unless a 
new directory is encountered.  The typical overhead is then one unicode 
substring, one dictionary lookup, and one strrchr per zipfile directory entry.  
(Which is very small compared to what else is going on at that point in the 
process.)

To elaborate, if you have paths of the form:

x/y/a
x/y/b
x/y/c/d

Then when processing 'x/y/a', you would first process x/y -- it's not in the 
dict, add it.  Then x -- not in the dict, add it.  Then you go to x/y/b, your 
first parent is x/y again -- but since it's in the dict you skip it, and don't 
even bother with the x.  Next you go to x/y/c/d; its first parent is x/y/c, 
which is not in the dict, so you add it, then x/y, which is, so you break out 
of the loop for that item.
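
The walk-through above can be sketched in Python roughly as follows.  This is 
only an illustration, not the patch itself: the function name 
add_implicit_dirs is hypothetical, a set stands in for the dict, and 
str.rfind stands in for strrchr.  It also assumes the set holds only prefixes 
added by this loop, so that a prefix being present implies all of its parents 
are present too.

```python
def add_implicit_dirs(paths):
    """Return the directory prefixes implied by the given file paths."""
    seen = set()
    for path in paths:
        # Start at the last "/" and walk toward the beginning of the string.
        end = path.rfind("/")
        while end > 0:
            prefix = path[:end]
            if prefix in seen:
                # Every shorter prefix was added when this one was added.
                break
            seen.add(prefix)
            end = path.rfind("/", 0, end)
    return seen


dirs = add_implicit_dirs(["x/y/a", "x/y/b", "x/y/c/d"])
print(sorted(dirs))
```

For the three example paths this yields the directories x, x/y and x/y/c.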

Basically, about all that would change would be the for() loop starting at the 
end of the string and going to the beginning, with the loop position still 
representing the end of the prefix to be extracted.  And a successful 
PyDict_Contains check would result in a break rather than a continue.

So, if the only concern keeping the patch from being accepted is that it adds 
to startup time, this approach would cut down quite a bit on the overhead for 
generating the path information, in cases of repeated prefixes.  (And in the 
common cases for zipfile use on sys.path, one would expect to see a lot of 
common prefixes, if only for package names.)
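
To get a rough feel for the savings, the number of dictionary lookups can be 
counted for both scan directions.  The sketch below is an assumption-laden 
model rather than a measurement of the patch: count_lookups is a hypothetical 
helper, a set stands in for the dict, and only lookups are counted, not 
substring or strrchr costs.

```python
def count_lookups(paths, reverse):
    """Count membership tests needed to collect all directory prefixes."""
    seen = set()
    lookups = 0
    for path in paths:
        # Index of every "/" in the path, i.e. the end of each prefix.
        cuts = [i for i, ch in enumerate(path) if ch == "/"]
        if reverse:
            cuts.reverse()
        for end in cuts:
            prefix = path[:end]
            lookups += 1
            if prefix in seen:
                if reverse:
                    break      # reverse scan: all parents are already known
                continue       # forward scan: must keep checking longer prefixes
            seen.add(prefix)
    return lookups, seen


example = ["x/y/a", "x/y/b", "x/y/c/d"]
forward_count, forward_dirs = count_lookups(example, reverse=False)
reverse_count, reverse_dirs = count_lookups(example, reverse=True)
print(forward_count, reverse_count)
```

On the three example paths the forward scan does 7 lookups and the reverse 
scan does 5, with both finding the same directories; the gap widens as more 
files share deep common prefixes.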

----------
nosy: +pje

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue14905>
_______________________________________