On 10/05/2013 11:55, Ben Hoyt wrote:
A few of us were having a discussion at
http://bugs.python.org/issue11406 about adding os.scandir(): a
generator version of os.listdir() to make iterating over very large
directories more memory efficient. This also reflects how the OS gives
things to you -- it doesn't give you a big list, but you call a
function to iterate and fetch the next entry.
While I think that's a good idea, I'm not sure that memory efficiency
alone is enough of an improvement to justify adding a generator
version.
But what would make this a killer feature is making os.scandir()
generate tuples of (name, stat_like_info). The Windows directory
iteration functions (FindFirstFile/FindNextFile) give you the full
stat information for free, and the Linux and OS X functions
(opendir/readdir) give you partial file information (d_type in the
dirent struct, which is basically the file-type part of st_mode: whether
it's a file, directory, link, etc.).
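
To make the d_type point concrete, here is a rough sketch of how a d_type value could be turned into the file-type bits of st_mode. The DT_* numbers are the Linux <dirent.h> values and the 12-bit shift mirrors glibc's DTTOIF macro; other platforms may differ, and dtype_to_mode is a made-up helper name, not an existing API.

```python
import stat

# Linux <dirent.h> values for d_type (an assumption; other platforms may differ).
DT_UNKNOWN = 0
DT_DIR = 4
DT_REG = 8
DT_LNK = 10


def dtype_to_mode(d_type):
    """Return the file-type bits of st_mode for a d_type, or None if unknown.

    Mirrors glibc's DTTOIF macro, which shifts d_type left by 12 bits.
    The permission bits are not available from d_type and stay zero.
    """
    if d_type == DT_UNKNOWN:
        return None
    return d_type << 12


print(stat.S_ISDIR(dtype_to_mode(DT_DIR)))
print(stat.S_ISREG(dtype_to_mode(DT_REG)))
```

Note that this only recovers the file type; size, timestamps and permissions would still need a real stat call.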
Having this available at the Python level would mean we can vastly
speed up functions like os.walk() that otherwise need to make an
os.stat() call for every file returned. In my benchmarks of such a
generator on Windows, it speeds up os.walk() by 9-10x. On Linux/OS X,
it's more like 1.5-3x. In my opinion, that kind of gain is huge,
especially on Windows, but also on Linux/OS X.
So the idea is to add this relatively low-level function that exposes
the extra information the OS gives us for free, but which os.listdir()
currently throws away. Then higher-level, platform-independent
functions like os.walk() could use os.scandir() to get much better
performance. People over at Issue 11406 think this is a good idea.
HOWEVER, there's debate over what kind of object the second element in
the tuple, "stat_like_info", should be. My strong vote is for it to be
a stat_result-like object, but where the fields are None if they're
unknown. There would be basically three scenarios:
1) stat_result with all fields set: this would happen on Windows,
where you get as much info from FindFirst/FindNext as from an
os.stat()
2) stat_result with just st_mode set, and all other fields None: this
would be the usual case on Linux/OS X
3) stat_result with all fields None: this would happen on systems
whose readdir()/dirent doesn't have d_type, or on Linux/OS X when
d_type was DT_UNKNOWN
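
The three scenarios could be modelled roughly as follows. This is only a sketch: PartialStat and the three helper functions are hypothetical names, and only three of the real stat_result fields are included to keep it short.

```python
import collections
import os
import stat

# Hypothetical stand-in for the proposed stat_result-like object.
# Only three of the real stat_result fields are modelled, for brevity.
PartialStat = collections.namedtuple("PartialStat", "st_mode st_size st_mtime")


def full_info(path):
    """Scenario 1: every field known (as on Windows)."""
    st = os.lstat(path)
    return PartialStat(st.st_mode, st.st_size, st.st_mtime)


def type_only(mode_type_bits):
    """Scenario 2: only st_mode known (the usual Linux/OS X case)."""
    return PartialStat(mode_type_bits, None, None)


def nothing_known():
    """Scenario 3: no d_type available, or d_type was DT_UNKNOWN."""
    return PartialStat(None, None, None)


print(full_info(os.getcwd()))
print(type_only(stat.S_IFDIR))
print(nothing_known())
```

A caller can then tell the scenarios apart just by testing individual fields against None.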
Higher-level functions like os.walk() would then check that the fields
they need are not None, and only call os.stat() if necessary, for example:
    # Build lists of files and directories in path
    files = []
    dirs = []
    for name, st in os.scandir(path):
        if st.st_mode is None:
            st = os.stat(os.path.join(path, name))
        if stat.S_ISDIR(st.st_mode):
            dirs.append(name)
        else:
            files.append(name)
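
For anyone who wants to try the loop above, here is a minimal runnable sketch. It assumes a stand-in generator, fake_scandir, that yields (name, st) pairs in the proposed shape; it is emulated with os.listdir() and os.lstat(), so it shows the control flow but none of the speedup. ModeOnly, fake_scandir, split_entries and the hide_types flag are all hypothetical names.

```python
import collections
import os
import stat
import tempfile

# Hypothetical stand-in for the proposed stat_result-like object.
ModeOnly = collections.namedtuple("ModeOnly", "st_mode")


def fake_scandir(path, hide_types=False):
    """Yield (name, st) pairs shaped like the proposed os.scandir() output.

    Emulated with os.listdir() and os.lstat(), so there is no speedup here.
    With hide_types=True every st_mode is None, as in scenario 3.
    """
    for name in os.listdir(path):
        if hide_types:
            yield name, ModeOnly(None)
        else:
            yield name, ModeOnly(os.lstat(os.path.join(path, name)).st_mode)


def split_entries(path, hide_types=False):
    """Return (files, dirs) using the loop from the message above."""
    files = []
    dirs = []
    for name, st in fake_scandir(path, hide_types):
        if st.st_mode is None:
            # Type unknown: fall back to a real stat call.
            st = os.stat(os.path.join(path, name))
        if stat.S_ISDIR(st.st_mode):
            dirs.append(name)
        else:
            files.append(name)
    return sorted(files), sorted(dirs)


# Small demonstration in a throwaway directory.
demo = tempfile.mkdtemp()
os.mkdir(os.path.join(demo, "sub"))
with open(os.path.join(demo, "a.txt"), "w") as f:
    f.write("x")
print(split_entries(demo))
print(split_entries(demo, hide_types=True))
```

Both calls should give the same split; the second simply takes the os.stat() fallback path for every entry.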
Not bad for a 2-10x performance boost, right? What do folks think?
Cheers,
Ben.
[snip]
On the python-ideas list there's a thread "PEP: Extended stat_result"
about adding methods to stat_result.
Using that, you wouldn't necessarily have to look at st.st_mode. The
method could perform an additional os.stat() if the field was None. For
example:
    # Build lists of files and directories in path
    files = []
    dirs = []
    for name, st in os.scandir(path):
        if st.is_dir():
            dirs.append(name)
        else:
            files.append(name)
That looks much nicer.
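
As a rough illustration of how such a method might behave, here is a sketch of a stat-like object whose is_dir() only calls os.stat() when st_mode is None. LazyStat and its stat_calls counter are hypothetical and not taken from the PEP thread; the counter exists purely to show when the fallback fires.

```python
import os
import stat
import tempfile


class LazyStat:
    """Hypothetical stat-like object whose is_dir() stats only when needed."""

    def __init__(self, path, st_mode=None):
        self.path = path
        self.st_mode = st_mode
        self.stat_calls = 0  # for demonstration only

    def is_dir(self):
        if self.st_mode is None:
            # Field unknown: fetch it once and remember the result.
            self.st_mode = os.stat(self.path).st_mode
            self.stat_calls += 1
        return stat.S_ISDIR(self.st_mode)


demo_dir = tempfile.mkdtemp()

known = LazyStat(demo_dir, stat.S_IFDIR)  # type already supplied by the OS
unknown = LazyStat(demo_dir)              # type not supplied

print(known.is_dir(), known.stat_calls)
print(unknown.is_dir(), unknown.stat_calls)
```

One consequence worth noting: for the fallback to work, the object has to know the full path of the entry, which the plain (name, st) tuple does not carry by itself.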
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev