On 07/08/2014 06:08 PM, Ben Hoyt wrote:

Just like an attribute does not imply a system call, having a
method named 'is_dir' /does/ imply a system call, and not
having one can be just as misleading.

Why does a method imply a system call? os.path.join() and str.lower()
don't make system calls. Isn't it just a matter of clear
documentation? Anyway -- less philosophical discussion below.

In this case because the names are exactly the same as the os versions which 
/do/ make a system call.


I presume you're suggesting that is_dir/is_file/is_symlink should be
regular attributes, and accessing them should never do a system call.
But what if the system doesn't support d_type (eg: Solaris) or the
d_type value is DT_UNKNOWN (can happen on Linux, OS X, BSD)? The
options are:

So if I'm finally understanding the root problem here:

  - listdir returns a list of strings, one for each filename and one for
    each directory, and keeps no other O/S supplied info.

  - os.walk, which uses listdir, then needs to go back to the O/S and
    refetch the thrown-away information

  - so it's slow.

The solution:

  - have scandir /not/ throw away the O/S supplied info

and the new problem:

  - not all O/Ses provide the same (or any) extra info about the
    directory entries

Have I got that right?

If so, I still like the attribute idea better (surprise!); we just need to revisit the 'ensure_lstat' (or whatever it's called) parameter. Instead of a true/false value, it could have a scale:

  - 0 = whatever the O/S gives us

  - 1 = at least the is_dir/is_file (whatever the other normal one is),
        and if the O/S doesn't give it to us for free then call lstat

  - 2 = we want it all -- call lstat if necessary on this platform

After all, the programmer should know up front how much of the extra info will be needed for the work at hand.
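
To make the scale concrete, here is a minimal sketch of the three levels, written on top of os.scandir() as it later shipped in Python 3.5+ (where is_dir and friends are methods, not attributes). The Entry tuple, the scan() name, and the ensure_lstat argument are assumptions made for illustration; none of them is a real os API. The shipped DirEntry does not expose whether the O/S supplied type info for free, so level 0 here simply fills in nothing.

```python
import os
from collections import namedtuple

# Hypothetical record type; the field names are illustrative only.
Entry = namedtuple("Entry", "name path is_dir is_file is_symlink lstat_result")


def scan(path, ensure_lstat=0):
    """Yield Entry tuples carrying as much info as the requested level needs.

    0 = names and paths only (stand-in for "whatever the O/S gives us")
    1 = also the is_dir/is_file/is_symlink flags
    2 = also the full lstat result
    """
    with os.scandir(path) as it:
        for de in it:
            is_dir = is_file = is_symlink = st = None
            if ensure_lstat >= 1:
                # These may stat the entry internally when the O/S
                # returned no usable type information.
                is_dir = de.is_dir(follow_symlinks=False)
                is_file = de.is_file(follow_symlinks=False)
                is_symlink = de.is_symlink()
            if ensure_lstat >= 2:
                # follow_symlinks=False gives lstat semantics.
                st = de.stat(follow_symlinks=False)
            yield Entry(de.name, de.path, is_dir, is_file, is_symlink, st)
```

With this shape the caller states the required level once, and every field on the yielded entries is then a plain attribute that never triggers a further system call.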


We have a choice before us, a fork in the road. :-) We can choose one
of these options for the scandir API:

1) The current PEP 471 approach. This solves the issue with d_type
being missing or DT_UNKNOWN, it doesn't require onerror, and it's a
really tidy API that doesn't explode with AttributeErrors if you write
code on Windows (without thinking too hard) and then move to Linux. I
think all of these points are important -- the cross-platform one not
the least, because we want to make it easy, even *trivial*, for people
to write cross-platform code.

Yes, but we don't want a function that sucks equally on all platforms.  ;)


2) Nick Coghlan's model of only fetching the lstat value if
ensure_lstat=True, and including an onerror callback for error
handling when scandir calls lstat internally. However, as described,
we'd also need an ensure_type=True option, so that scandir() isn't way
slower than listdir() if you actually don't want the is_X values and
d_type is missing/unknown.

With the multi-level version of 'ensure_lstat' we do not need an extra 
'ensure_type'.

For reference, here's what get_tree_size() looks like with this approach, not 
including error handling with onerror:

  def get_tree_size(path):
      total = 0
      for entry in os.scandir(path, ensure_lstat=1):
          if entry.is_dir:
              total += get_tree_size(entry.full_name)
          else:
              total += entry.lstat_result.st_size
      return total

And if we added the onerror here it would be a line fragment, as opposed to the extra four lines (at least) for the try/except in the first example (which I cut).
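
For comparison, a rough sketch of the same function with an onerror callback, again written against os.scandir() as it later shipped in Python 3.5+ rather than the API proposed in this thread. The onerror parameter on get_tree_size is an assumption for illustration; the real os.scandir() takes no such argument, so the errors are caught here and forwarded by hand. With no callback, errors are skipped silently, which mirrors the default behaviour of os.walk.

```python
import os


def get_tree_size(path, onerror=None):
    """Return the total size in bytes of the files under path.

    onerror, if given, is called with each OSError raised while
    scanning or statting; the offending entry is then skipped.
    """
    total = 0
    try:
        entries = list(os.scandir(path))
    except OSError as exc:
        if onerror is not None:
            onerror(exc)
        return 0
    for entry in entries:
        try:
            if entry.is_dir(follow_symlinks=False):
                total += get_tree_size(entry.path, onerror)
            else:
                # follow_symlinks=False gives lstat semantics.
                total += entry.stat(follow_symlinks=False).st_size
        except OSError as exc:
            if onerror is not None:
                onerror(exc)
    return total
```

A caller that wants to collect failures can pass a list's append method, for example get_tree_size(path, onerror=errors.append).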


Finally:

Thank you for writing scandir, and this PEP.  Excellent work.

Oh, and +1 for option 2, slightly modified.  :)

--
~Ethan~
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev