[issue38764] Deterministic globbing.

Nathaniel Smith Sun, 10 Nov 2019 20:27:14 -0800

Nathaniel Smith <n...@pobox.com> added the comment:

> I saw the article as well, but think auto-sorting would have just provided a 
> thin mask over their serious data pipeline bugs.


This seems like an inappropriately elitist attitude. I'm sure their code has 
bugs; so does mine and yours. But in fact they did test their code thoroughly, 
demonstrated that it produced correct results on their system, and it would 
have produced correct results for their downstream users too if Python's 'glob' 
had behaved consistently across platforms. Python isn't just for folks who can 
"git gud" to meet your arbitrary standards.

In addition, auto-sorting has a number of ergonomic benefits, beyond this one 
case: it makes output more deterministic, makes it easy to estimate progress 
given just a list of filenames being processed ("hmm, this script has been 
running for ten minutes and it's only on the Cs, I guess this will take a few 
hours"), and so on.

> Also, this isn't the only pathway to seeing file lists:  os.listdir, fnmatch, 
> etc.

os.listdir is a genuinely low-level OS-level tool, as indicated by the name, 
versus 'glob' which is designed as a convenient interface for high-level 
scripting.

fnmatch doesn't deal with either files or lists, so I'm not sure why you're 
bringing it up here.

> The reasons for the previous rejections still apply.

The reasons I see in the previous issues are:

- Sorting adds overhead
- It's hard to implement, esp. if iglob uses scandir internally
- iglob can't sort, so glob should be consistent
- it's easy to sort after the fact

But these are all false.

- The overhead added by sorting is negligible. In some informal benchmarks I 
measured it as ~4% extra CPU time in the worst case, if you're (a) running on 
Linux with it's ultra-optimized VFS caching, (b) all the directories are 
already cached in RAM, (c) you're not actually doing anything with the files 
except calling glob() and then discarding the result. Of course, it's always 
possible my benchmarks missed an important case; if you have benchmarks that 
show otherwise please post them so we can discuss.

- The implementation is trivial, as shown from the PR – literally just 
replacing two calls to 'list' with calls to 'sorted'.

- iglob being incremental doesn't actually pose any obstacles to sorting. Even 
if it's using scandir(), it still has to load each individual directory list 
into memory in one batch, to avoid the risk of leaking file descriptors.

- It's actually not trivial to sort filenames in the natural order afterwards, 
because you have to split each filename into components. It's also 
asymptotically slower than doing the sort as-you-go, O(total-files log 
total-files), versus O(total-files log maximum-individual-directory).

Given that there's at least one extremely clear case where not-sorting caused 
substantial harm, that there are lots of other minor cases where it's a nice 
win, and that the costs are negligible, sorting seems like the obviously 
correct default.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue38764>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue38764] Deterministic globbing.

Reply via email to