On Tuesday, October 25, 2016 at 6:26:17 PM UTC-4, Nathaniel Smith wrote:
>
> On Sat, Oct 22, 2016 at 9:02 AM, Nick Coghlan <ncog...@gmail.com> wrote:
> > On 20 October 2016 at 07:02, Nathaniel Smith <n...@pobox.com> wrote:
> >> The first change is to replace the outer for loop with a while/pop
> >> loop, so that if an exception occurs we'll know which iterables remain
> >> to be processed:
> >>
> >>     def chain(*iterables):
> >>         # *iterables arrives as a tuple, so make a poppable list first
> >>         iterables = list(iterables)
> >>         try:
> >>             while iterables:
> >>                 for element in iterables.pop(0):
> >>                     yield element
> >>         ...
> >>
> >> Now, what do we do if an exception does occur? We need to call
> >> iterclose on all of the remaining iterables, but the tricky bit is
> >> that this might itself raise new exceptions. If this happens, we don't
> >> want to abort early; instead, we want to continue until we've closed
> >> all the iterables, and then raise a chained exception. Basically what
> >> we want is:
> >>
> >>     def chain(*iterables):
> >>         iterables = list(iterables)
> >>         try:
> >>             while iterables:
> >>                 for element in iterables.pop(0):
> >>                     yield element
> >>         finally:
> >>             try:
> >>                 operator.iterclose(iter(iterables[0]))
> >>             finally:
> >>                 try:
> >>                     operator.iterclose(iter(iterables[1]))
> >>                 finally:
> >>                     try:
> >>                         operator.iterclose(iter(iterables[2]))
> >>                     finally:
> >>                         ...
> >>
> >> but of course that's not valid syntax. Fortunately, it's not too hard
> >> to rewrite that into real Python -- but it's a little dense:
> >>
> >>     def chain(*iterables):
> >>         iterables = list(iterables)
> >>         try:
> >>             while iterables:
> >>                 for element in iterables.pop(0):
> >>                     yield element
> >>         # This is equivalent to the nested-finally chain above:
> >>         except BaseException as last_exc:
> >>             for iterable in iterables:
> >>                 try:
> >>                     operator.iterclose(iter(iterable))
> >>                 except BaseException as new_exc:
> >>                     if new_exc.__context__ is None:
> >>                         new_exc.__context__ = last_exc
> >>                     last_exc = new_exc
> >>             raise last_exc
> >>
> >> It's probably worth wrapping that bottom part into an iterclose_all()
> >> helper, since the pattern probably occurs in other cases as well.
> >> (Actually, now that I think about it, the map() example in the text
> >> should be doing this instead of what it's currently doing... I'll fix
> >> that.)
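> >>
> >> A sketch of what that helper could look like -- still assuming the
> >> hypothetical operator.iterclose() from this proposal, so this is
> >> illustrative rather than runnable today:
> >>
> >>     def iterclose_all(iterables):
> >>         # Close every iterable, continuing past failures and
> >>         # chaining any exceptions raised along the way.
> >>         last_exc = None
> >>         for iterable in iterables:
> >>             try:
> >>                 operator.iterclose(iter(iterable))
> >>             except BaseException as new_exc:
> >>                 if new_exc.__context__ is None:
> >>                     new_exc.__context__ = last_exc
> >>                 last_exc = new_exc
> >>         if last_exc is not None:
> >>             raise last_exc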
>
> > At this point your code is starting to look a whole lot like the code
> > in contextlib.ExitStack.__exit__ :)
>
> One of the versions I tried but didn't include in my email used
> ExitStack :-). It turns out not to work here: the problem is that we
> effectively need to enter *all* the contexts before unwinding, even if
> trying to enter one of them fails. ExitStack is nested like (try (try
> (try ... finally) finally) finally), and we need (try finally (try
> finally (try finally ...))). But this is just a small side-point
> anyway, since most code is not implementing complicated
> meta-iterators; I'll address your real proposal below.
>
> > Accordingly, I'm going to suggest that while I agree the problem you
> > describe is one that genuinely emerges in large production
> > applications and other complex systems, this particular solution is
> > simply far too intrusive to be accepted as a language change for
> > Python -- you're talking about a fundamental change to the meaning of
> > iteration for the sake of the relatively small portion of the
> > community that either works on such complex services, or insists on
> > writing their code as if it might become part of such a service even
> > when it currently isn't. Given that simple applications vastly
> > outnumber complex ones, and always will, I think making such a change
> > would be a bad trade-off that didn't come close to justifying the
> > costs imposed on the rest of the ecosystem to adjust to it.
> >
> > A potentially more fruitful direction of research to pursue for 3.7
> > would be the notion of "frame local resources", where each Python
> > level execution frame implicitly provided a lazily instantiated
> > ExitStack instance (or an equivalent) for resource management.
> > Assuming that it offered an "enter_frame_context" function that mapped
> > to "contextlib.ExitStack.enter_context", such a system would let us do
> > things like:
>
> So basically a 'with expression', that gives up the block syntax --
> taking its scope from the current function instead -- in return for
> being usable in expression context? That's really interesting, and I
> see the intuition that it might be less disruptive if our implicit
> iterclose calls are scoped to the function rather than the 'for' loop.
>
> But having thought about it and investigated some... I don't think
> function-scoping addresses my problem, and I don't see evidence that
> it's meaningfully less disruptive to existing code.
>
> First, "my problem":
>
> Obviously, Python's a language that should be usable for folks doing
> one-off scripts, and for paranoid folks trying to write robust complex
> systems, and for everyone in between -- these are all really important
> constituencies. And unfortunately, there is a trade-off here, where
> the changes we're discussing affect these constituencies differently.
> But it's not just a matter of shifting around a fixed amount of pain;
> the *quality* of the pain really changes under the different
> proposals.
>
> In the status quo:
>
> - for one-off scripts: you can just let the GC worry about generator
> and file handle cleanup, re-use iterators, whatever, it's cool
>
> - for robust systems: because it's the *caller's* responsibility to
> ensure that iterators are cleaned up, you... kinda can't really use
> generators without -- pick one -- (a) draconian style guides (like
> forbidding 'with' inside generators or forbidding bare 'for' loops
> entirely), (b) lots of auditing (every time you write a 'for' loop, go
> read the source of the generator you're iterating over -- no
> modularity for you, and let's hope the answer doesn't change!), or (c)
> introducing really subtle bugs. Or all of the above. It's true that a
> lot of the time you can ignore this problem and get away with it one
> way or another, but if you're trying to write robust code then this
> doesn't really help -- it's like saying the footgun only has 1 bullet
> in the chamber. Not as reassuring as you'd think. It's as if every
> time you called a function, you had to explicitly say whether you
> wanted exception handling to be enabled inside that function, and if
> you forgot then the interpreter might just skip the 'finally' blocks
> while unwinding. There just *isn't* a good solution available.
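>
> (For concreteness, the standard way to meet that caller-side
> responsibility today is contextlib.closing, which works on generators
> because they have a close() method. A sketch -- read_lines is just a
> stand-in generator here, not anyone's real API:
>
>     from contextlib import closing
>
>     def read_lines(path):
>         with open(path) as f:
>             for line in f:
>                 yield line
>
>     def total_length(path):
>         # closing() guarantees the generator -- and the file handle
>         # inside it -- is cleaned up even if we exit the loop early
>         # or an exception is raised mid-iteration.
>         with closing(read_lines(path)) as lines:
>             return sum(len(line) for line in lines)
>
> The pain is that you have to remember this dance for every single
> 'for' loop over a generator.)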
>
> In my proposal (for-scoped-iterclose):
>
> - for robust systems: life is great -- you're still stopping to think
> a little about cleanup every time you use an iterator (because that's
> what it means to write robust code!), but since the iterators now know
> when they need cleanup and regular 'for' loops know how to invoke it,
> 99% of the time (i.e., whenever you don't intend to re-use an
> iterator) you can be confident that just writing 'for' will do exactly
> the right thing, and the other 1% of the time (when you do want to
> re-use an iterator), you already *know* you're doing something clever.
> So the cognitive overhead on each for-loop is really low.
>
> - for one-off scripts: ~99% of the time (actual measurement, see
> below) everything just works, except maybe a little bit better. 1% of
> the time, you deploy the clever trick of re-using an iterator with
> multiple for loops, and it breaks, so this is some pain. Here's what
> you see:
>
>     gen_obj = ...
>     for first_line in gen_obj:
>         break
>     for lines in gen_obj:
>         ...
>
>     Traceback (most recent call last):
>       File "/tmp/foo.py", line 5, in <module>
>         for lines in gen_obj:
>     AlreadyClosedIteratorError: this iterator was already closed,
>     possibly by a previous 'for' loop. (Maybe you want
>     itertools.preserve?)
>
> (We could even have a PYTHONDEBUG flag that, when enabled, makes that
> error message include the file:line of the previous 'for' loop that
> called __iterclose__.)
>
> So this is pain! But the pain is (a) rare, not pervasive, (b)
> immediately obvious (an exception -- the code doesn't work at all), not
> subtle and delayed, (c) easily googleable, and (d) easy to fix, and the
> fix is reliable. It's a totally different type of pain than the pain
> that we currently impose on folks who want to write robust code.
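>
> (To make the itertools.preserve hint concrete: it would just be a thin
> wrapper whose __iterclose__ is a no-op, so that a 'for' loop can't
> close the underlying iterator out from under you. A sketch, with the
> caveat that both itertools.preserve and __iterclose__ are hypothetical
> pieces of this proposal, not existing APIs:
>
>     class preserve:
>         def __init__(self, iterable):
>             self._it = iter(iterable)
>         def __iter__(self):
>             return self
>         def __next__(self):
>             return next(self._it)
>         def __iterclose__(self):
>             # Deliberately a no-op: the caller keeps responsibility
>             # for eventually closing self._it.
>             pass
>
> With that, 'for first_line in preserve(gen_obj): break' would leave
> gen_obj open for a later loop.)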
>
> Now compare to the new proposal (function-scoped-iterclose):
>
> - For those who want robust cleanup: Usually, I only need an iterator
> for as long as I'm iterating over it; that may or may not correspond
> to the end of the function (often it won't). When these don't
> coincide, it can cause problems. E.g., consider the original example
> from my proposal:
>
>     def read_newline_separated_json(path):
>         with open(path) as f:
>             for line in f:
>                 yield json.loads(line)
>
> but now suppose that I'm a Data Scientist (tm), so instead of having 1
> file full of newline-separated JSON, I have 100 gigabytes' worth of
> the stuff stored in lots of files in a directory tree. Well, that's no
> problem, I'll just wrap that generator:
>
>     def read_newline_separated_json_tree(tree):
>         for root, _, paths in os.walk(tree):
>             for path in paths:
>                 for document in read_newline_separated_json(join(root, path)):
>                     yield document
>
> And then I'll run it on PyPy, because that's what you do when you have
> 100 GB of string processing, and... it'll crash, because the call to
> read_newline_separated_json_tree ends up doing thousands of calls to
> read_newline_separated_json, but never cleans any of them up until
> the function exits, so eventually we run out of file descriptors.

I still don't understand why you can't write it like this:

    def read_newline_separated_json_tree(tree):
        for root, _, paths in os.walk(tree):
            for path in paths:
                with closing(read_newline_separated_json(join(root, path))) as iterable:
                    yield from iterable

Zero extra lines. Works today. Does everything you want.

> A similar situation arises in the main loop of something like an HTTP
> server:
>
>     while True:
>         request = read_request(sock)
>         for response_chunk in application_handler(request):
>             send_response_chunk(sock, response_chunk)

Same thing:

    while True:
        request = read_request(sock)
        with closing(application_handler(request)) as iterable:
            for response_chunk in iterable:
                send_response_chunk(sock, response_chunk)

I'll stop posting about this, but I don't see the motivation behind
this proposal except replacing one explicit context-management line
with a hidden "line" of cognitive overhead. I think the solution is to
stop returning an iterable when you have state needing cleanup.
Instead, return a context manager and force the caller to open it to
get at the iterable.
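A sketch of that last shape, for concreteness -- the function name is
made up, and only stdlib contextlib is assumed:

    import json
    from contextlib import contextmanager

    @contextmanager
    def open_newline_separated_json(path):
        # The context manager owns the file handle; the caller gets an
        # iterator of parsed documents that is only valid inside the
        # with-block.
        with open(path) as f:
            yield (json.loads(line) for line in f)

The caller then has to write 'with open_newline_separated_json(path)
as documents:' before iterating, so the resource's lifetime is always
explicit.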
Best,
Neil

> Here we'll accumulate arbitrary numbers of un-closed
> application_handler generators attached to the stack frame, which is
> no good at all. And this has the interesting failure mode that you'll
> probably miss it in testing, because most clients will only re-use a
> connection a small number of times.
>
> So what this means is that every time I write a for loop, I can't just
> do a quick "am I going to break out of the for-loop and then re-use
> this iterator?" check -- I have to stop and think about whether this
> for-loop is nested inside some other loop, etc. And, again, if I get
> it wrong, then it's a subtle bug that will bite me later. It's true
> that with the status quo, we need to wrap X% of for-loops with 'with'
> blocks, and with this proposal that number would drop to, I don't
> know, (X/5)% or something. But that's not the most important cost: the
> most important cost is the cognitive overhead of figuring out which
> for-loops need the special treatment, and in this proposal that
> checking is actually *more* complicated than in the status quo.
>
> - For those who just want to write a quick script and not think about
> it: here's a script that does repeated partial for-loops over a
> generator object:
>
> https://github.com/python/cpython/blob/553a84c4c9d6476518e2319acda6ba29b8588cb4/Tools/scripts/gprof2html.py#L40-L79
>
> (and note that the generator object even has an ineffective 'with
> open(...)' block inside it!)
>
> With function-scoped-iterclose, this script would continue to work
> as it does now. Excellent.
>
> But, suppose that I decide that that main() function is really
> complicated and that it would be better to refactor some of those
> loops out into helper functions. (Probably actually true in this
> example.) So I do that and... suddenly the code breaks. And in a
> rather confusing way, because it has to do with a complicated
> long-distance interaction between two different 'for' loops *and*
> where they're placed with respect to the original function versus the
> helper function.
>
> If I were an intermediate-level Python student (and I'm pretty sure
> anyone who is starting to get clever with re-using iterators counts as
> "intermediate level"), I'd actually prefer the immediate, obvious
> feedback from the for-scoped-iterclose. This would be a good time to
> teach folks about this aspect of resource handling -- it's certainly
> an important thing to learn eventually on your way to Python mastery,
> even if it isn't needed for every script.
>
> In the pypy-dev thread about this proposal, there are some very
> distressed emails from someone who's been writing Python for a long
> time but only just realized that generator cleanup relies on the
> garbage collector:
>
> https://mail.python.org/pipermail/pypy-dev/2016-October/014709.html
> https://mail.python.org/pipermail/pypy-dev/2016-October/014720.html
>
> It's unpleasant to have the rug pulled out from under you like this
> and suddenly realize that you might have to go re-evaluate all the
> code you've ever written, and making for loops safe-by-default and
> fail-fast-when-unsafe avoids that.
>
> Anyway, in summary: function-scoped-iterclose doesn't seem to
> accomplish my goal of getting rid of the *type* of pain involved when
> you have to run a background thread in your brain that's doing
> constant paranoid checking every time you write a for loop. Instead it
> arguably takes that type of pain and spreads it around both the
> experts and the novices :-/.
>
> -------------
>
> Now, let's look at some evidence about how disruptive the two
> proposals are for real code:
>
> As mentioned else-thread, I wrote a stupid little CPython hack [1] to
> report when the same iterator object gets passed to multiple 'for'
> loops, and ran the CPython and Django test suites with it [2]. Looking
> just at generator objects [3], across these two large codebases there
> are exactly 4 places where this happens. (Rough idea of prevalence:
> these 4 places together account for a total of 8 'for' loops; this is
> out of 11,503 'for' loops overall, of which 665 involve generator
> objects.) The 4 places are:
>
> 1) CPython's Lib/test/test_collections.py:1135,
> Lib/_collections_abc.py:378
>
> This appears to be a bug in the CPython test suite -- the little MySet
> class does 'def __init__(self, itr): self.contents = itr', which
> assumes that itr is a container that can be repeatedly iterated. But a
> bunch of the methods on collections.abc.Set like to pass in a
> generator object here instead, which breaks everything. If repeated
> 'for' loops on generators raised an error then this bug would have
> been caught much sooner. (A minimal reproduction follows after this
> list.)
>
> 2) CPython's Tools/scripts/gprof2html.py lines 45, 54, 59, 75
>
> Discussed above -- as written, for-scoped-iterclose would break this
> script, but function-scoped-iterclose would not, so here
> function-scoped-iterclose wins.
>
> 3) Django django/utils/regex_helper.py:236
>
> This code is very similar to the previous example in its general
> outline, except that the 'for' loops *have* been factored out into
> utility functions. So in this case for-scoped-iterclose and
> function-scoped-iterclose are equally disruptive.
>
> 4) CPython's Lib/test/test_generators.py:723
>
> I have to admit I cannot figure out what this code is doing, besides
> showing off :-). But the different 'for' loops are in different stack
> frames, so I'm pretty sure that for-scoped-iterclose and
> function-scoped-iterclose would be equally disruptive.
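>
> (The promised reproduction of bug (1) -- MySet here is a stripped-down
> stand-in for the class in the test suite, not the actual code:
>
>     class MySet:
>         def __init__(self, itr):
>             self.contents = itr   # assumes itr can be iterated repeatedly
>         def __iter__(self):
>             return iter(self.contents)
>
>     s = MySet(x for x in range(3))
>     for x in s:
>         print(x)    # 0, 1, 2
>     for x in s:
>         print(x)    # today: prints nothing (generator exhausted), so
>                     # the bug is silent; under for-scoped-iterclose
>                     # this second loop would raise instead.)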
>
> Obviously there's a bias here in that these are still relatively
> "serious" libraries; I don't have a big corpus of one-off scripts that
> are just a big __main__, though gprof2html.py isn't far from that. (If
> anyone knows where to find such a thing, let me know...) But still, the
> tally here is that out of 4 examples, we have 1 subtle bug that
> iterclose might have caught, 2 cases where for-scoped-iterclose and
> function-scoped-iterclose are equally disruptive, and only 1 where
> function-scoped-iterclose is less disruptive -- and in that case it's
> arguably just avoiding an obvious error now in favor of a more
> confusing error later.
>
> If this reduced the backwards-incompatible cases by a factor of, like,
> 10x or 100x, then that would be a pretty strong argument in its favor.
> But it seems to be more like... 1.5x.
>
> -n
>
> [1] https://github.com/njsmith/cpython/commit/2b9d60e1c1b89f0f1ac30cbf0a5dceee835142c2
>
> [2] CPython: revision b0a272709b from the github mirror; Django:
> revision 90c3b11e87
>
> [3] I also looked at "all iterators" and "all iterators with .close
> methods", but this email is long enough... basically the pattern is
> the same: there are another 13 'for' loops that involve repeated
> iteration over non-generator objects, and they're roughly equally
> split between spurious effects due to bugs in the CPython test suite
> or my instrumentation, cases where for-scoped-iterclose and
> function-scoped-iterclose both cause the same problems, and cases
> where function-scoped-iterclose is less disruptive.
>
> --
> Nathaniel J. Smith -- https://vorpus.org