[Python-ideas] Documenting iterators vs. iterables [was: Adding slice Iterator ...]

Stephen J. Turnbull Thu, 14 May 2020 20:22:27 -0700

Andrew Barnert writes:

 > And I’m pretty sure that’s exactly the confusion that led you to
 > think that dict_keys have weird behavior,


That wasn't me ....  I'm here to discuss documentation, not dict or
sequence views. ;-)  Changing the subject field to match.

 > Students often want to know why this doesn’t work:
 > 
 >     with open("file") as f:
 >         for line in file:
 >             do_stuff(line)
 >         for line in file:
 >             do_other_stuff(line)

Sure.  *Some* students do.  I've never gotten that question from mine,
though I do occasionally see

    with open("file") as f:
        for line in f:        # ;-)
            do_stuff(line)
    with open("file") as f:
        for line in f:
            do_other_stuff(line)

I don't know, maybe they asked the student next to them. :-)

 > The answer is that files are iterators, while lists are… well,
 > there is no word.

As Chris B said, sure there are words:  File objects are *already*
iterators, while lists are *not*.  My question is, "why isn't that
instructive?"
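
A minimal sketch of that distinction. It uses io.StringIO as a stand-in for an opened text file (an assumption, made only so the example runs without touching the filesystem); the variable names are illustrative.

```python
import io

# Stand-in for an opened text file; a real file object behaves the same way.
f = io.StringIO("spam\neggs\n")

first_pass = [line for line in f]
second_pass = [line for line in f]    # same iterator, already exhausted

lines = ["spam\n", "eggs\n"]
list_first = [line for line in lines]
list_second = [line for line in lines]  # 'for' gets a fresh iterator each time

print(first_pass, second_pass)
print(iter(f) is f)          # the file object is its own iterator
print(iter(lines) is lines)  # the list is not
```

The second pass over the file object comes back empty, while the list can be looped over as often as you like.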

 > We shouldn’t define everything up front, just the most important
 > things. But this is one of the most important things. People need
 > to understand this distinction very early on to use Python,

No, they don't.  They neither understand, nor (to a large extent) do
they *need* to.

We cannot solve the problem of "lazy in the technical sense"
programming by improving Python.  It's a matter of optimizing
programmer effort.  If cargo culting and asking on Stack Overflow and
bitching on Twitter or your personal blog when software doesn't DWIM
is psychologically (and frequently time management-ly) cheaper than
learning How Things Work, that's what people are going to do.  I can't
tell them they're wrong (except my own students, and they mostly
ignore me until they run out of options other than listening to me :-).

ISTM that all we need to say is that

1.  An *iterator* is a Python object whose only necessary function is
    to return an object when next is applied to it.  Its purpose is to
    keep track of "next" for *for*.  (It might do other useful things
    for the user, eg, file objects.)

2.  The *for* statement and the *next* builtin require an iterator
    object to work.  Since for *always* needs an iterator object, it
    implicitly converts the "in" object to an iterator by applying
    iter to it.  (Technical note: for the convenience of implementors
    of 'for', when iter is applied to an iterator, it always returns
    the iterator itself.)

3.  When a "generic" iterator "runs out", it's exhausted, it's truly
    done.  It is no longer useful, and there's nothing you can do but
    throw it away.  Generic iterators do not have a reset method.
    Specialized iterators may provide one, but most do not.

4.  Objects that can be converted to iterators are *iterables*.
    Trivially, iterators are iterable (see technical note supra).

5.  Most Python objects are not iterators, but many can be converted.
    However, some Python objects are constructed as iterators because
    they want to be "lazy".  Examples are files (so that a huge file
    can be processed line by line without reading the whole thing into
    memory) and "generators" which yield a new item each time next is
    applied to them.

But AFAIK we *do* say that, and it doesn't get through.
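
For concreteness, the five points above can be sketched in a few lines. This is only an illustration; the names (numbers, squares, and so on) are made up for the example.

```python
numbers = [10, 20]            # an iterable, but not an iterator (point 5)

try:
    next(numbers)             # next refuses anything that is not an iterator
    list_is_iterator = True
except TypeError:
    list_is_iterator = False

it = iter(numbers)            # point 4: a list can be converted to an iterator
first = next(it)              # point 1: next returns an object
same_object = iter(it) is it  # point 2, technical note: iter(iterator) is itself
second = next(it)

try:
    next(it)                  # point 3: nothing left, the iterator is done
    exhausted = False
except StopIteration:
    exhausted = True

leftovers = list(it)          # still exhausted, and there is no reset


def squares(n):
    """Generator function: its result is a lazy iterator (point 5)."""
    for i in range(n):
        yield i * i


lazy = squares(3)
lazy_is_iterator = iter(lazy) is lazy
lazy_values = list(lazy)
```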

 > I can teach a child why a glass will break permanently when you hit
 > it while a lake won’t by using the words “solid” and “liquid”.

Terrible example, since a glass is just a geologically slow liquid. ;-)

Back to the discussion: the child can touch both, and does so
frequently (assuming you don't feed them from the dog's bowl and also
bathe them regularly).  They've seen glasses break, most likely, and
splashed water.

Iterators have one overriding purpose: to be fed to *for* statements,
be exhausted, and then discarded.  This is so important that it's done
implicitly and in every single *for* statement.  We have the necessary
word, "iterator," but students don't have the necessary experience of
"touching" the iterator that *for* actually iterates over instead of
the list that is explicit in the *for* statement.  That iterator is
created implicitly and becomes garbage as soon as the *for* statement ends.
And there's no way for the student to touch it, it doesn't have a
name!
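
One way to let students "touch" that hidden iterator is to spell out roughly what *for* does and give the iterator a name. The helper for_each below is hypothetical, written only for this illustration; it approximates the *for* statement and ignores details such as else clauses.

```python
def for_each(iterable, body):
    """Roughly what `for item in iterable: body(item)` does."""
    hidden = iter(iterable)      # the iterator 'for' creates implicitly
    while True:
        try:
            item = next(hidden)
        except StopIteration:    # the iterator has run out: leave the loop
            break
        body(item)
    return hidden                # a real 'for' would simply drop it here


seen = []
hidden = for_each(["a", "b", "c"], seen.append)

print(seen)
print(type(hidden).__name__)     # the iterator is a separate object from the list
print(list(hidden))              # and it is exhausted once the loop is done
```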

If you want to fix nomenclature, don't call them "files," don't call
them "file objects," call them "file iterators".  Then students have
an everyday iterator they can touch.  I'll guarantee that causes other
problems, though, and gets a ton of resistance.  Even from me. :-)

 > Yes, and defining terminology for the one distinction that almost
 > always is relevant helps distinguish that distinction from the
 > other ones that rarely come up. Most people (especially novices)
 > don’t often need to think about the distinction between iterables
 > that are sized and also containers vs. those that are not both
 > sized and containers, so the word for that doesn’t buy us much. But
 > the distinction between iterators and things-like-list-and-so-on
 > comes up earlier, and a lot more often, so a word for that would
 > buy us a lot more.

We have that word and distinction.  A file object *is* an iterator.  A
list is *not* an iterator.  *for* works *with* iterators internally,
and *on* iterables through the magic of __iter__.
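
A small sketch of that __iter__ "magic", using two hypothetical classes (Countdown and CountdownIterator are invented for this example): the iterable hands out a fresh iterator on every call, while the iterator returns itself.

```python
class Countdown:
    """An iterable: each call to __iter__ builds a new iterator."""

    def __init__(self, start):
        self.start = start

    def __iter__(self):
        return CountdownIterator(self.start)


class CountdownIterator:
    """An iterator: keeps track of "next" and returns itself from __iter__."""

    def __init__(self, current):
        self.current = current

    def __iter__(self):
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value


countdown = Countdown(3)
once = list(countdown)
twice = list(countdown)     # works again: a new iterator was created

it = iter(countdown)
first = list(it)
again = list(it)            # empty: the same iterator, already used up
```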

 > > But you *don't* use seek(0) on files (which are not iterators, and in
 > > fact don't actually exist inside of Python, only names for them do).
 > > You use them on opened *file objects* which are iterators.
 > 
 > A file object is a file, in the same way that a list object is a
 > list and an int object is an int.

No, it's not the same: your level of abstraction is so high that
you've lost sight of the iterable/iterator distinction.  All of the
latter objects own their own data in a way that a file object does
not.  All of the latter objects are different from their iterators
(where such iterators exist), while the file object is not.

 > The fact that we use “file” ambiguously for a bunch of related but
 > contradictory abstractions (a stream that you can read or write, a
 > directory entry, the thing an inode points to, a document that an
 > app is working on, …) makes it a bit more confusing, but
 > unfortunately that ambiguity is forced on people before they even
 > get to their first attempt at programming, so it’s probably too
 > late for Python to help (or hurt).

Agreed.  I would be much happier if we could discuss an example that
is *not* iterating over files but *does* come up every day on
StackOverflow.  Maybe zips would work but I'm not sure the motivation
comes together the way it does for files (why do zips want to be lazy?
what are the compelling examples for zip of "restarting the iteration
where you left off" with a new *for* statement?)

 > >  When you open a file again, by default you get a new iterator
 > > which begins at the beginning, as you want for those others.
 > >  My point is that none of the other types you mention are iterators.
 > 
 > I don’t get what you’re driving at here.

Simply that we have the necessary distinction already: iterators vs.
everything else.  IMO the problem is that the students have zero or
very little experience of iterators other than files, and so think of
file objects as weird iterables, rather than as iterators.

 > Lists, sets, ranges, dict_keys, etc. are not iterators. You can
 > write `for x in xs:` over and over and get the values over and
 > over. Because each time, you get a new iterator over their values.

You and I know that, because we know what an iterator is, and we know
it's there because it has to be: *for* doesn't iterate anything but an
iterator.  But (except via a bytecode-level debugger) nobody has ever
seen that iterator.  You can use iter to get a similar iterator, of
course, but it's not the same object that any for statement ever
used.  (Unless you explicitly created it with iter, but then you can't
re-run the for statement on it the way you do with a list.)

 > > The difference with files is just that they happen to exist in
 > > Python as iterables.  But after
 > 
 > _What_ exists in Python as iterables?

Lists, tuples, sets, dicts, and other containers.

 > Files, maps, zips, generators, etc. are not like that. They’re
 > iterators. If you write `for x in xs:` twice, you get nothing the
 > second time, because each time you’re using the same iterator, and
 > you’ve already used it up. Because iter(xs) is xs when it’s a file
 > or generator etc.

Genexps are iterators, but generators (in the sense of the product of
a def that contains "yield") are not even iterable.  Those are
iterator factories.
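
To make the terminology concrete, here is a short sketch with a made-up generator function, count_up. (The official glossary calls the function itself a "generator" and the object it returns a "generator iterator".)

```python
def count_up(n):
    """A def containing yield: calling it returns a new generator iterator."""
    for i in range(n):
        yield i


try:
    iter(count_up)               # the function itself is not iterable
    function_is_iterable = True
except TypeError:
    function_is_iterable = False

gen = count_up(3)                # the product of calling the factory
genexp = (i * 2 for i in range(3))

gen_is_own_iterator = iter(gen) is gen
genexp_is_own_iterator = iter(genexp) is genexp

# Each call to the factory yields an independent iterator.
fresh = (list(count_up(3)), list(count_up(3)))
used_up = (list(gen), list(gen))
```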

 > The only representation of files in Python is file objects—the
 > thing you get back from open (or socket.makefile or io.StringIO or
 > whatever else)—and those are iterators.

The thought occurred to me, "What if that was a bad decision?  Maybe
in principle files shouldn't be iterators, but rather iterables with a
real __iter__ that creates the iterator."  I realized that I'd already
answered my own question in part: I find it easy to imagine cases
where I'd want to get some lines of input from a file as a
higher-level unit, then stop and do some processing.  The killer app
for me is mbox files.  Another plausible case is reading top-level
Lisp expressions from a file (although that doesn't necessarily divide
neatly into lines.)  I also found it surprisingly complicated to think
about the consequences to the type of making that change.
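
A rough sketch of the "read some lines, stop, then continue" pattern. It is not real mbox parsing: the message text is invented, and io.StringIO stands in for an opened file so the example is self-contained.

```python
import io

# Stand-in for an opened file holding one simplified message.
message = io.StringIO(
    "From: Alice\n"
    "Subject: lunch\n"
    "\n"
    "Noon works for me.\n"
    "See you there.\n"
)

# First loop: consume header lines up to the blank separator line.
headers = {}
for line in message:
    if line == "\n":
        break
    name, _, value = line.partition(":")
    headers[name] = value.strip()

# Second loop: the same iterator resumes right after the blank line.
body = [line.rstrip("\n") for line in message]

print(headers)
print(body)
```

If the file object were a list-like iterable instead, the second loop would start again from the first header line.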

Going back to the documentation theme, maybe one way to approach
explaining iterators is to start with the use case of files as
(non-seekable) streams, show how 'for iteration' can be "restarted"
where you left off in the file, and teach that "this is the canonical
behavior of iterators; lists etc. are *iterable* because 'for'
automatically converts them to iterators behind the scenes."

If sockets or pipes were more familiar to beginning programmers, they
might be better examples, but I think that files-as-streams might be
the most familiar and approachable, though real files are far more
flexible than just unseekable streams.

I'll try to take a look at the "official" tutorials and language
documentation "sometime soon" and see if maybe this idea could be
applied to improve them.

Steve
