[Python-ideas] Re: [Suspected Spam]Re: Adding slice Iterator to Sequences (was: islice with actual slices)

Andrew Barnert via Python-ideas Wed, 13 May 2020 19:54:41 -0700

On May 13, 2020, at 12:40, Christopher Barker <[email protected]> wrote:

I hope you don’t mind, but I’m going to take your reply out of order to get the 
most important stuff first, in case anyone else is still reading. :)

>> Back to the Sequence View idea, I need to write this up properly, but I'm 
>> thinking something like:
> 
> (using a concrete example of list)
> 
> list.view is a read-only property that returns an indexable object.
> indexing that object with a slice returns a list_view object
> 
> a_view = list.view[a:b:c]
> 
> a_view is a list_view object
> 
> a list_view object is an immutable sequence. indexing it returns elements from 
> the original list.

Can we just say that it returns an immutable sequence that blah blah, without 
defining or naming the type of that sequence?

Python doesn’t define the types of most things you never construct directly. 
(Sometimes there is a public name for it buried away in the types module, but 
it’s not mentioned anywhere else.) Even the dict view objects, which need a 
whole docs section to describe them, never say what type they are.

And I think this is intentional. For example, nowhere does it say what type 
function.__get__ returns, only what behavior that object has—and that allowed 
Python 3 to get rid of unbound methods, because a function already has the 
right behavior. And nobody even notices that list and tuple use the same type 
for their __iter__ in some Python implementations but not others. Similarly, I 
think dict.__iter__() used to return a different type from 
dict.keys().__iter__() in CPython but now they share a type, and that didn’t 
break any backward compatibility guarantees.
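
To make the unbound-method point concrete, here is a small sketch, assuming 
Python 3. The Greeter class is made up for illustration; the only thing the 
code relies on is the documented behavior of attribute lookup, not the name of 
any type.

```python
class Greeter:
    def hello(self):
        return "hello"

# Looking the function up on the class goes through function.__get__,
# and in Python 3 what comes back is the plain function itself.
plain = Greeter.__dict__["hello"]
assert Greeter.hello is plain

# Looking it up on an instance gives something that behaves like a bound
# method: calling it supplies self, and it remembers the original function.
bound = Greeter().hello
assert bound() == "hello"
assert bound.__func__ is plain
```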

And it seems there’s no reason you couldn’t use the same generic sequence view 
type on all sequences, but also it’s possible that a custom one for list and 
tuple might allow some optimization (and even more likely so for range, 
although it may be less important). So if you don’t specify the type, that can 
be left up to each version of each implementation to decide.

> slicing a list view returns ???? I'm not sure what here -- it should probably 
> be a copy, so a new list_view object referencing the same list? That will 
> need to be thought out carefully)

Good question. I suppose there are three choices: (1) a list (or, in general, 
whatever the original object returns from slicing), (2) a new view of the same 
list, or (3) a view of the view of the list.

I think I agree with you here that (2) is the best option. In other words, 
lst.view[2::2][1::3] gives you the exact same thing as lst.view[4::6].

At first that sounds weird because if you can inspect the attributes of the 
view object, there’s no way to see that you did a [1::3] anywhere.

But that’s exactly the same thing that happens with, e.g., 
range(100)[2::2][1::3]. You just get range(4, 100, 6), and there’s no way to 
see that you did a [1::3] anywhere.
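
A quick check of that arithmetic, using only range and list. The variable 
names are arbitrary.

```python
r = range(100)[2::2][1::3]

# Slicing a slice of a range collapses into a single range.
assert r == range(4, 100, 6)

# The same indexes fall out of slicing an ordinary list twice,
# which is the behavior option (2) would give a view without the copies.
lst = list(range(100))
assert lst[2::2][1::3] == lst[4::6]
```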

And the same is true for memoryview, and for numpy arrays and bintrees tree 
slices—despite them being radically different things in lots of other ways, 
they all made the same choice here. And even beyond Python, it’s what slicing a 
slice view does in Swift (even though other kinds of views of views don’t 
“flatten out” like this, slice views of slice views do), and in Go. (Although 
C++20 is a counterexample here.)

> calling .view on a list_view is another trick -- does it reference the host 
> view? or go straight back to the original sequence?

I think it’s the same answer again. In fact, I think .view on any slice view 
should just return self.

Think about it: whether you decided that lst.view[2::2][1::3] gives 
lst.view[4::6] or a nested view-of-a-view-of-a-list, it would be confusing if 
lst.view[2::2].view[1::3] gave you the other one, and what other options would 
make sense? And, unless there’s some other behavior besides slicing on view 
properties, if self.view slices the same as self, it might as well just be self.
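
Here is a rough sketch of how those choices could fit together in pure Python. 
SliceView is a hypothetical name, not a proposed type, and the sketch assumes 
the underlying sequence doesn't change length after the view is made. It lets 
a range over the underlying indexes do the slice arithmetic, so slicing a view 
gives a new flat view of the same sequence, and .view just returns self.

```python
from collections.abc import Sequence

class SliceView(Sequence):
    """Hypothetical read-only view of part of a sequence (illustration only)."""

    def __init__(self, seq, indexes=None):
        self._seq = seq
        # A range of indexes into seq; assumes len(seq) stays fixed.
        self._indexes = range(len(seq)) if indexes is None else indexes

    def __len__(self):
        return len(self._indexes)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Option (2): a new view of the same sequence, not a view of a view.
            return SliceView(self._seq, self._indexes[index])
        return self._seq[self._indexes[index]]

    @property
    def view(self):
        return self

lst = list(range(20))
v = SliceView(lst)[2::2][1::3]
assert list(v) == lst[4::6]
assert v.view is v

lst[4] = "changed"          # the view reads through to the original list
assert v[0] == "changed"
```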

> iter(a_list_view) returns a list_viewiterator.

Here, it seems even more useful to leave the type unspecified. For list (and 
tuple) in CPython, I’m not sure if you can get away with using the special 
list_iterator type used by list and tuple (which accesses the underlying array 
directly), or, if not that, the PySeqIter type used for old-style 
iter-by-indexing, but if you can, it would be both simpler and more efficient. 
And similarly, range.view might be able to use the range_iterator type. Or, if 
you can’t do that, a generic PyIter around tp_next would be less efficient than 
a custom type, but again simpler, and the efficiency might not matter. Or, if 
you just had a single sequence view type rather than custom ones for each 
sequence type, that would obviously mean a single iterator type. And so on. 
That all seems like quality-of-implementation stuff that should be left open to 
whatever turns out to be best.

> iterating that gets you items from the "host" "on the fly".
> 
> All this is a fair bit more complicated than my original idea -- which was to 
> not have a full view, but simply an iterator you can get from slice notation. 
> 
> But it would also open up a world of possibilities!

Yes, in the same way that range (and 2.x xrange) is more complicated but more 
useful than a hypothetical irange and 3.x dict.keys() (and 2.7 dict.viewkeys()) 
is more complicated but more useful than 2.6 dict.iterkeys(). I think it’s 
worth it, but it is a trade off.

Now onto the stuff that probably nobody else cares about:

> It took me a good while to "get" the distinction between an iterator and an 
> iterable, and I still misuse those terms sometimes.
> 
> Maybe because iterable is an awkward word (that my spell checker doesn't 
> recognize)?

My spellchecker is happy with Iterable with a capital I (because it’s seen me 
type so much Python code?) but complains about iterable with a lowercase i. Or 
just autocorrects it—sometimes to capital-I Iterable, sometimes to utterable. 
(Which I wouldn’t think is a word that comes up often enough in anyone’s usage 
to be a common autocorrect target. Maybe unutterable, but even then only if 
you’re talking about Lovecraftian horror or religious mysticism.)

> But it's also because there is a clear definition for "Iterator" in Python, 
> but the term is used a bit more generally in vague CS nomenclature.

Yes. And in different languages, too. In C++, iterators are an abstraction of 
pointers; in OCaml they’re an abstraction of HOFs like map; worst of all, Swift 
built everything around these three concepts they call “sequence”, “iterator”, 
and “generator”, clearly aimed at getting the best of both worlds from Python 
and C++, but all of those concepts mean the wrong thing if you’re coming from 
either language, and then they changed things between 1.0 and 2.0 just in case 
anyone wasn’t confused yet.

> The other confusion is that an iterable is not an iterator, but iterators 
> are, in fact, iterables (i.e. you can call iter() on them).

Yes. Which is essential to a lot of things about Python’s design, but not 
essential to the concept at an abstract CS level.
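
In code, the distinction looks something like this (a minimal sketch using a 
plain list):

```python
lst = [1, 2, 3]
it = iter(lst)

# An iterator is iterable: iter() on it just hands back the same object.
assert iter(it) is it

# A list is iterable but not an iterator: each iter() call is a fresh start.
assert iter(lst) is not iter(lst)

# That is why a partly consumed iterator picks up where it left off.
next(it)
assert list(it) == [2, 3]
assert list(lst) == [1, 2, 3]
```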

> I think this is mostly the result of the "for loop" protocol pre-dating the 
> iteration protocol, and wanting to have the same nifty way to iterate 
> everything. That is -- we want to be able to use iterators in for loops, and 
> not have to call iter() in anything before using a for loop. But in fact, I 
> think this is a nice convenience, and maybe one that would be kept in a new 
> language anyway -- it's really handy that you can do A LOT without knowing 
> about iter() and next() and StopIteration, while those tools are still there 
> when needed.

I’m not sure about that. There are at least two ways to design a language that 
doesn’t need both concepts, and both have been tried, even if nobody’s been 
quite successful yet.

The first is the C++ way: just put iterators front and center and make people 
call iter (or, in their case, begin and end) all over the place. This is pretty 
easy to understand, and it has some nice advantages (like being able to loop 
over C strings and arrays without wrapping them). It’s just not actually usable 
in everyday code unless you start layering a bunch of stuff on top of it, at 
which point you’ve only avoided the concept of “iterable” by making people 
learn the concept of “implicitly convertible to iterator range” instead.

The second is the Swift way (I’m going to use Python terms rather than Swift 
ones here to avoid confusion): hide iterators as much as possible. (Java and C# 
are also gradually moving in this direction, but have a lot more legacy 
weighing them down.) In Swift, you can’t loop over iterators, or pass them to 
functions like map—and that’s fine, because functions like map don’t return 
iterators, they return views. The only place you ever see an iterator in the 
wild is inside the implementation of a handful of functions like map and zip 
that really do need to munge iterators manually, and many people will never 
even read, much less write, such a function. If you do happen to get an 
iterator somehow and want to use it as an iterable, you have to wrap it in a 
trivial view object that delegates to it, but this almost never comes up. 
Sadly, this makes it so much harder to write your stdlib that Apple took three 
tries (after going public) before they got it right.

Some day, someone probably will design a language that doesn’t require most 
people to learn both concepts and is actually usable. Until then, I’m happy 
we’ve got Python. :)

> Bringing this back to the original topic:
> 
> I suppose we *could* have a "file_view" object that acted like the list you 
> get from readlines(), but actually called seek() on the underlying file to 
> give you the lines lazily one at a time. That would be, shall we say, 
> problematic, performance wise, but it could be done.

I remember learning that the way to do this was the nifty new linecache module. 
Nobody seems to teach that anymore in the 3.x days, but it’s still there, and 
works as expected for Unicode text and everything.
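
For anyone who hasn't seen it, a minimal sketch of linecache in use. The 
temporary file and its contents are made up for the example; getline and 
clearcache are the real module functions.

```python
import linecache
import os
import tempfile

# Write a small throwaway file to read from.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("first\nsecond\nthird\n")
    path = tmp.name

# Line numbers start at 1; the file is read and cached on first use.
line = linecache.getline(path, 2)
assert line == "second\n"

# Asking for a line past the end returns an empty string, not an error.
assert linecache.getline(path, 99) == ""

linecache.clearcache()
os.remove(path)
```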

But for something more general, you probably wouldn’t want to bother with a 
special file view. You can very easily write a generic view that takes _any_ 
iterator and looks like a sequence, pulling and caching the elements on demand. 
At a certain point, a lot of people think they want this, then you show them 
how easy it is to build that, and they think it’s cool—but they never use it 
again. Caching indices instead of the actual lines seems like a nice 
optimization, but you’d need a specific use case where the time cost is worth 
the space savings, and if nobody even uses the generic version, nobody needs to 
optimize it, right? :)
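
For the record, the generic version is roughly this. It is only a sketch: 
CachedSeq is a made-up name, and it handles non-negative integer indexes and 
nothing else (no slices, no len).

```python
from itertools import islice

class CachedSeq:
    """Hypothetical wrapper that makes any iterator indexable, caching as it goes."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._cache = []

    def _fill(self, n):
        # Pull items until the cache holds at least n of them (or we run out).
        self._cache.extend(islice(self._it, max(0, n - len(self._cache))))

    def __getitem__(self, index):
        if index < 0:
            raise IndexError("negative indexes would need the whole iterator")
        self._fill(index + 1)
        return self._cache[index]   # IndexError if the iterator ran out

    def __iter__(self):
        i = 0
        while True:
            try:
                yield self[i]
            except IndexError:
                return
            i += 1

squares = CachedSeq(x * x for x in range(10))
assert squares[3] == 9      # pulls the first four items from the generator
assert squares[1] == 1      # served from the cache, nothing new is pulled
assert list(squares) == list(squares) == [x * x for x in range(10)]
```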

And now on to the stuff that maybe you don’t even care about:

>> On Wed, May 13, 2020 at 10:52 AM Andrew Barnert <[email protected]> wrote:
>>> On May 12, 2020, at 23:29, Stephen J. Turnbull 
>>> <[email protected]> wrote:
>>> >>>> A lot of people get this confused. I think the problem is that we
>>> >>>> don’t have a word for “iterable that’s not an iterator”,
>> 
>> isn't that simply an "Iterable" -- as above, yes, all iterators are 
>> iterables, but when we speak of iterables specifically, we are usually 
>> referring to the ones that are not iterators.
> 
> No, we really aren’t. Iterators being iterable is not just a weird quirk that 
> rarely comes up; it’s essential to things you do every day.
> 
> The everyday concept behind “iterable” is “something you can use in a for 
> loop”. (You don’t have to get into the technical “something you can call iter 
> on and get an iterator” that often—but when you do, it’s easy to work out 
> that they’re identical concepts anyway.)
> 
> The main thing you do with generator expressions, zips, etc. is not call next 
> and check StopIteration, it’s stick them in a for loop (or generator 
> expressions or map or whatever), exactly the same way you use lists and sets 
> and ranges. So if you think of the word “iterable” in a way that doesn’t 
> include generators and zips and so on, you’re just going to confuse yourself.
> 
> > It *is* the distinction I'm making with the word "explicit".  I never
>> > use "next" on an open file.
> 
> nor do I, but there was a conversation on this list a while back, with folks 
> saying that they DID do that.

This is your mail agent being a pain again. You’re the one who said that, I 
quoted you saying it, and now you’re agreeing with yourself. Can we pass a law 
that anyone who’s worked on any of the major current mail clients is not 
allowed to work in software anymore? I think that would benefit the world more 
than any change we can make to Python…

Personally, I actually do next files. For example:

    import csv

    with open(path) as f:
        next(f)  # skip the first line of the 2-line header
        for row in csv.DictReader(f):
            ...  # handle each row, keyed by the second header line

Of course I could have used f.readline() just as well, and I’ve seen as many 
people do the same thing with readline as with next. It just seems a little 
more unusual to ignore the result of readline than to ignore the result of 
next, so when writing it, next feels more natural.

> > Students often want to know why this doesn’t work:
>> 
>>     with open("file") as f:
>>         for line in f:
>>             do_stuff(line)
>>         for line in f:
>>             do_other_stuff(line)
>> 
>> … when this works fine:
>> 
>>     with open("file") as f:
>>         lines = f.readlines()
>>     for line in lines:
>>         do_stuff(line)
>>     for line in lines:
>>         do_other_stuff(line)
>> 
>> This question (or a variation on it) gets asked by novices every few days 
>> on StackOverflow; it’s one of the top common duplicates.
>> 
>> The answer is that files are iterators, while lists are… well, there is no 
>> word.
> 
> yes, there is -- they are "lists" :-) -- but if you want to be more general, 
> they are Sequences.

But that’s the wrong generalization. Because sets also work the same way, and 
they aren’t Sequences. Nor are dict views, or many of the other kinds of 
things-that-can-be-iterated-over-and-over-independently.

Plus, this just confuses what Sequences are about. Sequence is a dead simple 
concept: if seq[0] makes sense, it’s a sequence; if not, it isn’t.

(Sure, there’s other stuff crammed in there, like being reversible and 
in-testable and index-searchable, but all of that stuff is stuff you can 
obviously and trivially build on top of indexing, so you don’t need to think 
about it. And there’s the subtlety that 0 is a perfectly cromulent dict key, 
which unfortunately you do sometimes need to think about, but most of the time 
you don’t. For the most part, Sequence means you can index it.)
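
A small sketch of what "built on top of indexing" means in practice, using 
collections.abc.Sequence. The Squares class is invented for the example and 
only supports non-negative integer indexes.

```python
from collections.abc import Sequence

class Squares(Sequence):
    """Made-up example: the squares of 0..n-1, computed on demand."""

    def __init__(self, n):
        self._n = n

    def __len__(self):
        return self._n

    def __getitem__(self, index):
        if not 0 <= index < self._n:
            raise IndexError(index)
        return index * index

sq = Squares(5)
assert sq[3] == 9

# Everything below comes from the ABC's mixin methods, via indexing and len.
assert 16 in sq
assert sq.index(9) == 3
assert list(reversed(sq)) == [16, 9, 4, 1, 0]

# The dict subtlety: indexing with 0 works, but a dict is not a Sequence.
d = {0: "zero"}
assert d[0] == "zero"
assert not isinstance(d, Sequence)
```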

> Or heck, simply say that readlines() reads the whole file at once into a 
> list, and the file object has nothing to do with it anymore. Whereas looping 
> through the lines in a for loop is getting the lines one by one from the file 
> object, so once you've gotten them all, there are no more.
> 
> Which doesn't require me talking about iterators or iterables, or iter() or 
> next()

Sure, which is great right up until they ask the same question about why they 
can’t iterate twice over a map or zip. (Which is another very common novice dup 
on StackOverflow. It’s especially sad when they made a commendable start at 
debugging things on their own by writing `for pair in pairs: print(pair)`, 
which instead of rewarding them just made the problem even worse.)

Or why they _can_ iterate twice over a range, even though a range clearly isn’t 
building a whole list in advance. (Especially when they read in some blog that 
range used to return a list but now it doesn’t. Especially if the person 
writing that blog misused the word “iterator” in the same way you did earlier, 
which many of them do.)
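
Both situations fit in a few lines. This is just an illustration; the names 
are arbitrary.

```python
pairs = zip("ab", [1, 2])

# The well-meant debugging loop uses up the zip...
seen = [pair for pair in pairs]
assert seen == [("a", 1), ("b", 2)]

# ...so the loop that was supposed to do the real work gets nothing.
assert [pair for pair in pairs] == []

# A range builds no list, yet it can be looped over as often as you like,
# because it is not an iterator: each loop gets a fresh one.
r = range(3)
assert list(r) == list(r) == [0, 1, 2]
assert iter(r) is not r
assert iter(pairs) is pairs
```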

>> You can explain it anyway. In fact, you _have_ to give an explanation with 
>> analogies and examples and so on, and that would be true even if there were 
>> a word for what lists are. But it would be easier to explain if there were 
>> such a word, and if you could link that word to something in the glossary, 
>> and a chapter in the tutorial.
> 
> Still not sure why "Sequence" doesn't work here? Granted, there *are* some 
> "iterables that aren't iterators" that aren't Sequences (like dict views), 
> but they are Iterable Containers, and I think you can talk about them as 
> "views" well enough.

Again, surely you don’t want to tell people that sets, dicts, dict views, etc. 
are Sequences.

And if you say, “well, they aren’t Sequences but they are Containers”, that 
isn’t very helpful—a Container is a thing that supports “in”, which does happen 
to be true for those types, but it isn’t relevant, so that’s just confusing.

The word “view” _is_ great for things-like-dict-keys. That’s why I started off 
this thread asking for a view instead of an iterator, which I thought would be 
immediately clear. Unfortunately, it isn’t, or we wouldn’t even be having this 
discussion.

> Though now that I've written that, maybe we Should have "Iterable" and 
> "Iterator" as ABCs.

We already do. And Iterator is a subclass of Iterable, just as it should be.

We don’t have an ABC for iterables that give you a new iterator over their 
contents, that doesn’t use up those contents, every time you iterate them. But 
that’s not surprising given that we don’t have a word for it. ABCs are named 
based on either a protocol that already had a name (like Sequence or Coroutine 
or Rational) or a single method (like Reversible and Hashable), not the other 
way around. (The only exception I can think of is the ones in io, but they just 
prove the point—nobody talks about BufferedIOBases as a concept like Sequences 
or Coroutines, and on the very rare occasions where I need to type-check one, I 
have to go read the docs to see what I’m supposed to check and what it means to 
do so.)
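
The two ABCs that do exist are easy to check, and the closest thing to a test 
for the unnamed concept is "Iterable but not Iterator". The helper below is 
hypothetical, not anything in the stdlib, and it is only a heuristic: nothing 
stops a non-iterator from using up its contents when iterated.

```python
from collections.abc import Iterable, Iterator

assert issubclass(Iterator, Iterable)

def is_reiterable(obj):
    """Hypothetical helper: iterable, but not itself an iterator."""
    return isinstance(obj, Iterable) and not isinstance(obj, Iterator)

# Lists, sets, ranges and dict views all pass...
assert is_reiterable([1, 2])
assert is_reiterable({1, 2})
assert is_reiterable(range(3))
assert is_reiterable({}.keys())

# ...while generators, zips and other iterators do not.
assert not is_reiterable(x for x in [1, 2])
assert not is_reiterable(zip([1], [2]))
assert not is_reiterable(iter([1, 2]))
```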

>>  But the distinction between iterators and things-like-list-and-so-on comes 
>> up earlier, and a lot more often, so a word for that would buy us a lot more.
> 
> And "iterable" doesn't work?

No, it doesn’t. You can’t use “iterable” to mean things like lists and sets but 
not generators and files, because iterators are every bit as iterable.

This would be like saying you can just use “animal” for things like dogs and 
people but not frogs and birds, or “number” for things like 1/4 and -3/17 but 
not e and pi, or “Christian” for people like Lutherans and Methodists but not 
Catholics and Orthodox, etc. We have words for concepts like “mammal” and 
“rational” and “Protestant”, because you can’t just say “animal” and “number” 
and “Christian” or you’re being confusing.

_______________________________________________
Python-ideas mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/[email protected]/message/45IXAQ3O7YCFVNH4SSNMURLG737YNOBB/
Code of Conduct: http://python.org/psf/codeofconduct/
