[Python-ideas] Proposal: Complex comprehensions containing statements

Alex Hall Thu, 20 Feb 2020 23:59:15 -0800

This is a proposal for a new syntax where a comprehension is written as the 
appropriate brackets containing a loop which can contain arbitrary statements.


Here are some simple examples. Instead of:

    [
        f(x)
        for y in z
        for x in y
        if g(x)
    ]

one may write:

    [
        for y in z:
            for x in y:
                if g(x):
                    f(x)
    ]

Instead of:

    lst = []
    for x in y:
        if cond(x):
            break
        z = f(x)
        lst.append(z * 2)

one may write:

    lst = [
        for x in y:
            if cond(x):
                break
            z = f(x)
            yield z * 2
    ]
 
Instead of:

    [
        {k: v for k, v in foo}
        for foo in bar
    ]

one may write:

    [
        for foo in bar:
            {for k, v in foo: k: v}
    ]

## Specification

A list/set/dict comprehension or generator expression is written as the 
appropriate brackets containing a `for` or `while` loop.

In the general case, some expressions are prefixed with `yield`, and those 
become the values of the comprehension, as in a generator function.

If the comprehension contains exactly one expression statement at any level of 
nesting, i.e. if there is only one place where a `yield` can be placed at the 
start of a statement, then `yield` is not required and the expression is 
implicitly yielded. In particular this means that any existing comprehension 
translated into the new style doesn't require `yield`.

If the comprehension doesn't contain exactly one expression statement and 
doesn't contain a `yield`, it's a SyntaxError.

### Dictionary comprehensions

For dictionary comprehensions, a `key: value` pair is allowed as its own 
pseudo-statement or in a yield. It's not a real expression and cannot appear 
inside other expressions.

This can potentially be confused with variable type annotations with no 
assigned value, e.g. `x: int`. But we can essentially apply the same rule as 
other comprehensions: either use `yield`, or only have one place where a 
`yield` could be added in front of a statement. So if there is only one pair 
`x: y` we try to implicitly yield that. The only way this could be 
misinterpreted is if a user annotated exactly one name and completely forgot 
to give the comprehension any elements, in which case the program would almost 
certainly fail spectacularly.
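To make the two readings of `x: y` concrete, here is a sketch in current Python. The first part shows what I would expect a yielded `k: v` pair to correspond to (a key/value tuple fed to `dict`), and the second part shows that a bare annotation is already a valid statement that binds nothing. The helper names `_pairs` and `annotated` and the sample data are assumptions for illustration only.

```python
def _pairs(foo):
    # Each `k: v` pseudo-statement would correspond to one (key, value) pair.
    for k, v in foo:
        yield k, v


bar = [[("a", 1), ("b", 2)], [("c", 3)]]
# Current-syntax equivalent of the nested dict example.
result = [dict(_pairs(foo)) for foo in bar]


def annotated():
    x: int  # a bare annotation: valid today, but it does not bind x
    return "x" in locals()


print(result)
print(annotated())
```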

### Whitespace

If placing the loop on a single line would be valid syntax outside a 
comprehension (i.e. it just contains a simple statement) then we call this an 
*inline* comprehension. It can be inserted in the same line(s) as other code 
and formatted however the writer likes - there are no concerns about whitespace.

For a more complex comprehension, the loop must start and end with a newline, 
i.e. the lines containing the loop cannot contain any tokens from outside, 
including the enclosing brackets. For example, this is allowed:

    foo = [
        for x in y:
            if x > 0:
                f(x)
    ]

but this is not:

    foo = [for x in y:
               if x > 0:
                   f(x)]

This ensures that code is readable even at a quick glance. The eyes can quickly 
find where the loop starts and distinguish the embedded statements from the 
rest of the enclosing expression.

Furthermore, it's easy to copy paste entire lines to move them around, whereas 
refactoring the invalid example above without specific tools would be annoying 
and error-prone. It also makes it easy to adjust code outside the comprehension 
(e.g. rename `foo` to something longer) without messing up indentation and 
alignment.

Inside the loop, the rules for indentation and such are the same as anywhere 
else. The syntax of the loop is valid only if it's also valid as a normal loop 
outside any expression. The body of the loop must be more indented than the 
for/while keyword that starts the loop.

### Variable scope

Since comprehensions look like normal loops they should maybe behave like them 
again, including executing in the same scope and 'leaking' the iteration 
variable(s). Assignments via the walrus operator already affect the outer 
scope; only the iteration variable currently behaves differently. My 
understanding is that this is influenced by the fact that there is little 
reason to use the value of the iteration variable after a list comprehension 
completes since it will always be the last value in the iterable. But since the 
new syntax allows `break`, the value may become useful again.

I don't know what the right approach is here and I imagine it can generate 
plenty of debate. Given that this whole proposal is already controversial and 
likely to be rejected, this may not be the best place to start that discussion. But 
maybe it is, I don't know.
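For reference, this is how scoping behaves today, as far as I understand it. The sketch uses a throwaway function `scope_demo` (a name made up here) and assumes no global called `item` exists.

```python
def scope_demo():
    for i in range(3):
        pass
    # A normal loop leaks its variable: i is still bound here.

    squares = [item * item for item in range(3)]
    try:
        item  # the comprehension's iteration variable is not visible here
        item_leaked = True
    except NameError:
        item_leaked = False

    # The walrus target is bound in the enclosing function scope.
    doubled = [(last := k) * 2 for k in range(3)]
    return i, item_leaked, last, squares, doubled


print(scope_demo())
```

With `break` available inside a comprehension, a leaked iteration variable would tell you where the loop stopped, which is the case discussed above.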

## Benefits/comparison to current methods

### Uniform syntax

The new comprehensions just look like normal loops in brackets, or generator 
functions. This should make them easier for beginners to learn than the old 
comprehensions.

A particular concept that's easier to learn is comprehensions that contain 
multiple loops. Consider this comprehension over a nested list:

    [
        f(cell)
        for row in matrix
        for cell in row
    ]

For beginners this can easily be confusing, [and sometimes for experienced coders too](https://mail.python.org/archives/list/python-ideas@python.org/message/BX7LWUS57M52EPJMIR6A3SDQYSN7UCEX/). 
Yes there's a rule that one can learn, but putting it in reverse also seems 
logical, perhaps even more so:

    [
        f(cell)
        for cell in row
        for row in matrix
    ]

Now the comprehension is 'consistently backwards': it reads more like English, 
and the usage of `cell` is right next to its definition. But of course that 
order is wrong...unless we want a nested list comprehension that produces a new 
nested list:

    [
        [
            f(cell)
            for cell in row
        ]
        for row in matrix
    ]

Again, it's not hard for an experienced coder to understand this, but for a 
beginner grappling with new concepts this is not great. Now consider how the 
same two comprehensions would be written in the new syntax:

    [
        for row in matrix:
            for cell in row:
                f(cell)
    ]
    
    [
        for row in matrix:
            [
                for cell in row:
                    f(cell)
            ]
    ]
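A quick check of the ordering rule in current Python may help here. The sketch below shows that the flattening comprehension matches the nested loops written in the same order, and that the 'reversed' order fails because `row` is looked up before it is bound. `f` and the sample matrix are placeholders, and the last function assumes there is no global named `row`.

```python
def f(cell):
    # Hypothetical per-cell function.
    return cell * 10


def flatten_with_comprehension(matrix):
    return [f(cell) for row in matrix for cell in row]


def flatten_with_loops(matrix):
    result = []
    for row in matrix:
        for cell in row:
            result.append(f(cell))
    return result


def reversed_order_fails(matrix):
    # The first iterable (`row`) is evaluated before any loop variable exists.
    try:
        [f(cell) for cell in row for row in matrix]
    except NameError:
        return True
    return False


matrix = [[1, 2], [3, 4]]
print(flatten_with_comprehension(matrix))
print(reversed_order_fails(matrix))
```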

### Power and flexibility

Comprehensions are great and I love using them. I want to be able to use them 
more often. I know I can solve any problem with a loop, but it's obvious that 
comprehensions are much nicer or we wouldn't need to have them at all. Compare 
this code:

    new_matrix = []
    for row in matrix:
        new_row = []
        for cell in row:
            try:
                new_row.append(f(cell))
            except ValueError:
                new_row.append(0)
        new_matrix.append(new_row)

with the solution using the new syntax:

    new_matrix = [
        for row in matrix: [
            for cell in row:
                try:
                    yield f(cell)
                except ValueError:
                    yield 0
        ]
    ]
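For comparison, the closest I can get with today's comprehensions is to move the `try`/`except` into a helper function, since it is a statement and cannot appear in an expression. The names `f_or_zero` and the string-to-int `f` below are assumptions made up for this sketch.

```python
def f(cell):
    # Hypothetical conversion that raises ValueError on bad input.
    return int(cell)


def f_or_zero(cell):
    # Helper needed today because try/except cannot go inside a comprehension.
    try:
        return f(cell)
    except ValueError:
        return 0


matrix = [["1", "x"], ["3", "4"]]
new_matrix = [[f_or_zero(cell) for cell in row] for row in matrix]
print(new_matrix)
```

This works, but the error handling now lives away from the comprehension that needs it.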

It's immediately visually obvious that it's building a new nested list, there's 
much less syntax for me to parse, and the variable `new_row` has gone from 
appearing 4 times to 0!

There have been many requests to add some special syntax to comprehensions to 
make them a bit more powerful:

- [Is this PEP-able? "with" statement inside genexps / list comprehensions](https://mail.python.org/archives/list/python-ideas@python.org/thread/BUD46OEPBN6YW43HPPEG3P3IFDOG6KMV/#O3U3V4Q4I2GOGVFCFH67TZ355WE7XKTD)
- [Allowing breaks in generator expressions by overloading the while keyword](https://mail.python.org/archives/list/python-ideas@python.org/thread/6PEOE5ZXHQHAINEPQ7PTKSWYFW5OFMPQ/#ETB6ISNSB4KWQQYNMTRVJMZF4AWYCXV5)
- [while conditional in list comprehension ??](https://mail.python.org/archives/list/python-ideas@python.org/thread/RYBBHV3YBBEIBUZPZ4WNQGKI76VSBWI5/#A36BJCUAGUBZA7FIQ3LN6UMZUYCL2LJG)

This would solve all such problems neatly.
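As a point of reference, here is roughly how those requests are handled in current Python, as I understand the usual workarounds: `itertools.takewhile` stands in for `break`, and a separate generator function is needed to use `with`. The sample data and the names `stripped_lines` and `open_stream` are made up for this sketch.

```python
import io
from itertools import takewhile

numbers = [1, 2, 3, 10, 4]
# Stop at the first value that is not below 5, like a `break` in a loop.
small = [n * 2 for n in takewhile(lambda n: n < 5, numbers)]


def stripped_lines(open_stream):
    # `with` is a statement, so today it needs a generator function.
    with open_stream() as stream:
        for line in stream:
            yield line.strip()


# io.StringIO stands in for a real file so the example is self-contained.
lines = list(stripped_lines(lambda: io.StringIO("a \n b\n")))
print(small)
print(lines)
```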

### No trying to fit things in a single expression

The current syntax can only contain one expression in the body. This 
restriction makes it difficult to solve certain problems elegantly and creates 
an uncomfortable grey area where it's hard to decide between squeezing maybe a 
bit too much into an expression or doing things 'manually'. This can lead to 
analysis paralysis and disagreements between coders and reviewers. For example, 
which of the following is the best?

    clean = [
        line.strip()
        for line in lines
        if line.strip()
    ]
    
    stripped = [line.strip() for line in lines]
    clean = [line for line in stripped if line]
    
    clean = list(filter(None, map(str.strip, lines)))
    
    clean = []
    for line in lines:
        line = line.strip()
        if line:
            clean.append(line)
    
    def clean_lines():
        for line in lines:
            line = line.strip()
            if line:
                yield line
    
    clean = list(clean_lines())

You probably have a favourite, but it's very subjective and this kind of 
problem requires judgement depending on the situation. For example, I'd choose 
the first version in this case, but a different version if I had to worry about 
duplicating something more complex or expensive than `.strip()`. And again, 
there's an awkward middle ground where it's hard to decide whether I care enough 
about the duplication.

What about assignment expressions? We could do this:

    clean = [
        stripped
        for line in lines
        if (stripped := line.strip())
    ]

Like the nested loops, this is tricky to parse without experience. The 
execution order can be confusing and the variable is used away from where it's 
defined. Even if you like it, there are clearly many who don't. I think the 
fact that assignment expressions were a desired feature despite being so 
controversial is a symptom of this problem. It's the kind of thing that happens 
when we're stuck with the limitations of a single expression.

The solution with the new syntax is:

    clean = [
        for line in lines:
            stripped = line.strip()
            if stripped:
                stripped
    ]

or if you'd like to use an assignment expression:

    clean = [
        for line in lines:
            if stripped := line.strip():
                stripped
    ]

I think both of these look great and are easily better than any of the other 
options. And I think the new syntax would be the clear winner in any similar 
situation - no careful judgement needed. This would become the one (and only one) obvious way 
to do it. The new syntax has the elegance of list comprehensions and the 
flexibility of multiple statements. It's completely scalable and works equally 
well from the simplest comprehension to big complicated constructions.

### Easy to change

I hate it when I've already written a list comprehension but a new requirement 
forces me to change it to, say, the `.append` version. It's a tedious 
refactoring involving brackets, colons, indentation, and moving things around. 
It also leaves me with a very unhelpful `git diff`. With the new syntax I can 
easily add logic as I please and get a nice simple diff.