On Fri, Jun 26, 2020 at 10:10:22PM +1000, Chris Angelico wrote:
> On Fri, Jun 26, 2020 at 7:58 PM Steven D'Aprano <st...@pearwood.info> wrote:
> > Most importantly, it matches the way people think about the task:
> >
> >     # Task: look for duplicates
> >     if element in seen:
> >         # it's a duplicate
> >         ...
> >     else:
> >         # never seen before, so remember it
> >         ...
> 
> Do you do this:
> 
>     if file exists:
>         # read the file
>     else:
>         # create the file

No, because I have learned that this is both unreliable and dangerous. 
But that's how I *think* about the problem, at least initially. I 
expect that if you were prototyping the code on a whiteboard, you 
probably would too.

Checking existence of an element in a set is not comparable to file 
access. File access has a lot more that can go wrong than just whether 
or not the file exists. There are permission errors and transient errors 
that bulletproof code needs to deal with. And file access is always 
concurrent, since the OS is concurrent. (Unless you have the luxury 
of working on a single user, single process OS.)

So long as you know you are dealing with hashable objects, `set.add` 
should always succeed.


> Or would you try/except around an open call? The TOCTOU issue may be
> less obvious with sets, but it's still a reasonable thing to concern
> yourself with.

Unless you are in the [Quadrant of Hell](https://imgur.com/a/jjA52vI) 
you don't even need to think about this for sets. There are no Time 
Of Check to Time Of Use issues worth thinking about in the given 
example, and frankly Chris I doubt that you surround ever set access 
with locks in single-threaded code because TOCTOU is "stronger" (more 
important?) than writing clear and simple code. And if you do, you're 
probably wondering why Python is so slow *wink*

But I'll grant you that I wasn't thinking about threaded code. The 
example given was single-threaded, but maybe a better example would 
involve threading. But that supports my point: thinking concurrently 
doesn't come naturally.

Is this check and insert version of `add` threadsafe? If not, then you 
still need manual locking and this proposal doesn't get you much; if 
anything it's an attractive nuisance because it looks atomic but isn't:

    seen = {}
    assert seen.add(None) is False

This could fail in threaded code if the underlying `add` method 
is not thread-safe:

- your thread calls `add`, which checks that None is not an element;
- between the check and the actual insertion, another thread adds None;
- and your thread redundently adds it a second time and returns False.

I don't know about Python sets, but most Java collections are not 
thread safe unless explicitly documented as such, so I wouldn't be the 
least bit surprised if this could happen.

So `add` would have to be atomic at the C level. If it is already 
atomic, great, that's one objection neutralised. But if not, then will 
making it atomic have a performance impact?

I'm not very interested in slowing down every single `set.add` just so I 
have the opportunity to write less understandable code when I don't need 
to. But if `set.add` is already atomic and hence thread safe, and the 
other issues with this proposal are addressed, then I don't object to 
this proposal. (I reserve the right to change my mind :-)

Just don't tell me that this:

    was_already_there = seen.add(element)
    if was_already_there:
        ...

is a more natural way of thinking about the problem of checking for 
duplicates than:

    if element in seen:
        ...


> > This idiom works with sets, it works with lists, it works with trees, it
> > works with anything that supports membership testing and inserting into
> > a collection. It is the natural way to think about the task.
> 
> So would "ensure this is here, and let me know if you created it".

Are you proposing to add this to lists, deques, dicts, Mappings, 
Sequences and all other collections? I think you'll need a PEP :-)


> > But either way, you also have to decide whether the `add` (or the new
> > method) should *unconditionally* insert the element, or only do so if it
> > wasn't present. This makes a big difference:
> >
> >     seen = {2}
> >     already_there = seen.add(2.0)
> >
> >
> > At this point, is `seen` the set {2} or {2.0}? Justify why one answer is
> > the best answer.
> >
> 
> It should do the same thing that happens already: keep the existing
> one.

I actually thought that the existing behaviour was the opposite, and 
for once I failed to check it before posting. Oops!

But that's not really important. What is important is the question of 
which of these we wish to match:

    # 1
    if element in seen:
        print("duplicate element")
    seen.add(element)

    # 2
    if element in seen:
        print("duplicate element")
    else:
        seen.add(element)

They have different semantics, and I can imagine cases where both would 
be useful. Although I suppose option 1 is moot if sets don't over-write 
the existing element.


-- 
Steven
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/VMWWTZQXJLLHYOODX6UVSKP4D6JPPYT2/
Code of Conduct: http://python.org/psf/codeofconduct/

Reply via email to