On 23Apr2019 0008, Inada Naoki wrote:
On Tue, Apr 23, 2019 at 2:54 PM Steve Dower <steve.do...@python.org> wrote:

On 22Apr2019 2143, Inada Naoki wrote:
On Tue, Apr 23, 2019 at 11:30 AM Steve Dower <steve.do...@python.org> wrote:

Or possibly just "dict(existing_dict).update(new_items)".


Do you mean .update() accepting a tuple of values?
I can't think it's

Not sure what you were going to go on to say here, but why not?

Sorry, I sent mail without finishing.

dict.update() is already heavily overloaded.
Adding a values-tuple form is impossible without breaking backward compatibility.

But I think you're talking about an items sequence, not values.

Right. I'm specifically trying to avoid changing public APIs at all (including adding anything new, if possible) by identifying suitable patterns that we can handle specially to provide a transparent speed improvement.

If it's a key-sharing dict, then all the keys are strings. We know that
when we go to do the update, so we can intern all the strings (going to
do that anyway) and then it's a quick check if it already exists. If
it's a regular dict, then we calculate hashes as normal. Updating the
value is just a decref, incref and assignment.

There are some problems.

1. Searching a hash table is not zero-cost, compared to appending to a sequence.
    This cost is very close to that of building a new hash table.

If we know that you're sharing keys with the new items then we can skip the search. This was my point about the d2 = copy(d1); d2.update(zip(d2, values)) idea:

def update(self, items):
    if isinstance(items, ZipObject): # whatever the type is called
        if are_sharing_keys(self, items.sequence_1):
            # fast update from iter(items.sequence_2)
            return
    # regular update from iter(items)

Totally transparent and encourages composition of existing builtins. It's a bit of a trick and may not be as obvious as a new method, but it's backwards compatible at least as far as ordered dicts (which is a requirement of any of these approaches anyway, yes?)
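
To make the caller side concrete, the composed pattern would look something like this (purely illustrative names and data; the code is plain Python either way, and any speed-up would come from the runtime recognising the zip object):

# d1's keys act as the shared "schema"; the copy keeps the same keys,
# so zip(d2, ...) pairs them with the new values in order.
d1 = {"id": 1, "name": "alice", "email": "a@example.com"}
d2 = dict(d1)
d2.update(zip(d2, (2, "bob", "b@example.com")))   # candidate for the fast path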

2. In my proposal, the main users are csv.DictReader and sql.DictCursor.
    They parse only the values on each row, so they need a way to map them onto the keys.

In that case, use a private helper. The csv module already has a native accelerator module (_csv). We don't need to add new public APIs for internal optimisations, provided there is a semantically equivalent way to do it without the internal API.
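
For reference, the per-row shape inside a DictReader-style consumer is roughly the following (a simplified sketch, not the actual csv implementation), which is exactly the pattern such a helper could target:

import csv, io

# One shared list of fieldnames, one dict built per row of values.
data = io.StringIO("id,name\n1,alice\n2,bob\n")
reader = csv.reader(data)
fieldnames = next(reader)
rows = [dict(zip(fieldnames, values)) for values in reader]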

3. (CPython only) dict.copy(), dict(dict), and dict.update() are general-purpose
    methods.  There is no obvious place to start using a key-sharing dict.

See my reply to Glenn, but potentially fromkeys() could start with the key-sharing dict and then copy()/dict() could continue sharing it (hopefully they already do?).
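
In other words, something along these lines (a hypothetical flow; whether the keys actually stay shared is a CPython implementation detail, not something the code can rely on):

keys = ("id", "name", "email")
proto = dict.fromkeys(keys)    # could start life as a key-sharing dict
row = dict(proto)              # the copy could keep sharing that key table
row.update(zip(row, (1, "alice", "a@example.com")))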

That's why I proposed specific method / function for specific purpose.


Note that .update() would probably require a dict or key/value tuples
here - but if you have the keys in a tuple already then zip() is going
to be good enough for setting it (in fact, zip(existing_dict,
new_values) should be fine, and we can internally special-case that
scenario, too).

If *CPython* specialized dict(zip(dict, values)), it would still be a CPython
implementation detail.
Do you want to recommend using such a hacky CPython-specific optimization?
Should we use such an optimization in the stdlib, even if it will be slower
than dict(zip(keys_tuple, values)) on some other Python implementations?

We do "hacky" optimisations everywhere :) The point of the runtime is to let users write code that works and we do the effort behind the scenes to make it efficient. We're not C - we're here to help our users.

The point is that it will work on other implementations - including previous versions of CPython - and those are free to optimise it however they like.

Or do you propose making the dict(zip(dict, values)) optimization part of the
language specification?

Definitely not! It's just a pattern that we have the ability to recognize and optimize at runtime, so why not do it?

One obvious advantage of having DictBuilder is that it serves a specific
purpose.  It has at least the same performance as dict(zip(keys, values))
on all Python implementations.
Libraries like the csv parser can use it without worrying about its performance
on Pythons other than CPython.

A singular purpose isn't necessarily an obvious advantage. We're better off with generic building blocks that our users can compose in ways that were originally non-obvious (and then as patterns emerge we can look at ways to simplify or formalise them).

(Randomizing side note: is this scenario enough to make a case for a
built-in data frame type?)

https://xkcd.com/927/

Yep. The difference is that as the language team, our standard wins by default ;)

(For those who don't click links, it's pointing at the "let's make a new standard" XKCD comic)

* when d2.update() only touches existing keys, there's no need to rebuild the table
* a duplicated key overwrites multiple times - what else are you going
to do?

But all the keys still have to be looked up.  That is very similar overhead to
rebuilding the hash table.

See my suggestion above - when we know the keys are shared, we can skip the lookup, and there are ways we can detect that they're shared. (Perhaps it is also faster to start by assuming they are shared and test each one, rather than assuming they are unshared? That might be worth testing.)
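
Here's a pure-Python sketch of that "assume shared, verify cheaply" idea, with a made-up helper name; the real check would presumably happen at the C level inside dict.update, where the key tables can be compared directly and the per-key lookups below disappear entirely:

def update_assuming_shared(dst, keys, values):
    # If every key matches by identity, the tables are effectively the
    # same and we can assign the new values straight onto the old keys;
    # otherwise fall back to a regular update.
    if len(dst) == len(keys) and all(a is b for a, b in zip(dst, keys)):
        for k, v in zip(keys, values):
            dst[k] = v
    else:
        dst.update(zip(keys, values))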

Cheers,
Steve
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev