[Python-ideas] Re: Sets for easy interning(?)

Andrew Barnert via Python-ideas Tue, 03 Dec 2019 10:28:12 -0800

> On Dec 3, 2019, at 03:41, Steven D'Aprano <[email protected]> wrote:
> On Tue, Dec 03, 2019 at 01:54:44AM -0800, Andrew Barnert via Python-ideas 
> wrote:
>>>>> On Dec 2, 2019, at 16:27, Soni L. <[email protected]> wrote:
>>>> Even use-cases where you have different objects whose differences are 
>>>> ignored for __eq__ and __hash__ and you want to grab the one from the set 
>>>> ignoring their differences would benefit from this.
>> A more concrete use case might help make the argument better.
> 
> Is interning concrete enough?


No. A concrete use for interning would be, but interning itself isn’t.

If you’re using interning for functionality, to distinguish two equal strings 
that came from different inputs or processes, your code is probably broken. 
Python is allowed to merge distinct equal values of builtin immutable types 
whenever it wants to. And different interpreters, and even different CPython 
versions, may do that in different cases. That means any code that relies on 
the result of `is` on two equal immutable values is wrong.

If you don’t care about portability or future compatibility, you could always 
work out the rules for one interpreter, version, and build. But they’re pretty 
complicated. IIRC, the current rules for a default build of CPython are 
something like this:

* Two equal string literals in the same scope are identical.
* Two string expressions in the same scope with equal values that the optimizer 
is able to turn into constants are identical.
* There’s some rule for interactive literals that I don’t remember, so even 
though two top-level interactive statements are compiled and evaluated as 
separate scopes they can still share constant string values.
* Two empty strings are identical if they’re created by any builtin, but it’s 
possible to create distinct ones with the C API.
* Some single-character strings are treated the same as the empty string; the 
exact set is a compile-time option but defaults to all printable ASCII 
characters or all ASCII characters or something like that.
* Copying a string with [:] or even copy.deepcopy gives you the same string.

And there are similar but not identical rules for bytes and int, while bools 
and None are stricter (even C extensions can’t give you a distinct but equal 
None value), and float and tuple are looser (inf is a singleton like “”, but 
every float('inf') returns a new value anyway). And I can’t remember how tuple 
scope merging changed when tuples deeper than 1 were allowed to become 
constants.
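
If you want to see what your own interpreter does, here is a minimal sketch that 
probes a few of the cases above. The function name identity_report and the 
sample strings are made up for illustration. It deliberately shows no expected 
output, because every result is an implementation detail that may differ 
between interpreters, versions, and builds.

```python
import copy

def identity_report():
    """Map a short description to the result of an `is` check on equal values."""
    a = "hello world"
    b = "hello world"                       # equal literal, same scope
    c = "".join(["hello", " ", "world"])    # equal value built at run time
    return {
        "same-scope literals": a is b,
        "built at run time": a is c,
        "slice copy a[:]": a[:] is a,
        "copy.deepcopy(a)": copy.deepcopy(a) is a,
        "empty strings from builtins": str() is "".join([]),
        "two float('inf') calls": float("inf") is float("inf"),
    }

report = identity_report()
for description, identical in report.items():
    # Results are implementation details; do not rely on any of them.
    print(f"{description:30} {identical}")
```

Whatever it prints on your machine, treat it as an observation about one build, 
not as behavior your code can depend on.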

So, what can you actually safely do with interning?

You could try to optimize your code by interning a bunch of your strings and 
then using `a is b or a == b` instead of just `a == b`, but this will almost 
always make it slower, not faster.
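
Here is a rough way to measure that yourself, as a sketch rather than a proper 
benchmark. The helper name time_comparisons and the test strings are 
assumptions for illustration. As far as I know, CPython's string equality 
already checks identity first, so the explicit `is` mostly adds an extra 
operation, but the numbers depend on your interpreter and machine.

```python
import sys
import timeit

def time_comparisons(number=200_000):
    """Time plain == against the `is`-first variant; returns seconds per label."""
    base = "some moderately long key " * 4
    a = sys.intern(base)
    b = sys.intern(base[:-1] + base[-1])  # equal value, interned to the same entry
    c = base[:-1] + "!"                   # same length, differs only at the end
    namespace = {"a": a, "b": b, "c": c}
    cases = {
        "equal:   a == b": "a == b",
        "equal:   a is b or a == b": "a is b or a == b",
        "unequal: a == c": "a == c",
        "unequal: a is c or a == c": "a is c or a == c",
    }
    return {
        label: timeit.timeit(stmt, globals=namespace, number=number)
        for label, stmt in cases.items()
    }

timings = time_comparisons()
for label, seconds in timings.items():
    print(f"{label:28} {seconds:.4f}s")
```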

What about optimizing for memory instead of speed? Interning a string would 
waste, say, 24 bytes, but if you have 1000 copies of that same string, N+24 is 
a lot better than N*1000. But what kind of application are you building that 
stores vast numbers of duplicates of strings and isn’t storing them in a set or 
dict or database or custom b-tree or trie or whatever? And once you do that, it 
doesn’t matter whether the boxed Python values are interned, only whether the 
values inside that data structure are collapsed (and in all those cases, they 
either are or trivially could be).

Maybe you can come up with some application that does need to store a billion 
copies of only a thousand strings, and needs to store them in a list (or a 
billion separate locals, I guess…). If so, then you’ve got a concrete use case.
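
For that list case, a minimal sketch of collapsing duplicates with a plain dict 
might look like the following. The name canonicalize and the sample keys are 
hypothetical. For exact str values sys.intern does a similar job, while the 
dict version works for any hashable type.

```python
def canonicalize(items, pool=None):
    """Replace each item with the first equal item seen, so duplicates share one object."""
    if pool is None:
        pool = {}
    # setdefault stores the item on first sight and returns the stored one afterwards.
    return [pool.setdefault(item, item) for item in items]

# Keys built at run time, so equal values may start out as separate objects.
raw = ["key-%d" % (i % 3) for i in range(1000)]
shared = canonicalize(raw)

# Count how many distinct objects the canonicalized list refers to.
distinct_objects = len({id(s) for s in shared})
print(distinct_objects)
```

The list still has 1000 entries, but they all point at one of three string 
objects, which is the only part of interning that matters for memory.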

> The Python interpreter interns at least two kinds of objects: ints and 
> strings, or rather, *some* ints and strings.

This is of course the CPython interpreter; different interpreters will be 
different.

> Back in Python 1.5, there 
> was a built-in for interning strings:
> 
>   # Yes I still have a 1.5 interpreter :-)
>   >>> a = intern("hello world")
>   >>> b = intern("hello world")
>   >>> a is b
>   1

And (at least in Pythonista, which currently embeds CPython 3.6.1, but I’m not 
sure its REPL behavior is always identical to the stock one):

>>> a = 'hello'
>>> b = 'hello'
>>> a is b
   True

By the way, intern was still a builtin through 2.7 (3.0 moved it to 
sys.intern), just in that list of “we can’t deprecate these but please never 
use them” functions at the end of builtins, so you didn’t actually need 1.5 to 
test it. But I understand; you can never be too 
sure that the 2.0 license won’t turn out to be as unusable as the 1.6 license, 
so you need something to fall back on. :)
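
For anyone following along on Python 3, the same experiment is spelled with 
sys.intern. This is only a sketch: the strings are built at run time so the 
compiler has no chance to merge them as constants first.

```python
import sys

# Build two equal strings at run time, then intern both.
a = sys.intern("".join(["hello", " ", "world"]))
b = sys.intern(" ".join(["hello", "world"]))

# Both names now refer to the single interned copy.
same = a is b
print(same)
```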


_______________________________________________
Python-ideas mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/[email protected]/message/GI5YSUB5NSXQDZNML7EGPVT7RA5BTSDY/
Code of Conduct: http://python.org/psf/codeofconduct/
