On Fri, Feb 12, 2016 at 3:45 PM, Paulo da Silva
<p_s_d_a_s_i_l_v_a...@netcabo.pt> wrote:
>> Correct. Two equal strings, passed to sys.intern(), will come back as
>> identical strings, which means they use the same memory. You can have
>> a million references to the same string and it takes up no additional
>> memory.
> I have being playing with this and found that it is not always true!
> For example:
>
> In [1]: def f(s):
>    ...:     print(id(sys.intern(s)))
>    ...:
>
> In [2]: import sys
>
> In [3]: f("12345")
> 139805480756480
>
> In [4]: f("12345")
> 139805480755640
>
> In [5]: f("12345")
> 139805480756480
>
> In [6]: f("12345")
> 139805480756480
>
> In [7]: f("12345")
> 139805480750864
>
> I think a dict, as MRAB suggested, is needed.
> At the end of the store process I may delete the dict.

I'm not 100% sure of what's going on here, but my suspicion is that a
string that isn't being used is allowed to be flushed from the
dictionary. If you retain a reference to the string (not to its id,
but to the string itself), you shouldn't see that change. By doing the
dict yourself, you guarantee that ALL the strings will be retained,
which can never be _less_ memory than interning them all, and can
easily be _more_.

>> But I reiterate: Don't even bother with this unless you know your
>> program is running short of memory.
>
> Yes, it is.
> This is part of a previous post (sets of equal files) and I need lots of
> memory for performance reasons. I only have 2G in this computer.

How many files, roughly? Do you ever look at the contents of the
files? Most likely, you'll be dwarfing the files' names with their
contents. Unless you actually have over two million unique files, each
one with over a thousand characters in the name, you can't use all
that 2GB with file names. If virtual memory is active, all that'll
happen is that you dip into the swapper / page file a bit... and THAT
is when you start looking at reducing memory usage. Don't bother
optimizing until you need to, and even then, you measure first to see
what part of the program actually needs to be optimized.

> I already had implemented a solution. I used two dicts. One to map
> dirnames to an int handler and the other to map the handler to dir
> names. At the end I deleted the 1st. one because I only need to get the
> dirname from the handler. But I thought there should be a better choice.

If all your dir names are interned, their identities (approximately
the values returned by id(), but not quite) will be those handlers for
you, without any overhead and without any complexity.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
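
A minimal sketch of the point about retaining references, assuming
CPython 3. The helper intern_all and the directory names are
hypothetical, made up purely for illustration. As long as something
(here, a list) holds on to the interned strings, equal names should
come back as one shared object:

```python
import sys

def intern_all(names):
    """Return a list of interned strings; the list keeps them alive."""
    return [sys.intern(name) for name in names]

# Build the names at run time so equal names start out as separate objects.
# (Hypothetical paths, for illustration only.)
raw = ["/home/user/dir" + str(i % 3) for i in range(9)]
dirs = intern_all(raw)

# While `dirs` holds references, equal names are the very same object,
# so an identity check (`is`) can stand in for a separate handle.
same_object = dirs[0] is dirs[3]

# Nine entries, but only three distinct string objects behind them.
distinct_objects = len({id(d) for d in dirs})

print(same_object, distinct_objects)
```

The ids printed in the quoted session belong to strings that nothing
was holding on to after each call, which would fit the suspicion above;
here the list does the holding.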