More thoughts. (Stealing your style of headers.)

Just store a pointer to value
=============================
Instead of having the inner dict store key/value pairs. In C, the values in our hash tables will be:

    struct refcell {
        PyObject *value;  /* NULL if deleted */
    };

It's not necessary to store the key. I think I only had it so I could mark it None in the Python implementation, to denote a deleted key. But a deleted entry could just have `cell.value is ScopeDict.NULL` (C: `cell.value == NULL`).

The scope dict will own all values which don't have exposed references, and all exposed references (which own the value they reference).

(Alternatively, store the value directly in the hash table. If something asks for a reference to it, replace the value with a PyObject that refers to it, and flag that entry in the auxiliary hash table.)

This might be what PyCellObject is, which is how I realized that I didn't need the key: https://docs.python.org/3.5/c-api/cell.html

Deleting from scope
===================
When deleting a key, don't remove the key from the inner dict; just set the reference to NULL. Outside code should never get the raw C `refcell`, only a Python object. This makes it possible to clean up unused references when the dict expands or contracts: for each `refcell`, if it has no Pair object (or its Pair object is not referenced by anything else), and its value is NULL (i.e. deleted), don't store it in the new hash table.

Get pairs before their keys are defined
=======================================
When the interpreter compiles a function, it can request references which _don't exist yet_. The scope dict would simply create the entry in its inner dict and fill it in when needed. That means that each name only needs to be looked up when a function is created.

    scope = ScopeDict()
    f = scope.ref('f')
    scope['f'] = float
    f.value('NaN')

This would be a memory issue if many functions are created with typo'd names. But:
- You're not making a gigantic number of functions in the first place.
- You'll eventually remove these unused entries when you resize the inner dict.
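Here is a minimal Python sketch of the scheme above. The names `ScopeDict`, `RefCell`, `ref`, and the `NULL` sentinel are mine, not CPython's, and this is only a toy model of the proposal, not an implementation:

```python
class RefCell:
    """Python-level handle to one scope entry (stands in for the C refcell)."""
    NULL = object()  # sentinel playing the role of C's NULL

    def __init__(self):
        self.value = RefCell.NULL


class ScopeDict:
    """Toy scope dict whose inner dict maps names to RefCells, not raw values."""

    def __init__(self):
        self._cells = {}

    def ref(self, key):
        # Hand out a reference even if the name isn't defined yet:
        # the entry is created on demand and filled in later.
        return self._cells.setdefault(key, RefCell())

    def __setitem__(self, key, value):
        self.ref(key).value = value

    def __getitem__(self, key):
        cell = self._cells.get(key)
        if cell is None or cell.value is RefCell.NULL:
            raise NameError(key)
        return cell.value

    def __delitem__(self, key):
        # Deletion clears the cell but keeps the entry, so outstanding
        # references stay valid (they just see "deleted").
        self[key]  # raises NameError if already deleted/missing
        self._cells[key].value = RefCell.NULL


scope = ScopeDict()
f = scope.ref('f')      # reference requested before 'f' exists
scope['f'] = float      # ...and filled in later
print(f.value('NaN'))   # -> nan
del scope['f']
print(f.value is RefCell.NULL)  # -> True: deleted, but the cell survives
```

Note that `del` never touches the inner dict's key, which is exactly what makes previously handed-out references stay coherent.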
(See previous section.) I was concerned about which scope would be responsible for creating the entry, but it turns out that if you use a name in a function, every use of that name has to refer to the same scope. So the following causes a NameError:

    def f():
        def g(x):
            return abs(x)
        for i in range(30):
            print(g(i))
            if i == 10:
                def abs(x):
                    return "abs" + str(x)
            if i == 20:
                del abs

and you can tell which scope to ask for the reference during function compilation.

Does this simplify closures?
============================
(I haven't yet looked at Python's closure implementation.) A refcell (C struct) will be exposed as a RefCell (PyObject), which owns it. This means that RefCell is reference-counted, and if something saved a reference to it, it will persist even after its owning dict is deleted. Thus, when a scope dict is deleted, each refcell without a RefCell object is deleted (and its value is DecRef'd), and the other ones just have their RefCell object decrement a reference. Then closures are free: each inner function has refcounted references to the cells that it uses, and it doesn't need to know whether its parent is alive. (The implementation of closures involves cell objects.)

Overhead
========
If inner functions are being created a lot, that's extra work. But I guess you should expect a lot of overhead if you're doing such a thing.

Readonly refs
=============
It might be desirable to have refs that are allowed to write (e.g. from `global` and `nonlocal`) and refs that aren't. The RefCell would just hold a count of the number of writing refs, separate from the total number of refs. This might enable some optimizations for values with no writing refs. For example, it's possible to implement copying of dicts as a shallow copy if it's KNOWN that any modification would result in a call to its set/del functions, which would initiate a deep copy.

On Tue, Dec 15, 2015 at 3:29 PM, Victor Stinner <victor.stin...@gmail.com> wrote:
> 2015-12-15 12:23 GMT+01:00 Franklin?
> Lee <leewangzhong+pyt...@gmail.com>:
>> I was thinking (as an alternative to versioning dicts) about a
>> dictionary which would be able to return name/value pairs, which would
>> also be internally used by the dictionary. This would be way less
>> sensitive to irrelevant changes in the scope dictionary, but cost an
>> extra pointer to each key.
>
> Do you have an estimation of the cost of the "extra pointer"? Impact
> on memory and CPU. dict is really a very important type for the
> performance of Python. If you make dict slower, I'm sure that Python
> overall will be slower.

I'm proposing it as a subclass.

> It looks tricky to keep the dict and the pair objects consistent,
> especially in terms of atomicity. You will need to keep a reference
> to the pair object in the dict entry, which will also make the dict
> larger (use more memory), right?

Yes, but it will be about 25% bigger than the underlying dict's tables. You store one extra pointer, while the underlying tables store (hash, key, value) per entry: a 64-bit value and two 32-bit values. If Python moves to a compact dict implementation, it will still be about 25% bigger, because the secondary table will be kept in lockstep with the compact array instead of the sparse array.

>> You won't have to keep looking up keys (unless the name is deleted), and
>> functions are allowed to change. For inlining, you can detect whether
>> the function has been redefined by testing the saved pair.value
>> against the saved function, and go into the slow path if needed (or
>> recompile the inlining).
>
> For builtin functions, I also need to detect when a key is created in
> the global namespace. How do you handle this case with pairs?

(I realized you don't need to keep a key, so I threw away pairs.) Instead of detecting key insertion, I allow the creation of references before there's anything to reference.
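The redefinition test quoted above can be sketched like this. The `RefCell` holder and the `make_guarded_call` helper are illustrative names of mine; in real inlining the fast path would be inlined machine code or bytecode rather than a saved callable:

```python
class RefCell:
    """Illustrative stand-in for a scope entry that a caller can watch."""
    def __init__(self, value=None):
        self.value = value


def make_guarded_call(cell):
    saved = cell.value  # the function we "inlined" against

    def call(arg):
        if cell.value is saved:   # guard: name not rebound since compilation
            return saved(arg)     # fast path (stands in for inlined code)
        return cell.value(arg)    # slow path: use whatever the name holds now

    return call


cell = RefCell(len)
fast_len = make_guarded_call(cell)
print(fast_len("abc"))      # -> 3, via the fast path
cell.value = lambda x: -1   # someone rebinds the name
print(fast_len("abc"))      # -> -1, guard fails, slow path taken
```

The point is that the guard is a single pointer comparison against the saved cell contents, with no dict lookup on the hot path.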
In other words, when a function is created that uses a name which isn't yet defined, an entry (pointing to NULL) is created in the scope's inner dict, along with a Python object for that entry. I really think this is a good approach to that problem.

In the case of globals versus __builtin__, the entry would be created in the globals() dict, which initially points to __builtin__'s entry. This would require a double dereference, but I think there's no other way to have nested ambiguous scoping like that (where you don't know where you're looking it up until you need it). If there is, the Python object can hold a "number of indirections". This would allow passing through to __builtin__, but still allow saving RefCells to Python variables. On deletion, it would re-look-up the builtin version.

If you repeatedly create and delete `map` in module scope, it would have to keep looking up the reference to the builtin on each delete. But if you're repeatedly deleting and reusing the same name, you're kind of being a jerk. This can be solved with more overhead.

Alternatively, module scope dicts can be even more special, and hold a pair of references: one to the module scope *value*, one to the __builtin__ *reference*. So a module scope reference will try to return its own value, and if it's NULL, it will ask the builtin reference to try. This means each module has the overhead of its own names plus the overhead of referring to builtin names, and builtins will have a name for... every single module's names. Eh. I'd rather just punish the jerk.

In fact, don't have globals() save a reference to __builtin__'s entry unless it exists at some point: `__builtins__.__dict__.ref("argargarg", create_if_none=False) => None`.

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com