Nathaniel Smith <n...@pobox.com> wrote:
>On Wed, Jun 13, 2012 at 5:04 PM, Dag Sverre Seljebotn
><d.s.seljeb...@astro.uio.no> wrote:
>> On 06/13/2012 03:33 PM, Nathaniel Smith wrote:
>>> I'm inclined to say therefore that we should just drop the "open
>>> type" idea, since it adds complexity but doesn't seem to actually
>>> solve the problem it's designed for.
>>
>> If one wants to have an "open", hassle-free enum, an alternative
>> would be to cryptographically hash the enum string. I'd trust 64 bits
>> of hash for this purpose.
>>
>> The obvious disadvantage is the extra space used, but it'd be a bit
>> more hassle-free compared to regular enums; you'd never have to fix
>> the set of enum strings and they'd always be directly comparable
>> across different arrays. HDF libraries etc. could compress it at the
>> storage layer, storing the enum mapping in the metadata.
>
>You'd trust 64 bits to be collision-free for all strings ever stored
>in numpy, eternally? I wouldn't. Anyway, if the goal is to store an
>arbitrary set of strings in 64 bits apiece, then there is no downside
>to just using an object array + interning (like pandas does now), and
>this *is* guaranteed to be collision free. Maybe it would be useful to
>have a "heap string" dtype, but that'd be something different.

Heh, we've had this discussion before :-)

An 'interned heap string' dtype may indeed be something different, but
it could meet the 'open enum' use cases (assuming they exist) better
than making enums complicated. Consider it a backup strategy if the
open enum idea can't be killed off otherwise.

>AFAIK all the cases where an explicit categorical type adds value over
>this are the ones where having an explicit set of levels is useful.
>Representing HDF5 enums or R factors requires a way to specify
>arbitrary string<->integer mappings, and there are algorithms (e.g. in
>charlton) that are much more efficient if they can figure out what the
>set of possible levels is directly without scanning the whole array.

For interned strings, the set of strings present could in principle be
stored in the array itself (though I guess that would be very difficult
to implement in current numpy).

The perfect hash schemes we've been exploring on the Cython list lately
take around 10-20 microseconds on my 1.8 GHz machine to rehash a
64-element table (the worst case for an insertion, which happens more
often than it would with insertion into a regular hash table) and 0.5-2
nanoseconds for a lookup that hits L1 cache (the lookup always succeeds
on the first probe if the entry is in the table).

Dag
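
PS: For concreteness, a rough sketch of the hashed-string idea quoted
above. This is only an illustration; the hash function (sha1 truncated
to 64 bits) and all the names here are placeholders I'm making up, not
anything that was benchmarked or proposed for an actual dtype:

import hashlib
import numpy as np

def hash64(s):
    # Truncate a cryptographic hash to 64 bits; collisions across all
    # strings ever stored are exactly what Nathaniel objects to above.
    return int.from_bytes(hashlib.sha1(s.encode('utf-8')).digest()[:8],
                          'little')

labels = ["red", "green", "blue", "green"]
codes = np.array([hash64(s) for s in labels], dtype=np.uint64)

# The code <-> string mapping would live outside the array (e.g. in
# HDF5 metadata). Since a code depends only on the string, two arrays
# hashed independently stay directly comparable.
mapping = {hash64(s): s for s in set(labels)}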
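
PPS: And the object-array-plus-interning alternative Nathaniel
describes (the approach he attributes to pandas), again only as an
illustration of the idea:

import sys
import numpy as np

values = ["apple", "banana", "apple", "cherry"]

# Interning gives one canonical object per distinct string, so equal
# strings share a pointer and comparisons can never collide.
arr = np.array([sys.intern(v) for v in values], dtype=object)

# The set of levels present has to be recovered by scanning, which is
# the part an explicit categorical type would make cheap.
levels = set(arr)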