On Sun, Jun 10, 2012 at 1:43 AM, Dag Sverre Seljebotn <d.s.seljeb...@astro.uio.no> wrote: > On 06/10/2012 10:23 AM, Robert Bradshaw wrote: >> >> On Sun, Jun 10, 2012 at 1:00 AM, Dag Sverre Seljebotn >> <d.s.seljeb...@astro.uio.no> wrote: >>> >>> On 06/10/2012 09:34 AM, Robert Bradshaw wrote: >>>> >>>> >>>> On Sun, Jun 10, 2012 at 12:14 AM, Dag Sverre Seljebotn >>>> <d.s.seljeb...@astro.uio.no> wrote: >>>>> >>>>> >>>>> >>>>> >>>>> Robert Bradshaw<rober...@gmail.com> wrote: >>>>> >>>>>> On Fri, Jun 8, 2012 at 10:45 PM, Dag Sverre Seljebotn >>>>>> <d.s.seljeb...@astro.uio.no> wrote: >>>>>>> >>>>>>> >>>>>>> I'd love to not do interning, but I see no way around it. >>>>>> >>>>>> >>>>>> >>>>>> No, I want to use the lower 64 bits by default, but always have the >>>>>> top 96 bits around to allow using this mechanism in "secure" mode at a >>>>>> slight penalty. md5 is out because there are known collisions. (Yes, >>>>>> sha-1 may succumb sooner rather than later, theoretical weaknesses >>>>>> have been shown, so we could look to using something else (hopefully >>>>>> still shipped with Python). >>>>> >>>>> >>>>> >>>>> But very few users are going to know about this. What's the odds that >>>>> the >>>>> user who decide to trigger JIT-compilation with function signatures >>>>> that >>>>> varies based on the input will know about the option and turn it on and >>>>> also >>>>> recompile all his/her C extension modules? >>>>> >>>>> In practice, such an option would always stay at its default value. If >>>>> we >>>>> leave it to secure by default and start teaching it to users from the >>>>> start...but that's a big price to pay. >>>> >>>> >>>> >>>> Yes, it's not ideal from this perspective. >>>> >>>>> And if you *do* want to run in secure mode, it will be a lot slower >>>>> than >>>>> interning. >>>> >>>> >>>> >>>> Are you thinking that the 64-bit interned pointer would be used as the >>>> hash? In this case all hashtables would have to be constructed at >>>> runtime, which means it needs to be really, really cheap (well under a >>>> milisecond, I'm sure Sage has>1000 classes,>10000 methods it imports >>>> at startup). Also I'm not sure how the very-uneven distribution would >>>> play out for constructing perfect hastables (perhaps it won't hurt, >>>> there's likely to be long runs of consecutive values in some cases. >>> >>> >>> >>> No, I'm thinking that callsites need both the 64-bit interned char* and >>> the >>> 64-bit hash of the *contents*. They use the hash to figure out the >>> position, >>> then compare by ID. >> >> >> Ah, I missed that bit. OK, yes, that could work well. > > > Ah, we've been talking past one another for some time then. OK, let's settle > on that. > > >> >>> The hash is not stored in callees, it's discarded after figuring out the >>> table layout. >>> >>> (There was this idea that if the char* has least significant bit set, >>> we'd >>> hash it directly rather than dereference it, but let's ignore that for >>> now.) >> >> >> (For the purpose of this discussion, it's part of the "interning" step.) >> >>> I don't think under a millisecond is unfeasible to hash smallish tables >>> -- >>> we could put the pointer through a cheap hash to create more entropy (for >>> the perfect hashing, being able to select a hash function through the>>r >>> is >>> important, so you can't just use the pointer directly -- but there are >>> functions cheaper than md5, e.g, in here: http://code.google.com/p/ulib/) >> >> >> Just a sec, we're not hashing pointers, but the full signature itself, >> right? For our hash function we need >> >> (1) Collision free on 64-bits (for non-malicious use). >> (2) Good distribution (including for short strings, which is harder to >> come by). >> (2b) Any small subset of bits should have property (2). >> (3) Ideally easy to reference (e.g. "md5" is better than "these 100 >> lines of C code"). >> >> Cheap runtime construction is still ideal, but much less of an issue >> if hashes (and perfect tables) can be constructed at compile time, >> which I think this scheme allows. > > > Yes, 64 bits of md5 then?
+1 for me. > ulib contains "100 lines of C code" for md5 > anyway, if one doesn't want to go through Python hashlib (I imagine e.g. > hashlib might be unavailable somewhere as it relies on openssl and there's > license war going on vs. gnutls and so on. And the md5 module is > deprecated.). Just the interface, right? (hashlib should be used instead...) >>> That would save us a register and make the instructions shorter in some >>> places I guess...I think it's really miniscule, it's not like the effect >>> of >>> load of a global variable. But if you like this approach I can benchmark >>> C-written hashtable creation and see. >> >> >> This will have value in and of itself (both the implementation and the >> benchmarks). > > > Will do (eventually, less spare time in coming week). > > About signatures, a problem I see with following the C typing is that the > signature "ill" wouldn't hash the same as "iii" on 32-bit Windows and "iqq" > on 32-bit Linux, and so on. I think that would be really bad. This is why I suggested promotion for scalars (divide ints into <=sizeof(long) and sizeof(long) < x <= sizeof(long long)), checked at C compile time, though I guess you consider that evil. I don't consider not matching really bad, just kind of bad. > "l" must be banished -- but then one might as well do "i4i8i8". > > Designing a signature hash where you select between these at compile-time is > perhaps doable but does generate a lot of code and makes everything > complicated. It especially gets messy when you're trying to pre-compute tables. > I think we should just start off with hashing at module load > time when sizes are known, and then work with heuristics and/or build system > integration to improve on that afterwards. Finding 10,000 optimal tables at runtime better be really cheap than for Sage's sake :). - Robert _______________________________________________ cython-devel mailing list cython-devel@python.org http://mail.python.org/mailman/listinfo/cython-devel