Thanks a lot. Quick and brief responses below... "Duncan Murdoch" <murd...@stats.uwo.ca> wrote in message news:4b90f134.6070...@stats.uwo.ca... > Matthew Dowle wrote: >> I was hoping for a 'yes', 'no', 'maybe' or 'bad idea because ...'. No >> response resulted in a retry() after a Sys.sleep(10 days). >> >> If its a "yes" or "maybe" then I could proceed to try it, test it, and >> present the test results and timings to you along with the patch. It >> would be on 32bit Ubuntu first, and I would need to either buy, rent time >> on, or borrow a 64bit machine to be able to then test there, owing to the >> nature of the suggestion. >> >> If its "no", "bad idea because..." or "we were already working on it, or >> better", then I won't spend any more time on it. >> >> Matthew >> >> >> "Matthew Dowle" <mdo...@mdowle.plus.com> wrote in message >> news:hlu4qh$l7...@dough.gmane.org... >> >>> Looking at shash in unique.c, from R-2.10.1 I'm wondering if it makes >>> sense to hash the pointer itself rather than the string it points to? >>> In other words could the SEXP pointer be cast to unsigned int and the >>> usual scatter be called on that as if it were integer? >>> > Two negative but probably not fatal issues: > > Pointers and ints are not always the same size. In Win64, ints are 32 > bits, pointers are 64 bits. (Can we be sure there is some integer type > the same size as a pointer? I don't know, ask a C expert.) No we can't be sure. But we could test at runtime, and if the assumption wasn't true, then revert to the existing method.
> > We might want to save the hash to disk. On restore, the pointer based > hash would be all wrong. (I don't know if we actually do ever save a hash > to disk. ) The hash table in unique.c appears to be a temporary private hash, different to the global R_StringHash. Its private hash appears to be used only while the call to unique runs, then free'd. Thats my understanding anyway. The suggestion is not to alter the global R_StringHash in any way at all, which is the one that might be saved to disk now or in the future. > > Duncan Murdoch >>> shash would look like a slightly modified version of ihash like this : >>> >>> static int shash(SEXP x, int indx, HashData *d) >>> { >>> if (STRING_ELT(x,indx) == NA_STRING) return 0; >>> return scatter((unsigned int) (STRING_ELT(x,indx), d); >>> } >>> >>> rather than its current form which appears to hash the string it points >>> to : >>> >>> static int shash(SEXP x, int indx, HashData *d) >>> { >>> unsigned int k; >>> const char *p; >>> if(d->useUTF8) >>> p = translateCharUTF8(STRING_ELT(x, indx)); >>> else >>> p = translateChar(STRING_ELT(x, indx)); >>> k = 0; >>> while (*p++) >>> k = 11 * k + *p; /* was 8 but 11 isn't a power of 2 */ >>> return scatter(k, d); >>> } >>> >>> Looking at sequal, below, and reading its comments, if the pointers are >>> equal it doesn't look at the strings they point to, which lead to the >>> question above. >>> >>> static int sequal(SEXP x, int i, SEXP y, int j) >>> { >>> if (i < 0 || j < 0) return 0; >>> /* Two strings which have the same address must be the same, >>> so avoid looking at the contents */ >>> if (STRING_ELT(x, i) == STRING_ELT(y, j)) return 1; >>> /* Then if either is NA the other cannot be */ >>> /* Once all CHARSXPs are cached, Seql will handle this */ >>> if (STRING_ELT(x, i) == NA_STRING || STRING_ELT(y, j) == NA_STRING) >>> return 0; >>> return Seql(STRING_ELT(x, i), STRING_ELT(y, j)); >>> } >>> >>> Matthew >>> >>> >> >> ______________________________________________ >> R-devel@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-devel >> > ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel