Herve, I think you code just confirms what I said -- for small nt for() wins, otherwise memcpy wins. Taking your measurements (they are a bit crude since they measure overhead as well):
ginaz:sandbox$ time ./hmc.mc 1 real 0m7.294s user 0m7.239s sys 0m0.054s ginaz:sandbox$ time ./hmc 1 real 0m3.773s user 0m3.746s sys 0m0.024s so for() is about 2x faster ginaz:sandbox$ time ./hmc 3 real 0m4.751s user 0m4.718s sys 0m0.023s ginaz:sandbox$ time ./hmc.mc 3 real 0m3.098s user 0m3.051s sys 0m0.045s memcpy is about 50% faster. It also proves me right when I said we should only special-case the common case of scalar recycling and use memcpy for everything else. Cheers, Simon On Apr 23, 2010, at 9:21 PM, Hervé Pagès wrote: > Follow up... > > Hervé Pagès wrote: >> Hi Matthew, >> Matthew Dowle wrote: >>> Just to add some clarification, the suggestion wasn't motivated by speeding >>> up a length 3 vector being recycled 3.3 million times. But its a good >>> point that any change should not make that case slower. I don't know how >>> much vectorCopy is called really, DUPLICATE_ATOMIC_VECTOR seems more >>> significant, which doesn't recycle, and already had the FIXME next to it. >>> >>> Where copyVector is passed a large source though, then memcpy should be >>> faster than any of the methods using a for loop through each element >>> (whether recycling or not), allowing for the usual caveats. What are the >>> timings like if you repeat the for loop 100 times to get a more robust >>> timing ? It needs to be a repeat around the for loop only, not the >>> allocVector whose variance looks to be included in those timings below. >>> Then increase the size of the source vector, and compare to memcpy. >> On my system (DELL LATITUDE laptop with 64-bit 9.04 Ubuntu): >> #include <stdio.h> >> #include <string.h> >> #include <stdlib.h> >> void *memcpy2(char *dest, const char *src, size_t n) >> { >> int i; >> for (i = 0; i < n; i++) *(dest++) = *(src++); >> return dest; >> } >> int main() >> { >> int n, kmax, k; >> char *x, *y; >> n = 25000000; >> kmax = 100; >> x = (char *) malloc(n); >> y = (char *) malloc(n); >> for (k = 0; k < kmax; k++) >> //memcpy2(y, x, n); >> memcpy(y, x, n); >> return 0; >> } >> Benchmarks: >> n = 25000000, kmax = 100, memcpy2: >> real 0m8.123s >> user 0m8.077s >> sys 0m0.040s >> n = 25000000, k = 100, memcpy: >> real 0m1.076s >> user 0m1.004s >> sys 0m0.060s >> n = 25000, kmax = 100000, memcpy2: >> real 0m8.033s >> user 0m8.005s >> sys 0m0.012s >> n = 25000, kmax = 100000, memcpy: >> real 0m0.353s >> user 0m0.352s >> sys 0m0.000s >> n = 25, kmax = 100000000, memcpy2: >> real 0m8.351s >> user 0m8.313s >> sys 0m0.008s >> n = 25, kmax = 100000000, memcpy: >> real 0m0.628s >> user 0m0.624s >> sys 0m0.004s >> So depending on the size of the memory area to copy, GNU memcpy() is >> between 7.5x and 22x faster than using a for() loop. You can reasonably >> expect that the authors of memcpy() have done their best to optimize >> the code for most platforms they support, for big and small memory >> areas, and that if there was a need to branch based on the size of the >> area, that's already done *inside* memcpy() (I'm just speculating here, >> I didn't look at memcpy's source code). > > So for copying a vector of integer (with recycling of the source), > yes, a memcpy-based implementation is much faster, for long and small > vectors (even for a length 3 vector being recycled 3.3 million > times ;-) ), at least on my system: > > nt = 3; ns = 10000000; kmax = 100; copy_ints: > > real 0m1.206s > user 0m1.168s > sys 0m0.040s > > nt = 3; ns = 10000000; kmax = 100; copy_ints2: > > real 0m6.326s > user 0m6.264s > sys 0m0.052s > > > Code: > ======================================================================= > #include <stdio.h> > #include <string.h> > #include <stdlib.h> > > void memcpy_with_recycling_of_src(char *dest, size_t dest_nblocks, > const char *src, size_t src_nblocks, > size_t blocksize) > { > int i, imax, q; > size_t src_size; > > imax = dest_nblocks - src_nblocks; > src_size = src_nblocks * blocksize; > for (i = 0; i <= imax; i += src_nblocks) { > memcpy(dest, src, src_size); > dest += src_size; > i += src_nblocks; > } > q = dest_nblocks - i; > if (q > 0) > memcpy(dest, src, q * blocksize); > return; > } > > void copy_ints(int *dest, int dest_length, > const int *src, int src_length) > { > memcpy_with_recycling_of_src((char *) dest, dest_length, > (char *) src, src_length, > sizeof(int)); > } > > /* the copyVector() way */ > void copy_ints2(int *dest, int dest_length, > const int *src, int src_length) > { > int i; > > for (i = 0; i < dest_length; i++) > dest[i] = src[i % src_length]; > } > > int main() > { > int nt, ns, kmax, k; > int *t, *s; > > nt = 3; > ns = 10000000; > kmax = 100; > t = (int *) malloc(nt * sizeof(int)); > s = (int *) malloc(ns * sizeof(int)); > for (k = 0; k < kmax; k++) > //copy_ints(s, ns, t, nt); > copy_ints2(s, ns, t, nt); > return 0; > } > > Note that the function that actually does the job is > memcpy_with_recycling_of_src(). It can be reused for copying > vectors with elements of an arbitrary size. > > Cheers, > H. > >> Cheers, >> H. >>> >>> Matthew >>> >>> "William Dunlap" <wdun...@tibco.com> wrote in message >>> news:77eb52c6dd32ba4d87471dcd70c8d70002ce6...@na-pa-vbe03.na.tibco.com... >>> If I were worried about the time this loop takes, >>> I would avoid using i%nt. For the attached C code >>> compile with gcc 4.3.3 with -O2 I get >>> > # INTEGER() in loop >>> > system.time( r1 <- .Call("my_rep1", 1:3, 1e7) ) >>> user system elapsed >>> 0.060 0.012 0.071 >>> >>> > # INTEGER() before loop >>> > system.time( r2 <- .Call("my_rep2", 1:3, 1e7) ) >>> user system elapsed >>> 0.076 0.008 0.086 >>> >>> > # replace i%src_length in loop with j=0 before loop and >>> > # if(++j==src_length) j=0 ; >>> > # in the loop. >>> > system.time( r3 <- .Call("my_rep3", 1:3, 1e7) ) >>> user system elapsed >>> 0.024 0.028 0.050 >>> > identical(r1,r2) && identical(r2,r3) >>> [1] TRUE >>> >>> The C code is: >>> #define USE_RINTERNALS /* pretend we are in the R kernel */ >>> #include <R.h> >>> #include <Rinternals.h> >>> >>> >>> SEXP my_rep1(SEXP s_src, SEXP s_dest_length) >>> { >>> int src_length = length(s_src) ; >>> int dest_length = asInteger(s_dest_length) ; >>> int i,j ; >>> SEXP s_dest ; >>> PROTECT(s_dest = allocVector(INTSXP, dest_length)) ; >>> if(TYPEOF(s_src) != INTSXP) error("src must be integer data") ; >>> for(i=0;i<dest_length;i++) { >>> INTEGER(s_dest)[i] = INTEGER(s_src)[i % src_length] ; >>> } >>> UNPROTECT(1) ; >>> return s_dest ; >>> } >>> SEXP my_rep2(SEXP s_src, SEXP s_dest_length) >>> { >>> int src_length = length(s_src) ; >>> int dest_length = asInteger(s_dest_length) ; >>> int *psrc = INTEGER(s_src) ; >>> int *pdest ; >>> int i ; >>> SEXP s_dest ; >>> PROTECT(s_dest = allocVector(INTSXP, dest_length)) ; >>> pdest = INTEGER(s_dest) ; >>> if(TYPEOF(s_src) != INTSXP) error("src must be integer data") ; >>> /* end of boilerplate */ >>> for(i=0;i<dest_length;i++) { >>> pdest[i] = psrc[i % src_length] ; >>> } >>> UNPROTECT(1) ; >>> return s_dest ; >>> } >>> SEXP my_rep3(SEXP s_src, SEXP s_dest_length) >>> { >>> int src_length = length(s_src) ; >>> int dest_length = asInteger(s_dest_length) ; >>> int *psrc = INTEGER(s_src) ; >>> int *pdest ; >>> int i,j ; >>> SEXP s_dest ; >>> PROTECT(s_dest = allocVector(INTSXP, dest_length)) ; >>> pdest = INTEGER(s_dest) ; >>> if(TYPEOF(s_src) != INTSXP) error("src must be integer data") ; >>> /* end of boilerplate */ >>> for(j=0,i=0;i<dest_length;i++) { >>> *pdest++ = psrc[j++] ; >>> if (j==src_length) { >>> j = 0 ; >>> } >>> } >>> UNPROTECT(1) ; >>> return s_dest ; >>> } >>> >>> Bill Dunlap >>> Spotfire, TIBCO Software >>> wdunlap tibco.com >>> >>>> -----Original Message----- >>>> From: r-devel-boun...@r-project.org >>>> [mailto:r-devel-boun...@r-project.org] On Behalf Of Romain Francois >>>> Sent: Wednesday, April 21, 2010 12:32 PM >>>> To: Matthew Dowle >>>> Cc: r-de...@stat.math.ethz.ch >>>> Subject: Re: [Rd] suggestion how to use memcpy in duplicate.c >>>> >>>> Le 21/04/10 17:54, Matthew Dowle a écrit : >>>>>> From copyVector in duplicate.c : >>>>> void copyVector(SEXP s, SEXP t) >>>>> { >>>>> int i, ns, nt; >>>>> nt = LENGTH(t); >>>>> ns = LENGTH(s); >>>>> switch (TYPEOF(s)) { >>>>> ... >>>>> case INTSXP: >>>>> for (i = 0; i< ns; i++) >>>>> INTEGER(s)[i] = INTEGER(t)[i % nt]; >>>>> break; >>>>> ... >>>>> >>>>> could that be replaced with : >>>>> >>>>> case INTSXP: >>>>> for (i=0; i<ns/nt; i++) >>>>> memcpy((char *)DATAPTR(s)+i*nt*sizeof(int), (char >>>> *)DATAPTR(t), >>>>> nt*sizeof(int)); >>>>> break; >>>> or at least with something like this: >>>> >>>> int* p_s = INTEGER(s) ; >>>> int* p_t = INTEGER(t) ; >>>> for( i=0 ; i < ns ; i++){ >>>> p_s[i] = p_t[i % nt]; >>>> } >>>> >>>> since expanding the INTEGER macro over and over has a price. >>>> >>>>> and similar for the other types in copyVector. This won't >>>> help regular >>>>> vector copies, since those seem to be done by the >>>> DUPLICATE_ATOMIC_VECTOR >>>>> macro, see next suggestion below, but it should help >>>> copyMatrix which calls >>>>> copyVector, scan.c which calls copyVector on three lines, >>>> dcf.c (once) and >>>>> dounzip.c (once). >>>>> >>>>> For the DUPLICATE_ATOMIC_VECTOR macro there is already a >>>> comment next to it >>>>> : >>>>> >>>>> <FIXME>: surely memcpy would be faster here? >>>>> >>>>> which seems to refer to the for loop : >>>>> >>>>> else { \ >>>>> int __i__; \ >>>>> type *__fp__ = fun(from), *__tp__ = fun(to); \ >>>>> for (__i__ = 0; __i__< __n__; __i__++) \ >>>>> __tp__[__i__] = __fp__[__i__]; \ >>>>> } \ >>>>> >>>>> Could that loop be replaced by the following ? >>>>> >>>>> else { \ >>>>> memcpy((char *)DATAPTR(to), (char *)DATAPTR(from), >>>> __n__*sizeof(type)); \ >>>>> }\ >>>>> >>>>> In the data.table package, dogroups.c uses this technique, >>>> so the principle >>>>> is tested and works well so far. >>>>> >>>>> Are there any road blocks preventing this change, or is >>>> anyone already >>>>> working on it ? If not then I'll try and test it (on >>>> Ubuntu 32bit) and >>>>> submit patch with timings, as before. Comments/pointers >>>> much appreciated. >>>>> Matthew >>>>> >>>>> ______________________________________________ >>>>> R-devel@r-project.org mailing list >>>>> https://stat.ethz.ch/mailman/listinfo/r-devel >>>>> >>>>> >>>> >>>> -- >>>> Romain Francois >>>> Professional R Enthusiast >>>> +33(0) 6 28 91 30 30 >>>> http://romainfrancois.blog.free.fr >>>> |- http://bit.ly/9aKDM9 : embed images in Rd documents >>>> |- http://tr.im/OIXN : raster images and RImageJ >>>> |- http://tr.im/OcQe : Rcpp 0.7.7 >>>> >>>> ______________________________________________ >>>> R-devel@r-project.org mailing list >>>> https://stat.ethz.ch/mailman/listinfo/r-devel >>>> >>> >>> ______________________________________________ >>> R-devel@r-project.org mailing list >>> https://stat.ethz.ch/mailman/listinfo/r-devel > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpa...@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > ______________________________________________ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel > > ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel