Yes... as is so often the case, the primary benefit of the exercise is an increased understanding.
Speaking of which, what are you using to get the assembly representations of the memcpy internals? Thanks, -- Raul On Wed, Feb 4, 2015 at 3:54 PM, Joe Bogner <[email protected]> wrote: > I chased down the difference in the performance in my infix and J's infix > > As I figured, J's algorithm was more efficient. I implemented the same > algorithm and cut the time by nearly a factor of 2 to 0.78 seconds. > Still behind J's 0.43 seconds. Inspecting the assembly shows that TCC > is executing similar instructions but somewhat different. Also, I > might assume that Visual C's memcpy is more optimized than the TCC > version (see below) > > In any case, this serves as a reminder as to why interpreted languages > can be faster than compiled ones. J's algorithm was written more > efficiently than mine: > > J (effectively) > > char *x=newMem; > char *y=mem; > for(int i=0;i<d;i++) { > memcpy(x,y,infixLen); > x+=infixLen; > y+=1; > } > > My attempt: > > long offset = 0; > long idx = 0; > > while(1) { > for(int q=0;q<infixLen;q++) { > newMem[idx++] = mem[q+offset]; > } > offset++; > if ((len-offset) <= infixLen) { > break; > } > } > > Mine has an extra loop that can switched to memcpy which can do > smarter things than a loop knowing the size ahead of time. > > It just so happens that msvcrt's memcpy has special code for moves of > different sizes: > > In the case of copying 9 bytes > > It looks like it can move 8 bytes in only a few instructions instead > of each individual byte that the loop would do > > MoveSmall9: > 000007FEC5F53EC8 movzx rax,byte ptr [rdx] > 000007FEC5F53ECC mov rcx,qword ptr [rdx+1] > 000007FEC5F53ED0 mov byte ptr [r10],al > 000007FEC5F53ED3 mov qword ptr [r10+1],rcx > 000007FEC5F53ED7 mov rax,r11 > 000007FEC5F53EDA ret > > > Compared to: > > MoveSmall8: > 000007FEC5F53EBE mov rax,qword ptr [rdx] > 000007FEC5F53EC1 mov qword ptr [r10],rax > 000007FEC5F53EC4 mov rax,r11 > 000007FEC5F53EC7 ret > > Or > > MoveSmall10: > 000007FEC5F53EDB movzx rax,word ptr [rdx] > 000007FEC5F53EDF mov rcx,qword ptr [rdx+2] > 000007FEC5F53EE3 mov word ptr [r10],ax > 000007FEC5F53EE7 mov qword ptr [r10+2],rcx > 000007FEC5F53EEB mov rax,r11 > 000007FEC5F53EEE ret > > > So as I suspected, a smarter algorithm and a better C implementation. > Having C code generation still opens up many doors but definitely > should be used sparingly since J's algorithms are finely tuned and are > compiled with an optimized library / compiler > > > > > > > On Wed, Feb 4, 2015 at 3:17 PM, Raul Miller <[email protected]> wrote: >> Probably should delete the {. from the line declaring the bs value. >> >> That's an [hopefully] irrelevant artifact from an earlier version of the >> code. >> >> Thanks, >> >> -- >> Raul >> >> On Wed, Feb 4, 2015 at 1:43 PM, Joe Bogner <[email protected]> wrote: >>> Raul, >>> >>> That is an elegant answer and one I'll add to my notes to reference in >>> the future. >>> >>> It takes approximately 50 seconds to process 240 million 9 character >>> infixes using 50 million character blocks. >>> >>> As a comparison, I wrote an integration to TCC which enables >>> generating and calling C code from J >>> >>> https://github.com/joebo/lang-lab/blob/master/j/tcc.ijs >>> >>> It could use some cleanup and it's not tested on linux or win32 >>> >>> It includes everything to build the stub dll and then some examples. >>> >>> Here's an example of creating infixes on a character array: >>> >>> testInfix_code=: 0 : 0 >>> #include <stdio.h> >>> >>> void free2(long ptr) { >>> free(ptr); >>> } >>> >>> long infix(long pmem, long len, int infixLen, long long *out) { >>> long long allocLen = len*sizeof(char)*infixLen; >>> char *newMem = (char*)malloc(allocLen); >>> char *mem = (char*)pmem; >>> >>> printf("len: %d\n", len); >>> fflush(stdout); >>> long long offset = 0; >>> long long idx = 0; >>> >>> for(long long i=0;i<len && ((len-offset) >= infixLen);i++) { >>> for(long q=0;q<infixLen;q++) { >>> newMem[idx++] = mem[q+offset]; >>> } >>> offset++; >>> } >>> >>> *out = idx; >>> return newMem; >>> } >>> ) >>> >>> To run this: >>> >>> infix=: 3 : 0 >>> txt=.y >>> addr=. mema 4*(#txt) >>> txt memw addr,0,(#txt) >>> infixSize=.9 >>> tcc=:init_tcc_'' >>> compile_tcc_ < testInfix_code >>> ret=.execIntOutInt_tcc_ 'infix';addr;(#txt);infixSize >>> smoutput ret >>> 'memPtr size'=: ret >>> output =: memr memPtr,0,size >>> execInt_tcc_ 'free';memPtr;0 >>> free_tcc_'' >>> output >>> ) >>> >>> testInfix =: 3 : 0 >>> txt=. 'abcdefghijklmnopqrstuvwxyz' >>> infix txt >>> ) >>> >>> _9[\ testInfix'' >>> len: 26 >>> +--------+---+ >>> |15900816|162| >>> +--------+---+ >>> abcdefghi >>> bcdefghij >>> cdefghijk >>> defghijkl >>> efghijklm >>> fghijklmn >>> ghijklmno >>> hijklmnop >>> ijklmnopq >>> jklmnopqr >>> klmnopqrs >>> lmnopqrst >>> mnopqrstu >>> nopqrstuv >>> opqrstuvw >>> pqrstuvwx >>> qrstuvwxy >>> rstuvwxyz >>> >>> >>> >>> I ended up running into problems with sizes greater than 75 million, >>> probably due to a long size issue. >>> >>> Funny enough, it's actually slower than J's version by a factor of 4. >>> >>> 6!:2 'b1=:_9[\(infix 50e6{. txt)' >>> 1.82504 >>> 6!:2 'b2=:9[\ 50e6{. txt' >>> 0.431845 >>> b1-:b2 >>> 1 >>> >>> >>> >>> This could be a combination of TCC's ability to optimize code or it >>> could be J's algorithm is implemented more efficiently than mine >>> (probably both) >>> >>> The actual JIT operation only takes about 5/100 of a second >>> >>> I'm sure I could profile it and speed it up, but I've already spent >>> considerably more time than desired on this "challenge". >>> >>> On the upside, we now have a prototype for generating C code from J >>> >>> >>> >>> On Tue, Feb 3, 2015 at 8:05 PM, Raul Miller <[email protected]> wrote: >>>> Here's how I'd implement the overlapped block thing. >>>> >>>> overlapblock=:2 :0 >>>> : >>>> 'size overlap init'=. x >>>> bs=. size*i.>.(#y)%{.size >>>> bl=. bs-~(#y)<.bs+size+overlap >>>> accum=. init >>>> for_sl. bs,.bl do. >>>> accum=. accum u v ((+i.)/sl){y >>>> end. >>>> ) >>>> >>>> #(10;9;i.0 0) ~.@, overlapblock (~.@(9&(]\))) 30$'abcdefghij' >>>> 10 >>>> >>>> Of course change the value for x and y to match what you want to do... >>>> And maybe recode to swap u and v, if that looks better. >>>> >>>> If that's too slow, it might be better to gather intermediate results >>>> from each block and then post-process them in a single pass? >>>> >>>> Thanks, >>>> >>>> -- >>>> Raul >>>> >>>> On Tue, Feb 3, 2015 at 6:24 PM, Joe Bogner <[email protected]> wrote: >>>>> On Tue, Feb 3, 2015 at 6:17 PM, Raul Miller <[email protected]> wrote: >>>>>> So you are working with non-overlapping infixes? >>>>>> >>>>>> _9[\'abcdefghijklmnopqrstuvwxyz' >>>>>> abcdefghi >>>>>> jklmnopqr >>>>>> stuvwxyz >>>>>> >>>>> >>>>> >>>>> Sorry, no that was a typo. I can use non-overlapping to split up into >>>>> blocks, but I need overlapping for the string. >>>>> >>>>> It is overlapping infixes as I am calculating the unique 9 character >>>>> substrings >>>>> >>>>> 9[\abc >>>>> abcdefghi >>>>> bcdefghij >>>>> cdefghijk >>>>> defghijkl >>>>> efghijklm >>>>> fghijklmn >>>>> ghijklmno >>>>> hijklmnop >>>>> ijklmnopq >>>>> jklmnopqr >>>>> klmnopqrs >>>>> lmnopqrst >>>>> mnopqrstu >>>>> nopqrstuv >>>>> opqrstuvw >>>>> pqrstuvwx >>>>> qrstuvwxy >>>>> rstuvwxyz >>>>> >>>>> # ~. 9[\abc >>>>> 18 >>>>> ---------------------------------------------------------------------- >>>>> For information about J forums see http://www.jsoftware.com/forums.htm >>>> ---------------------------------------------------------------------- >>>> For information about J forums see http://www.jsoftware.com/forums.htm >>> ---------------------------------------------------------------------- >>> For information about J forums see http://www.jsoftware.com/forums.htm >> ---------------------------------------------------------------------- >> For information about J forums see http://www.jsoftware.com/forums.htm > ---------------------------------------------------------------------- > For information about J forums see http://www.jsoftware.com/forums.htm ---------------------------------------------------------------------- For information about J forums see http://www.jsoftware.com/forums.htm
