Something to benchmark, if you want to go there (since you're working with constant a width character array):
sort it, then walk through it, to find the uniques. (In C I'd be tempted to do this in two passes, once generating a bit mask and a count, to find out how much memory I need for the result. The second to fill in the result.) Thanks, -- Raul On Thu, Feb 5, 2015 at 11:08 PM, Joe Bogner <[email protected]> wrote: > I decided to take the tcc experiment a few steps further and > implemented a multi-threaded version of infix which is JIT compiled > into the J session > > The C code is far from perfect. Single threaded it takes 3.716 seconds > to process a 9 character overlapping infix on a 242 million character > array. > > J can still natively complete it in in 2 seconds with 9[\ txt > > Using 4 threads, it completes in 2.87 seconds (22% improvement over > single threaded, 43% slower than J's single threaded with MS optimized > memcpy) and uses 100% cpu (good) > > I learned along the way that malloc will fail with allocations greater > than 2gb in the dll loaded into J. Not sure why, but I was able to > work around it by using VirtualAlloc > > I realize this is a deep dive, but figured others may be interested > and it seems chat worthy. It reinforces that multiple threads aren't > always needed, however it proves that we can do multi-threaded code > from within J which may be useful in other applications. > > The next task is to do a multi-threaded overlapped nub count. > > Even though J can do a 9 character overlapping prefix in roughly 2 > seconds, it takes 49 seconds to do a nub of that. > > $ 9[\txt > 242193526 9 > > 6!:2 '# ~. 9[\txt' > 49.4751 > > Since the multi-threaded infix runs similarly to J's native, it seems > I could write 'special code' to do a nub count over the infixes > without saving each one. > > Raul's block approach takes about the same time ~ 50 seconds to complete. > > The nub count infix won't be actually used for anything useful other > than for learning. > > Here is the multi-threaded infix in J: > > testInfixThreaded_code=: 0 : 0 > #include <stdio.h> > #include <windows.h> > #include <process.h> > > #define LONG long long > > const int THREADS = 4; > void free(void* ptr) { > //free wont work since we use VirtualAlloc > //free(ptr); > VirtualFree(ptr, 0, MEM_RELEASE); > } > > > typedef struct ThreadData ThreadData; > struct ThreadData { > void *mem; > int threadId; > LONG d; > int infixLen; > void *newMem; > LONG len; > }; > > int infixThreaded(void *threadData) { > ThreadData *args = (ThreadData*)threadData; > LONG infixLen = args->infixLen; > LONG blockSize = (args->d+(THREADS-1))/THREADS; > LONG d = args->d; > > char *x=(char*)args->newMem; > char *y=args->mem; > LONG ctr = 0; > > for(LONG i=0;i<args->threadId*blockSize;i++) { > x+=infixLen; > y+=1; > ctr+=1; > } > > > for(LONG i=0;i<=blockSize && ctr<=d;i++) { > memcpy(x,y,infixLen); > //printf("start: %d on thread %d, blockSize : %d\n", ctr, > args->threadId, blockSize); > x+=infixLen; > y+=1; > ctr++; > } > > return 0; > } > > > int infix(void* pmem, LONG len, LONG infixLen, LONG *out) { > LONG allocLen = len*sizeof(char)*infixLen; > > //malloc fails on larger than 2gb allocations > //char *newMem = (char*)malloc(allocLen); > char *newMem = VirtualAlloc(0, allocLen, MEM_COMMIT | MEM_RESERVE, > PAGE_READWRITE); > if (NULL == newMem) { > printf("could not allocate\n"); > fflush(stdout); > return -1; > } > > char *mem = (char*)pmem; > LONG d = 1+len-infixLen; > > > //skip threads for small args > if (d < 1000000) { > char *x=newMem; > char *y=mem; > for(int i=0;i<d;i++) { > memcpy(x,y,infixLen); > x+=infixLen; > y+=1; > } > } else { > > HANDLE* threads = (HANDLE*)malloc(sizeof(HANDLE)*THREADS); > ThreadData **args = (ThreadData**)malloc(sizeof(ThreadData)*THREADS); > for(int i=0;i<THREADS;i++) { > ThreadData *arg = (ThreadData*)malloc(sizeof(ThreadData)); > arg->mem = pmem; > arg->threadId = i; > arg->d = d; > arg->infixLen = infixLen; > arg->newMem = newMem; > arg->len = len; > args[i] = arg; > threads[i] = _beginthreadex(NULL, 0, &infixThreaded, arg, 0, > NULL); > } > WaitForMultipleObjects(THREADS, threads, 1, INFINITE); > for(int i = 0; i<THREADS; i++) { CloseHandle(threads[i]); > free(args[i]); } > free(threads); > } > > printf("done\n"); > > fflush(stdout); > *out = (d*infixLen); > return newMem; > } > > ) > > testinfixThreaded=: 3 : 0 > txt=.y > addr=. mema 4*(#txt) > txt memw addr,0,(#txt) > infixSize=.9 > tcc=:init_tcc_'' > compile_tcc_ < testInfixThreaded_code > ret=:execIntOutInt_tcc_ 'infix';addr;(#txt);infixSize > NB. smoutput ret > 'memPtr size'=: ret > output =: memr memPtr,0,size > execInt_tcc_ 'free';memPtr;0 > smoutput 'done' > free_tcc_'' > ret > NB. output > ) > > On Wed, Feb 4, 2015 at 1:43 PM, Joe Bogner <[email protected]> wrote: >> Raul, >> >> That is an elegant answer and one I'll add to my notes to reference in >> the future. >> >> It takes approximately 50 seconds to process 240 million 9 character >> infixes using 50 million character blocks. >> >> As a comparison, I wrote an integration to TCC which enables >> generating and calling C code from J >> >> https://github.com/joebo/lang-lab/blob/master/j/tcc.ijs >> >> It could use some cleanup and it's not tested on linux or win32 >> >> It includes everything to build the stub dll and then some examples. >> >> Here's an example of creating infixes on a character array: >> >> testInfix_code=: 0 : 0 >> #include <stdio.h> >> >> void free2(long ptr) { >> free(ptr); >> } >> >> long infix(long pmem, long len, int infixLen, long long *out) { >> long long allocLen = len*sizeof(char)*infixLen; >> char *newMem = (char*)malloc(allocLen); >> char *mem = (char*)pmem; >> >> printf("len: %d\n", len); >> fflush(stdout); >> long long offset = 0; >> long long idx = 0; >> >> for(long long i=0;i<len && ((len-offset) >= infixLen);i++) { >> for(long q=0;q<infixLen;q++) { >> newMem[idx++] = mem[q+offset]; >> } >> offset++; >> } >> >> *out = idx; >> return newMem; >> } >> ) >> >> To run this: >> >> infix=: 3 : 0 >> txt=.y >> addr=. mema 4*(#txt) >> txt memw addr,0,(#txt) >> infixSize=.9 >> tcc=:init_tcc_'' >> compile_tcc_ < testInfix_code >> ret=.execIntOutInt_tcc_ 'infix';addr;(#txt);infixSize >> smoutput ret >> 'memPtr size'=: ret >> output =: memr memPtr,0,size >> execInt_tcc_ 'free';memPtr;0 >> free_tcc_'' >> output >> ) >> >> testInfix =: 3 : 0 >> txt=. 'abcdefghijklmnopqrstuvwxyz' >> infix txt >> ) >> >> _9[\ testInfix'' >> len: 26 >> +--------+---+ >> |15900816|162| >> +--------+---+ >> abcdefghi >> bcdefghij >> cdefghijk >> defghijkl >> efghijklm >> fghijklmn >> ghijklmno >> hijklmnop >> ijklmnopq >> jklmnopqr >> klmnopqrs >> lmnopqrst >> mnopqrstu >> nopqrstuv >> opqrstuvw >> pqrstuvwx >> qrstuvwxy >> rstuvwxyz >> >> >> >> I ended up running into problems with sizes greater than 75 million, >> probably due to a long size issue. >> >> Funny enough, it's actually slower than J's version by a factor of 4. >> >> 6!:2 'b1=:_9[\(infix 50e6{. txt)' >> 1.82504 >> 6!:2 'b2=:9[\ 50e6{. txt' >> 0.431845 >> b1-:b2 >> 1 >> >> >> >> This could be a combination of TCC's ability to optimize code or it >> could be J's algorithm is implemented more efficiently than mine >> (probably both) >> >> The actual JIT operation only takes about 5/100 of a second >> >> I'm sure I could profile it and speed it up, but I've already spent >> considerably more time than desired on this "challenge". >> >> On the upside, we now have a prototype for generating C code from J >> >> >> >> On Tue, Feb 3, 2015 at 8:05 PM, Raul Miller <[email protected]> wrote: >>> Here's how I'd implement the overlapped block thing. >>> >>> overlapblock=:2 :0 >>> : >>> 'size overlap init'=. x >>> bs=. size*i.>.(#y)%{.size >>> bl=. bs-~(#y)<.bs+size+overlap >>> accum=. init >>> for_sl. bs,.bl do. >>> accum=. accum u v ((+i.)/sl){y >>> end. >>> ) >>> >>> #(10;9;i.0 0) ~.@, overlapblock (~.@(9&(]\))) 30$'abcdefghij' >>> 10 >>> >>> Of course change the value for x and y to match what you want to do... >>> And maybe recode to swap u and v, if that looks better. >>> >>> If that's too slow, it might be better to gather intermediate results >>> from each block and then post-process them in a single pass? >>> >>> Thanks, >>> >>> -- >>> Raul >>> >>> On Tue, Feb 3, 2015 at 6:24 PM, Joe Bogner <[email protected]> wrote: >>>> On Tue, Feb 3, 2015 at 6:17 PM, Raul Miller <[email protected]> wrote: >>>>> So you are working with non-overlapping infixes? >>>>> >>>>> _9[\'abcdefghijklmnopqrstuvwxyz' >>>>> abcdefghi >>>>> jklmnopqr >>>>> stuvwxyz >>>>> >>>> >>>> >>>> Sorry, no that was a typo. I can use non-overlapping to split up into >>>> blocks, but I need overlapping for the string. >>>> >>>> It is overlapping infixes as I am calculating the unique 9 character >>>> substrings >>>> >>>> 9[\abc >>>> abcdefghi >>>> bcdefghij >>>> cdefghijk >>>> defghijkl >>>> efghijklm >>>> fghijklmn >>>> ghijklmno >>>> hijklmnop >>>> ijklmnopq >>>> jklmnopqr >>>> klmnopqrs >>>> lmnopqrst >>>> mnopqrstu >>>> nopqrstuv >>>> opqrstuvw >>>> pqrstuvwx >>>> qrstuvwxy >>>> rstuvwxyz >>>> >>>> # ~. 9[\abc >>>> 18 >>>> ---------------------------------------------------------------------- >>>> For information about J forums see http://www.jsoftware.com/forums.htm >>> ---------------------------------------------------------------------- >>> For information about J forums see http://www.jsoftware.com/forums.htm > ---------------------------------------------------------------------- > For information about J forums see http://www.jsoftware.com/forums.htm ---------------------------------------------------------------------- For information about J forums see http://www.jsoftware.com/forums.htm
