Re: [Jchat] quickest looping construct

Joe Bogner Fri, 06 Feb 2015 08:35:36 -0800

Raul, thanks for the suggestion. I hadn't considered it. I implemented
my interpretation of it below, however it's fairly slow but efficient
with memory


To recap:

Native J: # ~. 9[\txt,
49 seconds, 12.89GB mem used

JIT C with qsort [1]:
111 seconds, only about 3.3GB used

JIT C with djb2 hash table [2]:
[partial implementation]
55 seconds, 8GB mem used

It's partial implementation because I didn't finish the linear
probing. It found 2,476,989 possible matches, which is incomplete
because it had 47,279 unresolved collisions. The correct answer is
2,482,912 unique values out of 242,193,534. I stopped once I realized
that it wasn't going to catch up to J


Overall Summary:
Multi-threading only saves around 1 seconds on the infix. I kicked
around threading the unique count, but punted after realizing that
each thread would have it's own set of unique values based upon the
block it was processing. I would either need to synchronize access to
a shared data structure or re-calculate the unique set from each
threads unique set. This was more complex than I wanted to deal with
so I moved the unique count to the primary thread.

Making the unique count multi-threaded didn't seem practical either. I
would need a multi-threaded qsort or hash table. Research suggested
that multi-threaded qsort was not viable. A multi-threaded hash gets
back to the same problem of synchronization or find a lock-less
implementation. Either way, it seemed like the limited factor was
memory access speed which wouldn't be improved by multi-core.

I will conclude this research with the following thoughts. I've read
about the benefits and tradeoffs of parallel programming but hadn't
really put it to the test. We've talked about how aspects of J would
benefit from parallel execution. There are some instances where it can
help. Infix had a minor speedup because it's block oriented and
doesn't require any coordination between the blocks. nub does not fit
that criteria. The greater need and application of threading in J may
be concurrency and not parallelism[3].


Appendix:


[1] JIT C with qsort - produces correct result

//hacky hard code infix len
int compare(const void* a, const void* b) {
    return strncmp(a, b, 9);
}

int count_unique_sort(void* newMem, LONG infixLen, LONG rowCt) {
    long uniq = 0;
    qsort(newMem, rowCt, sizeof(char)*infixLen, compare);
    char *pMem = newMem;
    for(LONG i=0;i<rowCt;i++) {
        int match = strncmp(pMem, (pMem+infixLen), infixLen);
        if (match != 0) {
           uniq++;
        }
        pMem+=infixLen;
    }
    return uniq;
}

int uniqCt2 = count_unique_sort(newMem, infixLen, d);
printf("count_unique_sort found %d unique values\n", uniqCt2);


[2] JIT C with djb2 hash table:

Without the collision detection (pure hashing plus infix) is about 30 seconds
The collision detection, which requires storing the pointer to the row
and comparing the current row to the value in the hash table, adds
about 20 seconds.
I would need to add some mechanism to handle the collision (e.g.
linear probing) to deal with it. This would have put me over the J
time.

//http://www.cse.yorku.ca/~oz/hash.html
unsigned long djb2(unsigned char *str, int len) {
    unsigned long hash = 5381;
    for(int i=0;i<len;i++)
        hash = ((hash << 5) + hash) + str[i]; /* hash * 33 + c */

    return hash;
}

int count_unique_hash(void* newMem, LONG infixLen, LONG rowCt) {
    char *pMem = newMem;
    //http://www.jsoftware.com/papers/indexof/indexofscript.htm
    unsigned long tableSize = 2*rowCt;
    char**table = VirtualAlloc(0, tableSize * sizeof(char*),
MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);

    //adds 2 seconds to initialize
    for(LONG i =0; i < tableSize; i++) {
        table[i] = 0;
    }
    LONG matches=0;
    LONG collisions=0;
    for(LONG i = 0; i < rowCt; i++) {
        unsigned long hash = djb2(pMem, infixLen) % tableSize;
        if (table[hash] == 0) {
           table[hash] = pMem;
           matches++;
        }
        else {
           int match = strncmp(pMem, table[hash], infixLen);
           if (match !=0) {
              collisions++;
           }
        }
        //printf("key: %d\n", hash);
        pMem+=infixLen;
    }
    free(table);
    printf("done, found %d possible matches, collisions: %d\n",
matches, collisions);
    return matches;
}


[3] - 
http://stackoverflow.com/questions/1050222/concurrency-vs-parallelism-what-is-the-difference

On Thu, Feb 5, 2015 at 11:13 PM, Raul Miller <[email protected]> wrote:
> Something to benchmark, if you want to go there (since you're working
> with constant a width character array):
>
> sort it, then walk through it, to find the uniques.
>
> (In C I'd be tempted to do this in two passes, once generating a bit
> mask and a count, to find out how much memory I need for the result.
> The second to fill in the result.)
>
> Thanks,
>
> --
> Raul
>
> On Thu, Feb 5, 2015 at 11:08 PM, Joe Bogner <[email protected]> wrote:
>> I decided to take the tcc experiment a few steps further and
>> implemented a multi-threaded version of infix which is JIT compiled
>> into the J session
>>
>> The C code is far from perfect. Single threaded it takes 3.716 seconds
>> to process a 9 character overlapping infix on a 242 million character
>> array.
>>
>> J can still natively complete it in in 2 seconds with 9[\ txt
>>
>> Using 4 threads, it completes in 2.87 seconds (22% improvement over
>> single threaded, 43% slower than J's single threaded with MS optimized
>> memcpy) and uses 100% cpu (good)
>>
>> I learned along the way that malloc will fail with allocations greater
>> than 2gb in the dll loaded into J. Not sure why, but I was able to
>> work around it by using VirtualAlloc
>>
>> I realize this is a deep dive, but figured others may be interested
>> and it seems chat worthy. It reinforces that multiple threads aren't
>> always needed, however it proves that we can do multi-threaded code
>> from within J which may be useful in other applications.
>>
>> The next task is to do a multi-threaded overlapped nub count.
>>
>> Even though J can do a 9 character overlapping prefix in roughly 2
>> seconds, it takes 49 seconds to do a nub of that.
>>
>> $ 9[\txt
>> 242193526 9
>>
>> 6!:2 '# ~. 9[\txt'
>> 49.4751
>>
>> Since the multi-threaded infix runs similarly to J's native, it seems
>> I could write 'special code' to do a nub count over the infixes
>> without saving each one.
>>
>> Raul's block approach takes about the same time ~ 50 seconds to complete.
>>
>> The nub count infix won't be actually used for anything useful other
>> than for learning.
>>
>> Here is the multi-threaded infix in J:
>>
>> testInfixThreaded_code=: 0 : 0
>> #include <stdio.h>
>> #include <windows.h>
>> #include <process.h>
>>
>> #define LONG long long
>>
>> const int THREADS = 4;
>> void free(void* ptr) {
>>  //free wont work since we use VirtualAlloc
>>  //free(ptr);
>>  VirtualFree(ptr, 0, MEM_RELEASE);
>> }
>>
>>
>> typedef struct ThreadData ThreadData;
>> struct ThreadData {
>>        void *mem;
>>        int threadId;
>>        LONG d;
>>        int infixLen;
>>        void *newMem;
>>        LONG len;
>> };
>>
>> int  infixThreaded(void *threadData) {
>>     ThreadData *args = (ThreadData*)threadData;
>>     LONG infixLen = args->infixLen;
>>     LONG blockSize = (args->d+(THREADS-1))/THREADS;
>>     LONG d = args->d;
>>
>>     char *x=(char*)args->newMem;
>>     char *y=args->mem;
>>     LONG ctr = 0;
>>
>>     for(LONG i=0;i<args->threadId*blockSize;i++) {
>>         x+=infixLen;
>>         y+=1;
>>         ctr+=1;
>>     }
>>
>>
>>     for(LONG i=0;i<=blockSize && ctr<=d;i++) {
>>         memcpy(x,y,infixLen);
>>         //printf("start: %d on thread %d, blockSize : %d\n", ctr,
>> args->threadId, blockSize);
>>         x+=infixLen;
>>         y+=1;
>>         ctr++;
>>     }
>>
>>     return 0;
>> }
>>
>>
>> int infix(void* pmem, LONG len, LONG infixLen, LONG *out) {
>>     LONG allocLen = len*sizeof(char)*infixLen;
>>
>>     //malloc fails on larger than 2gb allocations
>>     //char *newMem = (char*)malloc(allocLen);
>>     char *newMem = VirtualAlloc(0, allocLen, MEM_COMMIT | MEM_RESERVE,
>> PAGE_READWRITE);
>>     if (NULL == newMem) {
>>        printf("could not allocate\n");
>>        fflush(stdout);
>>        return -1;
>>     }
>>
>>     char *mem = (char*)pmem;
>>     LONG d = 1+len-infixLen;
>>
>>
>>     //skip threads for small args
>>     if (d < 1000000) {
>>         char *x=newMem;
>>         char *y=mem;
>>         for(int i=0;i<d;i++) {
>>             memcpy(x,y,infixLen);
>>             x+=infixLen;
>>             y+=1;
>>         }
>>     } else {
>>
>>         HANDLE* threads = (HANDLE*)malloc(sizeof(HANDLE)*THREADS);
>>         ThreadData **args = (ThreadData**)malloc(sizeof(ThreadData)*THREADS);
>>         for(int i=0;i<THREADS;i++) {
>>             ThreadData *arg = (ThreadData*)malloc(sizeof(ThreadData));
>>             arg->mem = pmem;
>>             arg->threadId = i;
>>             arg->d = d;
>>             arg->infixLen = infixLen;
>>             arg->newMem = newMem;
>>             arg->len = len;
>>             args[i] = arg;
>>             threads[i] = _beginthreadex(NULL, 0, &infixThreaded, arg, 0, 
>> NULL);
>>         }
>>         WaitForMultipleObjects(THREADS, threads, 1, INFINITE);
>>         for(int i = 0; i<THREADS; i++) { CloseHandle(threads[i]);
>> free(args[i]); }
>>         free(threads);
>>     }
>>
>>     printf("done\n");
>>
>>     fflush(stdout);
>>     *out = (d*infixLen);
>>     return newMem;
>> }
>>
>> )
>>
>> testinfixThreaded=: 3 : 0
>> txt=.y
>> addr=. mema 4*(#txt)
>> txt memw addr,0,(#txt)
>> infixSize=.9
>> tcc=:init_tcc_''
>> compile_tcc_ < testInfixThreaded_code
>> ret=:execIntOutInt_tcc_ 'infix';addr;(#txt);infixSize
>> NB. smoutput ret
>> 'memPtr size'=: ret
>> output =: memr memPtr,0,size
>> execInt_tcc_ 'free';memPtr;0
>> smoutput 'done'
>> free_tcc_''
>> ret
>> NB. output
>> )
>>
>> On Wed, Feb 4, 2015 at 1:43 PM, Joe Bogner <[email protected]> wrote:
>>> Raul,
>>>
>>> That is an elegant answer and one I'll add to my notes to reference in
>>> the future.
>>>
>>> It takes approximately 50 seconds to process 240 million 9 character
>>> infixes using 50 million character blocks.
>>>
>>> As a comparison, I wrote an integration to TCC which enables
>>> generating and calling C code from J
>>>
>>> https://github.com/joebo/lang-lab/blob/master/j/tcc.ijs
>>>
>>> It could use some cleanup and it's not tested on linux or win32
>>>
>>> It includes everything to build the stub dll and then some examples.
>>>
>>> Here's an example of creating infixes on a character array:
>>>
>>> testInfix_code=: 0 : 0
>>> #include <stdio.h>
>>>
>>> void free2(long ptr) {
>>>  free(ptr);
>>> }
>>>
>>> long infix(long pmem, long len, int infixLen, long long *out) {
>>>     long long allocLen = len*sizeof(char)*infixLen;
>>>     char *newMem = (char*)malloc(allocLen);
>>>     char *mem = (char*)pmem;
>>>
>>>     printf("len: %d\n", len);
>>>     fflush(stdout);
>>>     long long offset = 0;
>>>     long long idx = 0;
>>>
>>>     for(long long i=0;i<len && ((len-offset) >= infixLen);i++) {
>>>         for(long q=0;q<infixLen;q++) {
>>>             newMem[idx++] = mem[q+offset];
>>>         }
>>>         offset++;
>>>     }
>>>
>>>     *out = idx;
>>>     return newMem;
>>> }
>>> )
>>>
>>> To run this:
>>>
>>> infix=: 3 : 0
>>> txt=.y
>>> addr=. mema 4*(#txt)
>>> txt memw addr,0,(#txt)
>>> infixSize=.9
>>> tcc=:init_tcc_''
>>> compile_tcc_ < testInfix_code
>>> ret=.execIntOutInt_tcc_ 'infix';addr;(#txt);infixSize
>>> smoutput ret
>>> 'memPtr size'=: ret
>>> output =: memr memPtr,0,size
>>> execInt_tcc_ 'free';memPtr;0
>>> free_tcc_''
>>> output
>>> )
>>>
>>> testInfix =: 3 : 0
>>> txt=. 'abcdefghijklmnopqrstuvwxyz'
>>> infix txt
>>> )
>>>
>>>   _9[\ testInfix''
>>> len: 26
>>> +--------+---+
>>> |15900816|162|
>>> +--------+---+
>>> abcdefghi
>>> bcdefghij
>>> cdefghijk
>>> defghijkl
>>> efghijklm
>>> fghijklmn
>>> ghijklmno
>>> hijklmnop
>>> ijklmnopq
>>> jklmnopqr
>>> klmnopqrs
>>> lmnopqrst
>>> mnopqrstu
>>> nopqrstuv
>>> opqrstuvw
>>> pqrstuvwx
>>> qrstuvwxy
>>> rstuvwxyz
>>>
>>>
>>>
>>> I ended up running into problems with sizes greater than 75 million,
>>> probably due to a long size issue.
>>>
>>> Funny enough, it's actually slower than J's version by a factor of 4.
>>>
>>>    6!:2 'b1=:_9[\(infix 50e6{. txt)'
>>> 1.82504
>>>    6!:2 'b2=:9[\ 50e6{. txt'
>>> 0.431845
>>>    b1-:b2
>>> 1
>>>
>>>
>>>
>>> This could be a combination of TCC's ability to optimize code or it
>>> could be J's algorithm is implemented more efficiently than mine
>>> (probably both)
>>>
>>> The actual JIT operation only takes about 5/100 of a second
>>>
>>> I'm sure I could profile it and speed it up, but I've already spent
>>> considerably more time than desired on this "challenge".
>>>
>>> On the upside, we now have a prototype for generating C code from J
>>>
>>>
>>>
>>> On Tue, Feb 3, 2015 at 8:05 PM, Raul Miller <[email protected]> wrote:
>>>> Here's how I'd implement the overlapped block thing.
>>>>
>>>> overlapblock=:2 :0
>>>> :
>>>>   'size overlap init'=. x
>>>>   bs=. size*i.>.(#y)%{.size
>>>>   bl=. bs-~(#y)<.bs+size+overlap
>>>>   accum=. init
>>>>   for_sl. bs,.bl do.
>>>>     accum=. accum u v ((+i.)/sl){y
>>>>   end.
>>>> )
>>>>
>>>>    #(10;9;i.0 0) ~.@, overlapblock (~.@(9&(]\))) 30$'abcdefghij'
>>>> 10
>>>>
>>>> Of course change the value for x and y to match what you want to do...
>>>> And maybe recode to swap u and v, if that looks better.
>>>>
>>>> If that's too slow, it might be better to gather intermediate results
>>>> from each block and then post-process them in a single pass?
>>>>
>>>> Thanks,
>>>>
>>>> --
>>>> Raul
>>>>
>>>> On Tue, Feb 3, 2015 at 6:24 PM, Joe Bogner <[email protected]> wrote:
>>>>> On Tue, Feb 3, 2015 at 6:17 PM, Raul Miller <[email protected]> wrote:
>>>>>> So you are working with non-overlapping infixes?
>>>>>>
>>>>>>    _9[\'abcdefghijklmnopqrstuvwxyz'
>>>>>> abcdefghi
>>>>>> jklmnopqr
>>>>>> stuvwxyz
>>>>>>
>>>>>
>>>>>
>>>>> Sorry, no that was a typo. I can use non-overlapping to split up into
>>>>> blocks, but I need overlapping for the string.
>>>>>
>>>>> It is overlapping infixes as I am calculating the unique 9 character 
>>>>> substrings
>>>>>
>>>>> 9[\abc
>>>>> abcdefghi
>>>>> bcdefghij
>>>>> cdefghijk
>>>>> defghijkl
>>>>> efghijklm
>>>>> fghijklmn
>>>>> ghijklmno
>>>>> hijklmnop
>>>>> ijklmnopq
>>>>> jklmnopqr
>>>>> klmnopqrs
>>>>> lmnopqrst
>>>>> mnopqrstu
>>>>> nopqrstuv
>>>>> opqrstuvw
>>>>> pqrstuvwx
>>>>> qrstuvwxy
>>>>> rstuvwxyz
>>>>>
>>>>> # ~. 9[\abc
>>>>> 18
>>>>> ----------------------------------------------------------------------
>>>>> For information about J forums see http://www.jsoftware.com/forums.htm
>>>> ----------------------------------------------------------------------
>>>> For information about J forums see http://www.jsoftware.com/forums.htm
>> ----------------------------------------------------------------------
>> For information about J forums see http://www.jsoftware.com/forums.htm
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Re: [Jchat] quickest looping construct

Reply via email to