I chased down the performance difference between my infix implementation and J's infix.

As I figured, J's algorithm was more efficient. I implemented the same
algorithm and cut the time by nearly a factor of 2, to 0.78 seconds.
That's still behind J's 0.43 seconds. Inspecting the assembly shows that
TCC executes similar, though somewhat different, instructions. I also
suspect that Visual C's memcpy is more optimized than the TCC version
(see below).

In any case, this serves as a reminder of why interpreted languages
can be faster than compiled ones. J's algorithm was written more
efficiently than mine:

J (effectively):

    char *x=newMem;
    char *y=mem;
    /* d = number of infixes = len - infixLen + 1 */
    for(int i=0;i<d;i++) {
        memcpy(x,y,infixLen);  /* copy a whole infix in one call */
        x+=infixLen;
        y+=1;                  /* slide the window forward by one byte */
    }

My attempt:

    long offset = 0;
    long idx = 0;

    while(1) {
        /* copy one infix byte by byte */
        for(int q=0;q<infixLen;q++) {
            newMem[idx++] = mem[q+offset];
        }
        offset++;
        if ((len-offset) < infixLen) {  /* stop when a full infix no longer fits */
            break;
        }
    }

Mine has an inner byte-by-byte loop that can be switched to memcpy,
which can do smarter things than a plain loop because it knows the size
ahead of time.
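
For reference, here is a minimal, self-contained sketch of the
memcpy-based rewrite. The names mem, newMem, len and infixLen match the
snippets above; copy_infixes and the allocation/error handling are just
illustrative:

    #include <stdlib.h>
    #include <string.h>

    /* Copy every overlapping infix of length infixLen from mem (length len)
       into a freshly allocated buffer, one infix after another.
       Returns the buffer (caller frees); *count receives the infix count. */
    static char *copy_infixes(const char *mem, long long len,
                              long long infixLen, long long *count)
    {
        long long d = len - infixLen + 1;        /* number of infixes */
        if (d < 1) { *count = 0; return NULL; }

        char *newMem = malloc((size_t)(d * infixLen));
        if (!newMem) { *count = 0; return NULL; }

        char *x = newMem;
        const char *y = mem;
        for (long long i = 0; i < d; i++) {
            memcpy(x, y, (size_t)infixLen);      /* one block copy per infix */
            x += infixLen;
            y += 1;                              /* slide the window by one */
        }
        *count = d;
        return newMem;
    }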

It just so happens that msvcrt's memcpy has special code for moves of
different sizes:

In the case of copying 9 bytes, it looks like it can move eight of them
in a single instruction (plus one byte on its own), instead of copying
each individual byte the way the loop would:

MoveSmall9:
000007FEC5F53EC8  movzx       rax,byte ptr [rdx]
000007FEC5F53ECC  mov         rcx,qword ptr [rdx+1]
000007FEC5F53ED0  mov         byte ptr [r10],al
000007FEC5F53ED3  mov         qword ptr [r10+1],rcx
000007FEC5F53ED7  mov         rax,r11
000007FEC5F53EDA  ret


Compared to:

MoveSmall8:
000007FEC5F53EBE  mov         rax,qword ptr [rdx]
000007FEC5F53EC1  mov         qword ptr [r10],rax
000007FEC5F53EC4  mov         rax,r11
000007FEC5F53EC7  ret

Or

MoveSmall10:
000007FEC5F53EDB  movzx       rax,word ptr [rdx]
000007FEC5F53EDF  mov         rcx,qword ptr [rdx+2]
000007FEC5F53EE3  mov         word ptr [r10],ax
000007FEC5F53EE7  mov         qword ptr [r10+2],rcx
000007FEC5F53EEB  mov         rax,r11
000007FEC5F53EEE  ret
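
If I read that right, the 9-byte path is just an overlapping pair of
moves: one byte plus one 8-byte word. A rough C sketch of the same idea
(assuming unaligned 8-byte accesses are cheap, as on x86-64; an
optimizing compiler reduces these fixed-size memcpy calls to plain loads
and stores, though TCC probably won't):

    #include <stdint.h>
    #include <string.h>

    /* Copy exactly 9 bytes as 1 + 8, mirroring what MoveSmall9 does. */
    static void copy9(void *dst, const void *src)
    {
        uint8_t  b;
        uint64_t q;
        memcpy(&b, src, 1);                       /* byte at [src]    */
        memcpy(&q, (const uint8_t *)src + 1, 8);  /* qword at [src+1] */
        memcpy(dst, &b, 1);
        memcpy((uint8_t *)dst + 1, &q, 8);
    }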


So, as I suspected: a smarter algorithm and a better C implementation.
Having C code generation still opens up many doors, but it definitely
should be used sparingly, since J's algorithms are finely tuned and
compiled with an optimized library and compiler.

On Wed, Feb 4, 2015 at 3:17 PM, Raul Miller <[email protected]> wrote:
> Probably should delete the {. from the line declaring the bs value.
>
> That's an [hopefully] irrelevant artifact from an earlier version of the code.
>
> Thanks,
>
> --
> Raul
>
> On Wed, Feb 4, 2015 at 1:43 PM, Joe Bogner <[email protected]> wrote:
>> Raul,
>>
>> That is an elegant answer and one I'll add to my notes to reference in
>> the future.
>>
>> It takes approximately 50 seconds to process 240 million 9 character
>> infixes using 50 million character blocks.
>>
>> As a comparison, I wrote an integration to TCC which enables
>> generating and calling C code from J
>>
>> https://github.com/joebo/lang-lab/blob/master/j/tcc.ijs
>>
>> It could use some cleanup and it's not tested on linux or win32
>>
>> It includes everything to build the stub dll and then some examples.
>>
>> Here's an example of creating infixes on a character array:
>>
>> testInfix_code=: 0 : 0
>> #include <stdio.h>
>> #include <stdlib.h>
>>
>> void free2(long ptr) {
>>  free((void*)ptr);
>> }
>>
>> long infix(long pmem, long len, int infixLen, long long *out) {
>>     long long allocLen = len*sizeof(char)*infixLen;
>>     char *newMem = (char*)malloc(allocLen);
>>     char *mem = (char*)pmem;
>>
>>     printf("len: %ld\n", len);
>>     fflush(stdout);
>>     long long offset = 0;
>>     long long idx = 0;
>>
>>     for(long long i=0;i<len && ((len-offset) >= infixLen);i++) {
>>         for(long q=0;q<infixLen;q++) {
>>             newMem[idx++] = mem[q+offset];
>>         }
>>         offset++;
>>     }
>>
>>     *out = idx;
>>     return newMem;
>> }
>> )
>>
>> To run this:
>>
>> infix=: 3 : 0
>> txt=.y
>> addr=. mema 4*(#txt)
>> txt memw addr,0,(#txt)
>> infixSize=.9
>> tcc=:init_tcc_''
>> compile_tcc_ < testInfix_code
>> ret=.execIntOutInt_tcc_ 'infix';addr;(#txt);infixSize
>> smoutput ret
>> 'memPtr size'=: ret
>> output =: memr memPtr,0,size
>> execInt_tcc_ 'free';memPtr;0
>> free_tcc_''
>> output
>> )
>>
>> testInfix =: 3 : 0
>> txt=. 'abcdefghijklmnopqrstuvwxyz'
>> infix txt
>> )
>>
>>   _9[\ testInfix''
>> len: 26
>> +--------+---+
>> |15900816|162|
>> +--------+---+
>> abcdefghi
>> bcdefghij
>> cdefghijk
>> defghijkl
>> efghijklm
>> fghijklmn
>> ghijklmno
>> hijklmnop
>> ijklmnopq
>> jklmnopqr
>> klmnopqrs
>> lmnopqrst
>> mnopqrstu
>> nopqrstuv
>> opqrstuvw
>> pqrstuvwx
>> qrstuvwxy
>> rstuvwxyz
>>
>>
>>
>> I ended up running into problems with sizes greater than 75 million,
>> probably due to a long size issue.
>>
>> Funny enough, it's actually slower than J's version by a factor of 4.
>>
>>    6!:2 'b1=:_9[\(infix 50e6{. txt)'
>> 1.82504
>>    6!:2 'b2=:9[\ 50e6{. txt'
>> 0.431845
>>    b1-:b2
>> 1
>>
>>
>>
>> This could be due to TCC's limited ability to optimize code, or it
>> could be that J's algorithm is implemented more efficiently than mine
>> (probably both).
>>
>> The actual JIT operation only takes about 5/100 of a second
>>
>> I'm sure I could profile it and speed it up, but I've already spent
>> considerably more time than desired on this "challenge".
>>
>> On the upside, we now have a prototype for generating C code from J
>>
>>
>>
>> On Tue, Feb 3, 2015 at 8:05 PM, Raul Miller <[email protected]> wrote:
>>> Here's how I'd implement the overlapped block thing.
>>>
>>> overlapblock=:2 :0
>>> :
>>>   'size overlap init'=. x
>>>   bs=. size*i.>.(#y)%{.size
>>>   bl=. bs-~(#y)<.bs+size+overlap
>>>   accum=. init
>>>   for_sl. bs,.bl do.
>>>     accum=. accum u v ((+i.)/sl){y
>>>   end.
>>> )
>>>
>>>    #(10;9;i.0 0) ~.@, overlapblock (~.@(9&(]\))) 30$'abcdefghij'
>>> 10
>>>
>>> Of course change the value for x and y to match what you want to do...
>>> And maybe recode to swap u and v, if that looks better.
>>>
>>> If that's too slow, it might be better to gather intermediate results
>>> from each block and then post-process them in a single pass?
>>>
>>> Thanks,
>>>
>>> --
>>> Raul
>>>
>>> On Tue, Feb 3, 2015 at 6:24 PM, Joe Bogner <[email protected]> wrote:
>>>> On Tue, Feb 3, 2015 at 6:17 PM, Raul Miller <[email protected]> wrote:
>>>>> So you are working with non-overlapping infixes?
>>>>>
>>>>>    _9[\'abcdefghijklmnopqrstuvwxyz'
>>>>> abcdefghi
>>>>> jklmnopqr
>>>>> stuvwxyz
>>>>>
>>>>
>>>>
>>>> Sorry, no that was a typo. I can use non-overlapping to split up into
>>>> blocks, but I need overlapping for the string.
>>>>
>>>> It is overlapping infixes as I am calculating the unique 9 character 
>>>> substrings
>>>>
>>>> 9[\abc
>>>> abcdefghi
>>>> bcdefghij
>>>> cdefghijk
>>>> defghijkl
>>>> efghijklm
>>>> fghijklmn
>>>> ghijklmno
>>>> hijklmnop
>>>> ijklmnopq
>>>> jklmnopqr
>>>> klmnopqrs
>>>> lmnopqrst
>>>> mnopqrstu
>>>> nopqrstuv
>>>> opqrstuvw
>>>> pqrstuvwx
>>>> qrstuvwxy
>>>> rstuvwxyz
>>>>
>>>> # ~. 9[\abc
>>>> 18