Re: [Jchat] quickest looping construct

Raul Miller Wed, 04 Feb 2015 18:14:18 -0800

Yes... as is so often the case, the primary benefit of the exercise is
an increased understanding.


Speaking of which, what are you using to get the assembly
representations of the memcpy internals?

Thanks,

-- 
Raul


On Wed, Feb 4, 2015 at 3:54 PM, Joe Bogner <[email protected]> wrote:
> I chased down the difference in the performance in my infix and J's infix
>
> As I figured, J's algorithm was more efficient. I implemented the same
> algorithm and cut the time by nearly a factor of 2 to 0.78 seconds.
> Still behind J's 0.43 seconds. Inspecting the assembly shows that TCC
> is executing similar instructions but somewhat different. Also, I
> might assume that Visual C's memcpy is more optimized than the TCC
> version (see below)
>
> In any case, this serves as a reminder as to why interpreted languages
> can be faster than compiled ones. J's algorithm was written more
> efficiently than mine:
>
> J (effectively)
>
>     char *x=newMem;
>     char *y=mem;
>     for(int i=0;i<d;i++) {
>         memcpy(x,y,infixLen);
>         x+=infixLen;
>         y+=1;
>     }
>
> My attempt:
>
>     long offset = 0;
>     long idx = 0;
>
>     while(1) {
>         for(int q=0;q<infixLen;q++) {
>             newMem[idx++] = mem[q+offset];
>         }
>         offset++;
>         if ((len-offset) <= infixLen) {
>            break;
>         }
>     }
>
> Mine has an extra loop that can switched to memcpy which can do
> smarter things than a loop knowing the size ahead of time.
>
> It just so happens that msvcrt's memcpy has special code for moves of
> different sizes:
>
> In the case of copying 9 bytes
>
> It looks like it can move 8 bytes in only a few instructions instead
> of each individual byte that the loop would do
>
> MoveSmall9:
> 000007FEC5F53EC8  movzx       rax,byte ptr [rdx]
> 000007FEC5F53ECC  mov         rcx,qword ptr [rdx+1]
> 000007FEC5F53ED0  mov         byte ptr [r10],al
> 000007FEC5F53ED3  mov         qword ptr [r10+1],rcx
> 000007FEC5F53ED7  mov         rax,r11
> 000007FEC5F53EDA  ret
>
>
> Compared to:
>
> MoveSmall8:
> 000007FEC5F53EBE  mov         rax,qword ptr [rdx]
> 000007FEC5F53EC1  mov         qword ptr [r10],rax
> 000007FEC5F53EC4  mov         rax,r11
> 000007FEC5F53EC7  ret
>
> Or
>
> MoveSmall10:
> 000007FEC5F53EDB  movzx       rax,word ptr [rdx]
> 000007FEC5F53EDF  mov         rcx,qword ptr [rdx+2]
> 000007FEC5F53EE3  mov         word ptr [r10],ax
> 000007FEC5F53EE7  mov         qword ptr [r10+2],rcx
> 000007FEC5F53EEB  mov         rax,r11
> 000007FEC5F53EEE  ret
>
>
> So as I suspected, a smarter algorithm and a better C implementation.
> Having C code generation still opens up many doors but definitely
> should be used sparingly since J's algorithms are finely tuned and are
> compiled with an optimized library / compiler
>
>
>
>
>
>
> On Wed, Feb 4, 2015 at 3:17 PM, Raul Miller <[email protected]> wrote:
>> Probably should delete the {. from the line declaring the bs value.
>>
>> That's an [hopefully] irrelevant artifact from an earlier version of the 
>> code.
>>
>> Thanks,
>>
>> --
>> Raul
>>
>> On Wed, Feb 4, 2015 at 1:43 PM, Joe Bogner <[email protected]> wrote:
>>> Raul,
>>>
>>> That is an elegant answer and one I'll add to my notes to reference in
>>> the future.
>>>
>>> It takes approximately 50 seconds to process 240 million 9 character
>>> infixes using 50 million character blocks.
>>>
>>> As a comparison, I wrote an integration to TCC which enables
>>> generating and calling C code from J
>>>
>>> https://github.com/joebo/lang-lab/blob/master/j/tcc.ijs
>>>
>>> It could use some cleanup and it's not tested on linux or win32
>>>
>>> It includes everything to build the stub dll and then some examples.
>>>
>>> Here's an example of creating infixes on a character array:
>>>
>>> testInfix_code=: 0 : 0
>>> #include <stdio.h>
>>>
>>> void free2(long ptr) {
>>>  free(ptr);
>>> }
>>>
>>> long infix(long pmem, long len, int infixLen, long long *out) {
>>>     long long allocLen = len*sizeof(char)*infixLen;
>>>     char *newMem = (char*)malloc(allocLen);
>>>     char *mem = (char*)pmem;
>>>
>>>     printf("len: %d\n", len);
>>>     fflush(stdout);
>>>     long long offset = 0;
>>>     long long idx = 0;
>>>
>>>     for(long long i=0;i<len && ((len-offset) >= infixLen);i++) {
>>>         for(long q=0;q<infixLen;q++) {
>>>             newMem[idx++] = mem[q+offset];
>>>         }
>>>         offset++;
>>>     }
>>>
>>>     *out = idx;
>>>     return newMem;
>>> }
>>> )
>>>
>>> To run this:
>>>
>>> infix=: 3 : 0
>>> txt=.y
>>> addr=. mema 4*(#txt)
>>> txt memw addr,0,(#txt)
>>> infixSize=.9
>>> tcc=:init_tcc_''
>>> compile_tcc_ < testInfix_code
>>> ret=.execIntOutInt_tcc_ 'infix';addr;(#txt);infixSize
>>> smoutput ret
>>> 'memPtr size'=: ret
>>> output =: memr memPtr,0,size
>>> execInt_tcc_ 'free';memPtr;0
>>> free_tcc_''
>>> output
>>> )
>>>
>>> testInfix =: 3 : 0
>>> txt=. 'abcdefghijklmnopqrstuvwxyz'
>>> infix txt
>>> )
>>>
>>>   _9[\ testInfix''
>>> len: 26
>>> +--------+---+
>>> |15900816|162|
>>> +--------+---+
>>> abcdefghi
>>> bcdefghij
>>> cdefghijk
>>> defghijkl
>>> efghijklm
>>> fghijklmn
>>> ghijklmno
>>> hijklmnop
>>> ijklmnopq
>>> jklmnopqr
>>> klmnopqrs
>>> lmnopqrst
>>> mnopqrstu
>>> nopqrstuv
>>> opqrstuvw
>>> pqrstuvwx
>>> qrstuvwxy
>>> rstuvwxyz
>>>
>>>
>>>
>>> I ended up running into problems with sizes greater than 75 million,
>>> probably due to a long size issue.
>>>
>>> Funny enough, it's actually slower than J's version by a factor of 4.
>>>
>>>    6!:2 'b1=:_9[\(infix 50e6{. txt)'
>>> 1.82504
>>>    6!:2 'b2=:9[\ 50e6{. txt'
>>> 0.431845
>>>    b1-:b2
>>> 1
>>>
>>>
>>>
>>> This could be a combination of TCC's ability to optimize code or it
>>> could be J's algorithm is implemented more efficiently than mine
>>> (probably both)
>>>
>>> The actual JIT operation only takes about 5/100 of a second
>>>
>>> I'm sure I could profile it and speed it up, but I've already spent
>>> considerably more time than desired on this "challenge".
>>>
>>> On the upside, we now have a prototype for generating C code from J
>>>
>>>
>>>
>>> On Tue, Feb 3, 2015 at 8:05 PM, Raul Miller <[email protected]> wrote:
>>>> Here's how I'd implement the overlapped block thing.
>>>>
>>>> overlapblock=:2 :0
>>>> :
>>>>   'size overlap init'=. x
>>>>   bs=. size*i.>.(#y)%{.size
>>>>   bl=. bs-~(#y)<.bs+size+overlap
>>>>   accum=. init
>>>>   for_sl. bs,.bl do.
>>>>     accum=. accum u v ((+i.)/sl){y
>>>>   end.
>>>> )
>>>>
>>>>    #(10;9;i.0 0) ~.@, overlapblock (~.@(9&(]\))) 30$'abcdefghij'
>>>> 10
>>>>
>>>> Of course change the value for x and y to match what you want to do...
>>>> And maybe recode to swap u and v, if that looks better.
>>>>
>>>> If that's too slow, it might be better to gather intermediate results
>>>> from each block and then post-process them in a single pass?
>>>>
>>>> Thanks,
>>>>
>>>> --
>>>> Raul
>>>>
>>>> On Tue, Feb 3, 2015 at 6:24 PM, Joe Bogner <[email protected]> wrote:
>>>>> On Tue, Feb 3, 2015 at 6:17 PM, Raul Miller <[email protected]> wrote:
>>>>>> So you are working with non-overlapping infixes?
>>>>>>
>>>>>>    _9[\'abcdefghijklmnopqrstuvwxyz'
>>>>>> abcdefghi
>>>>>> jklmnopqr
>>>>>> stuvwxyz
>>>>>>
>>>>>
>>>>>
>>>>> Sorry, no that was a typo. I can use non-overlapping to split up into
>>>>> blocks, but I need overlapping for the string.
>>>>>
>>>>> It is overlapping infixes as I am calculating the unique 9 character 
>>>>> substrings
>>>>>
>>>>> 9[\abc
>>>>> abcdefghi
>>>>> bcdefghij
>>>>> cdefghijk
>>>>> defghijkl
>>>>> efghijklm
>>>>> fghijklmn
>>>>> ghijklmno
>>>>> hijklmnop
>>>>> ijklmnopq
>>>>> jklmnopqr
>>>>> klmnopqrs
>>>>> lmnopqrst
>>>>> mnopqrstu
>>>>> nopqrstuv
>>>>> opqrstuvw
>>>>> pqrstuvwx
>>>>> qrstuvwxy
>>>>> rstuvwxyz
>>>>>
>>>>> # ~. 9[\abc
>>>>> 18
>>>>> ----------------------------------------------------------------------
>>>>> For information about J forums see http://www.jsoftware.com/forums.htm
>>>> ----------------------------------------------------------------------
>>>> For information about J forums see http://www.jsoftware.com/forums.htm
>>> ----------------------------------------------------------------------
>>> For information about J forums see http://www.jsoftware.com/forums.htm
>> ----------------------------------------------------------------------
>> For information about J forums see http://www.jsoftware.com/forums.htm
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Re: [Jchat] quickest looping construct

Reply via email to