Yes... as is so often the case, the primary benefit of the exercise is
an increased understanding.

Speaking of which, what are you using to get the assembly
representations of the memcpy internals?

Thanks,

-- 
Raul


On Wed, Feb 4, 2015 at 3:54 PM, Joe Bogner <[email protected]> wrote:
> I chased down the difference in the performance in my infix and J's infix
>
> As I figured, J's algorithm was more efficient. I implemented the same
> algorithm and cut the time by nearly a factor of 2 to 0.78 seconds.
> Still behind J's 0.43 seconds. Inspecting the assembly shows that TCC
> is executing similar instructions but somewhat different. Also, I
> might assume that Visual C's memcpy is more optimized than the TCC
> version (see below)
>
> In any case, this serves as a reminder as to why interpreted languages
> can be faster than compiled ones. J's algorithm was written more
> efficiently than mine:
>
> J (effectively)
>
>     char *x=newMem;
>     char *y=mem;
>     for(int i=0;i<d;i++) {
>         memcpy(x,y,infixLen);
>         x+=infixLen;
>         y+=1;
>     }
>
> My attempt:
>
>     long offset = 0;
>     long idx = 0;
>
>     while(1) {
>         for(int q=0;q<infixLen;q++) {
>             newMem[idx++] = mem[q+offset];
>         }
>         offset++;
>         if ((len-offset) <= infixLen) {
>            break;
>         }
>     }
>
> Mine has an extra loop that can switched to memcpy which can do
> smarter things than a loop knowing the size ahead of time.
>
> It just so happens that msvcrt's memcpy has special code for moves of
> different sizes:
>
> In the case of copying 9 bytes
>
> It looks like it can move 8 bytes in only a few instructions instead
> of each individual byte that the loop would do
>
> MoveSmall9:
> 000007FEC5F53EC8  movzx       rax,byte ptr [rdx]
> 000007FEC5F53ECC  mov         rcx,qword ptr [rdx+1]
> 000007FEC5F53ED0  mov         byte ptr [r10],al
> 000007FEC5F53ED3  mov         qword ptr [r10+1],rcx
> 000007FEC5F53ED7  mov         rax,r11
> 000007FEC5F53EDA  ret
>
>
> Compared to:
>
> MoveSmall8:
> 000007FEC5F53EBE  mov         rax,qword ptr [rdx]
> 000007FEC5F53EC1  mov         qword ptr [r10],rax
> 000007FEC5F53EC4  mov         rax,r11
> 000007FEC5F53EC7  ret
>
> Or
>
> MoveSmall10:
> 000007FEC5F53EDB  movzx       rax,word ptr [rdx]
> 000007FEC5F53EDF  mov         rcx,qword ptr [rdx+2]
> 000007FEC5F53EE3  mov         word ptr [r10],ax
> 000007FEC5F53EE7  mov         qword ptr [r10+2],rcx
> 000007FEC5F53EEB  mov         rax,r11
> 000007FEC5F53EEE  ret
>
>
> So as I suspected, a smarter algorithm and a better C implementation.
> Having C code generation still opens up many doors but definitely
> should be used sparingly since J's algorithms are finely tuned and are
> compiled with an optimized library / compiler
>
>
>
>
>
>
> On Wed, Feb 4, 2015 at 3:17 PM, Raul Miller <[email protected]> wrote:
>> Probably should delete the {. from the line declaring the bs value.
>>
>> That's an [hopefully] irrelevant artifact from an earlier version of the 
>> code.
>>
>> Thanks,
>>
>> --
>> Raul
>>
>> On Wed, Feb 4, 2015 at 1:43 PM, Joe Bogner <[email protected]> wrote:
>>> Raul,
>>>
>>> That is an elegant answer and one I'll add to my notes to reference in
>>> the future.
>>>
>>> It takes approximately 50 seconds to process 240 million 9 character
>>> infixes using 50 million character blocks.
>>>
>>> As a comparison, I wrote an integration to TCC which enables
>>> generating and calling C code from J
>>>
>>> https://github.com/joebo/lang-lab/blob/master/j/tcc.ijs
>>>
>>> It could use some cleanup and it's not tested on linux or win32
>>>
>>> It includes everything to build the stub dll and then some examples.
>>>
>>> Here's an example of creating infixes on a character array:
>>>
>>> testInfix_code=: 0 : 0
>>> #include <stdio.h>
>>>
>>> void free2(long ptr) {
>>>  free(ptr);
>>> }
>>>
>>> long infix(long pmem, long len, int infixLen, long long *out) {
>>>     long long allocLen = len*sizeof(char)*infixLen;
>>>     char *newMem = (char*)malloc(allocLen);
>>>     char *mem = (char*)pmem;
>>>
>>>     printf("len: %d\n", len);
>>>     fflush(stdout);
>>>     long long offset = 0;
>>>     long long idx = 0;
>>>
>>>     for(long long i=0;i<len && ((len-offset) >= infixLen);i++) {
>>>         for(long q=0;q<infixLen;q++) {
>>>             newMem[idx++] = mem[q+offset];
>>>         }
>>>         offset++;
>>>     }
>>>
>>>     *out = idx;
>>>     return newMem;
>>> }
>>> )
>>>
>>> To run this:
>>>
>>> infix=: 3 : 0
>>> txt=.y
>>> addr=. mema 4*(#txt)
>>> txt memw addr,0,(#txt)
>>> infixSize=.9
>>> tcc=:init_tcc_''
>>> compile_tcc_ < testInfix_code
>>> ret=.execIntOutInt_tcc_ 'infix';addr;(#txt);infixSize
>>> smoutput ret
>>> 'memPtr size'=: ret
>>> output =: memr memPtr,0,size
>>> execInt_tcc_ 'free';memPtr;0
>>> free_tcc_''
>>> output
>>> )
>>>
>>> testInfix =: 3 : 0
>>> txt=. 'abcdefghijklmnopqrstuvwxyz'
>>> infix txt
>>> )
>>>
>>>   _9[\ testInfix''
>>> len: 26
>>> +--------+---+
>>> |15900816|162|
>>> +--------+---+
>>> abcdefghi
>>> bcdefghij
>>> cdefghijk
>>> defghijkl
>>> efghijklm
>>> fghijklmn
>>> ghijklmno
>>> hijklmnop
>>> ijklmnopq
>>> jklmnopqr
>>> klmnopqrs
>>> lmnopqrst
>>> mnopqrstu
>>> nopqrstuv
>>> opqrstuvw
>>> pqrstuvwx
>>> qrstuvwxy
>>> rstuvwxyz
>>>
>>>
>>>
>>> I ended up running into problems with sizes greater than 75 million,
>>> probably due to a long size issue.
>>>
>>> Funny enough, it's actually slower than J's version by a factor of 4.
>>>
>>>    6!:2 'b1=:_9[\(infix 50e6{. txt)'
>>> 1.82504
>>>    6!:2 'b2=:9[\ 50e6{. txt'
>>> 0.431845
>>>    b1-:b2
>>> 1
>>>
>>>
>>>
>>> This could be a combination of TCC's ability to optimize code or it
>>> could be J's algorithm is implemented more efficiently than mine
>>> (probably both)
>>>
>>> The actual JIT operation only takes about 5/100 of a second
>>>
>>> I'm sure I could profile it and speed it up, but I've already spent
>>> considerably more time than desired on this "challenge".
>>>
>>> On the upside, we now have a prototype for generating C code from J
>>>
>>>
>>>
>>> On Tue, Feb 3, 2015 at 8:05 PM, Raul Miller <[email protected]> wrote:
>>>> Here's how I'd implement the overlapped block thing.
>>>>
>>>> overlapblock=:2 :0
>>>> :
>>>>   'size overlap init'=. x
>>>>   bs=. size*i.>.(#y)%{.size
>>>>   bl=. bs-~(#y)<.bs+size+overlap
>>>>   accum=. init
>>>>   for_sl. bs,.bl do.
>>>>     accum=. accum u v ((+i.)/sl){y
>>>>   end.
>>>> )
>>>>
>>>>    #(10;9;i.0 0) ~.@, overlapblock (~.@(9&(]\))) 30$'abcdefghij'
>>>> 10
>>>>
>>>> Of course change the value for x and y to match what you want to do...
>>>> And maybe recode to swap u and v, if that looks better.
>>>>
>>>> If that's too slow, it might be better to gather intermediate results
>>>> from each block and then post-process them in a single pass?
>>>>
>>>> Thanks,
>>>>
>>>> --
>>>> Raul
>>>>
>>>> On Tue, Feb 3, 2015 at 6:24 PM, Joe Bogner <[email protected]> wrote:
>>>>> On Tue, Feb 3, 2015 at 6:17 PM, Raul Miller <[email protected]> wrote:
>>>>>> So you are working with non-overlapping infixes?
>>>>>>
>>>>>>    _9[\'abcdefghijklmnopqrstuvwxyz'
>>>>>> abcdefghi
>>>>>> jklmnopqr
>>>>>> stuvwxyz
>>>>>>
>>>>>
>>>>>
>>>>> Sorry, no that was a typo. I can use non-overlapping to split up into
>>>>> blocks, but I need overlapping for the string.
>>>>>
>>>>> It is overlapping infixes as I am calculating the unique 9 character 
>>>>> substrings
>>>>>
>>>>> 9[\abc
>>>>> abcdefghi
>>>>> bcdefghij
>>>>> cdefghijk
>>>>> defghijkl
>>>>> efghijklm
>>>>> fghijklmn
>>>>> ghijklmno
>>>>> hijklmnop
>>>>> ijklmnopq
>>>>> jklmnopqr
>>>>> klmnopqrs
>>>>> lmnopqrst
>>>>> mnopqrstu
>>>>> nopqrstuv
>>>>> opqrstuvw
>>>>> pqrstuvwx
>>>>> qrstuvwxy
>>>>> rstuvwxyz
>>>>>
>>>>> # ~. 9[\abc
>>>>> 18
>>>>> ----------------------------------------------------------------------
>>>>> For information about J forums see http://www.jsoftware.com/forums.htm
>>>> ----------------------------------------------------------------------
>>>> For information about J forums see http://www.jsoftware.com/forums.htm
>>> ----------------------------------------------------------------------
>>> For information about J forums see http://www.jsoftware.com/forums.htm
>> ----------------------------------------------------------------------
>> For information about J forums see http://www.jsoftware.com/forums.htm
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Reply via email to