Raul, your draft outline matches what I was thinking as well.

I was thinking about the benefits. The first benefit I thought of,
naively, was performance.

The others were:
- standalone deployment
- simplify integration with C code/libraries(?) where J is just a small part[1]
- for sandbox / security
- for learning

There are probably others that aren't jumping to mind at the moment.

In terms of performance, I am not sure how much gain there would be.
On my machine, I can sum a large array very quickly: +/ +/"1 (1e8 4 $
1 2 3 4) runs in under a second. When would extreme performance for
simple operations be needed? J's execution speed is already very good;
we could maybe save a tiny fraction of time in parsing the statements.

For these simple operations, I did some exploration on what gains
could be possible.

I ran jconsole under jdb and did some tracing of the assembly.

My baseline is 87,314 instructions to execute a statement. Don't take
this number literally; it's more useful as a relative number.

I put a breakpoint on jtdivide and counted instructions from there
until Jdo ended.
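The `script` file isn't shown here; purely as an illustration (this is a hypothetical reconstruction, not the actual script), a minimal version that logs each instruction from the breakpoint onward could look like:

```gdb
# hypothetical reconstruction - not the actual script file
set pagination off
break jtdivide
run
display/i $pc
while 1
  stepi
end
```

In --batch mode the loop ends when the program exits, and each stepi prints a few lines to the redirected log, so `wc -l` over the log scales with the instruction count.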

$ gdb --batch --command=script --args ./jconsole -js "smoutput 'hi
bye'[b=:5%5" > log.txt
$ wc -l log.txt
261944 log.txt


To sum up 10,000 items in an array, it's roughly 115,000 instructions.
(Each traced instruction seems to produce three log lines, so the
count is (607,407 - 261,944) / 3.)

$ gdb --batch --command=script --args ./jconsole -js "(smoutput
+/c)[c=:i. 10000[smoutput 'hi bye'[b=:5%5" > log.txt
$ wc -l log.txt
607407 log.txt


To sum up 100,000 items in an array, it's roughly 1,015,238
instructions ((3,307,659 - 261,944) / 3).
$ gdb --batch --command=script --args ./jconsole -js "(smoutput
+/c)[c=:i. 100000[smoutput 'hi bye'[b=:5%5" > log.txt
$ wc -l log.txt
3307659 log.txt

I don't know if it's a valid comparison, but an i7 supposedly executes
roughly 82,700 million instructions per second[2]. I'm sure not all
instructions are equal, though: summing 1e7 items takes 0.089s, and by
my per-item estimates that should have been about 101M instructions,
which at that rate would finish in a millisecond or two.

$ time ./jconsole -js "exit''[(smoutput +/c)[c=:i. 1e7[smoutput 'hi bye'[b=:5%5"
49999995000000

real    0m0.089s
user    0m0.028s
sys     0m0.056s

What are these instructions? They come from jtiota, which calls jtapv,
and from asmplusr in vasm.h.

To populate the array - 5 instructions per item

=> 0x7ffff7166323 <jtapv+195>:  inc    %rbx
=> 0x7ffff7166326 <jtapv+198>:  mov    %rbx,(%rdx)
=> 0x7ffff7166329 <jtapv+201>:  add    $0x8,%rdx
=> 0x7ffff716632d <jtapv+205>:  cmp    %r14,%rbx
=> 0x7ffff7166330 <jtapv+208>:  jne    0x7ffff7166323 <jtapv+195>

To sum up the array - 5 instructions per item

308601 => 0x7ffff72a64e2 <asmplusr+2>: dec    %rdi
308604 => 0x7ffff72a64e5 <asmplusr+5>: add    (%rdx,%rdi,8),%rax
308607 => 0x7ffff72a64e9 <asmplusr+9>: jo     0x7ffff72a64f6 <asmplusr+22>
308610 => 0x7ffff72a64eb <asmplusr+11>: test   %rdi,%rdi
308613 => 0x7ffff72a64ee <asmplusr+14>: jg     0x7ffff72a64e2 <asmplusr+2>

This matches this tight loop:

#define PLUSR(n,z,x)                    \
...
__asm  pr20: add  eax,[esi+ecx*4]       \
__asm        jo   pr30                  \
__asm        loop pr20                  \


clang/LLVM uses SSE extensions on my CPU to generate more optimized
code[3].

I wrote a little C program to mimic iota and generated the assembly output:

    #include <stdlib.h>
    int main(void) {
      int *c = malloc(sizeof(int)*1000000);
      for (int i = 0; i < 1000000; i++) {
        c[i] = i;
      }
      return c[0];
    }

To populate the array - it looks like more instructions, but it
handles blocks of 8 elements.

So roughly 9 * 1e6 / 8 instructions instead of 5 per element in the
code profiled from J. I don't really know how to compare directly
here, and I wasn't able to instrument it easily because clang/LLVM on
Linux generated non-vector (non-SSE) instructions for me.

LBB1_1:
movd %esi, %xmm2
pshufd $0, %xmm2, %xmm2
movdqa %xmm2, %xmm3
paddd %xmm0, %xmm3
paddd %xmm1, %xmm2
movdqu %xmm3, (%eax,%esi,4)
movdqu %xmm2, 16(%eax,%esi,4)
addl $8, %esi
cmpl $1000000, %esi
jne LBB1_1


To sum up the array - also blocks of 8 elements

int sum_elements(int *A, int n) {
  int sum = 0;
  for (int i = 0; i < n; ++i)
    sum += A[i];
  return sum;
}


LBB0_4:
movdqa %xmm1, %xmm2
movdqa %xmm0, %xmm3
movdqu (%edx,%eax,4), %xmm0
movdqu 16(%edx,%eax,4), %xmm1
paddd %xmm3, %xmm0
paddd %xmm2, %xmm1
addl $8, %eax
cmpl %eax, %esi
jne LBB0_4


So in terms of performance for simple array operations, I think there
may be a 1-8x improvement from the vector operations. I'm curious when
this would actually matter. I started toying with audio processing,
which involves roughly 65,000 samples a second; based on my estimates
above, it seems like J would keep up with that fine, depending on
what's being done with the data. Again, execution speed seems great
with J. Maybe it would matter for real-time image processing (from a
webcam or something)...

Still, I think the other benefits may justify putting more time into
it, even if performance isn't a huge improvement.



[1] - The J library already supports this, but maybe not with high
enough performance. Unsure.
[2] - http://en.wikipedia.org/wiki/Instructions_per_second
[3] - http://blog.llvm.org/2013/05/llvm-33-vectorization-improvements.html

On Fri, Feb 28, 2014 at 1:38 PM, Raul Miller <[email protected]> wrote:
> I had been thinking gcc rtl - I've built gcc a number of times and while it
> can be complicated, most of the complications are irrelevant during the
> early development of a project. In particular, there are a number of
> subtleties which matter for cross compilation - where the compiler cannot
> know what it is compiling for just by looking at itself - which should not
> matter for the first few drafts of a compiler. Those abstractions can come
> later.
>
> Still, that might mean that llvm is better for us than rtl, simply for its
> focus on native compilation (as opposed to cross compilation).
>
> Of course a third path involves something like CUDA (which, in its current
> state of development can only be supported through cross compilation, and
> which also needs a certain amount of native support, to get anything useful
> done).
>
> I imagine that an LLVM subset would focus on rank zero operations with a
> small amount of rank 1 and rank _ operations. CUDA would probably focus on
> rank 1 and rank 2 (I'm not certain whether CUDA could efficiently implement
> dyadic transpose, or boxed arrays - there's better things to start on).
>
> Actually... probably the best place to start with a language subset would
> be:
>
> First draft:
> rank 0 only
> + - * dyads, - monad, only one data type (floating point, probably)
>
> Second draft:
> rank 1 arrays and indexing
>
> Third draft:
> whatever you need to implement ;: (mostly this means arbitrarily ranked
> arrays - basically a pair of rank 1 arrays and integer division and modulo).
>
> Fourth draft:
> First pass at J parsing (but no names yet, no adverbs yet, no conjunctions
> yet)
> http://www.jsoftware.com/help/dictionary/dicte.htm
>
> Mostly these passes would be practice. But once you had the parser you
> could start using the tests which come packaged in the J sources. You would
> not so much use those in order (and some would never be done successfully -
> for a J subset you'd identify the fraction of the tests that your subset
> could handle) but you would use them to help you find things which can be
> done (relatively) easily which fit what you already have.
>
> Basically to manage a large body of code, you wind up using errors of some
> sort (compilation errors, testing errors, or whatever else) to help you
> focus on your next steps.
>
> Thanks,
>
> --
> Raul
>
>
>
>
> On Thu, Feb 27, 2014 at 10:19 PM, Joe Bogner <[email protected]> wrote:
>
>> I started to toy with the idea of working on a subset of J that can be
>> compiled. The idea has been brought up before in the past. Yes, I
>> realize it's a path fraught with peril and a likely premature demise.
>> Still, as a toy, I figure it would be fun.
>>
>> Options for intermediate language:
>>
>> 1. GCC RTL - It looks very challenging and as far as I can tell would
>> require recompiling GCC, which is something I don't want to do
>>
>> 2. LLVM IR - I cobbled together a small example that emits IR which
>> can then be compiled. Hypothetically, we could use the LLVM C bindings
>> to generate IR from J words directly from J. For my proof of concept,
>> I used fsharp bindings since that seemed to be the easiest path on
>> Windows. Julia, Rust, and Haskell have LLVM targets.
>>
>> 3. C - Either clang, gcc or tinycc. Generating C doesn't seem as
>> stable and professional grade as the earlier options. However, I have
>> come across decent languages that do it (Chicken Scheme and others).
>> ELI does it, http://fastarray.appspot.com/compile.html. APEX also does
>> it, http://www.snakeisland.com/apexup.htm. Could be used to also JIT
>> code from J for small functions.
>>
>> I'm currently edging towards the C path. Maybe that makes more sense
>> as a little proof of concept since it seems like it'd be quicker to
>> get up and running. My vision here is to implement a subset of verbs
>> in C ({ i. # $) .. really basic .. and then have a parser that takes a
>> series of J expressions and translates them into the corresponding
>> running list of C function calls - all in a single main().
>>
>> Some things I'm excited about trying out:
>>
>> 1. clang/llvm vectorization -
>> http://blog.llvm.org/2013/05/llvm-33-vectorization-improvements.html
>>
>> 2. parallel computation and concurrency -
>> http://opensource.mlba-team.de/xdispatch/docs/current/index.html
>> ----------------------------------------------------------------------
>> For information about J forums see http://www.jsoftware.com/forums.htm
>>
