As far as I'm concerned, the first rule of code optimization is: find the
bottleneck.

J performs very well on large, uniform arrays.  When I've looked at
optimized C code, one of the potentially big gains seems to come from C's
ability to bail out of a deeply-nested loop as soon as a solution is found.
 This only applies to certain kinds of problems - ones with an
easily-verifiable best solution - but is the sort of thing an array
language has a hard time emulating.
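
As a minimal sketch of what I mean (the problem and all names here are
made up), a C search can return from the middle of the nested loops the
moment a candidate verifies, while the natural array formulation
evaluates every candidate before selecting a winner:

#include <stdio.h>

/* Find the first pair (i,j) with a[i] + a[j] == target.  The solution
   is cheap to verify, so C bails out of both loops as soon as one is
   found; an array version would typically build the full sum table. */
int find_pair(const int *a, int n, int target, int *pi, int *pj) {
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (a[i] + a[j] == target) {
                *pi = i; *pj = j;
                return 1;               /* early bail-out */
            }
    return 0;
}

int main(void) {
    int a[] = {3, 9, 5, 1, 7};
    int i, j;
    if (find_pair(a, 5, 8, &i, &j))
        printf("a[%d] + a[%d] = 8\n", i, j);
    return 0;
}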

In theory, you could translate any piece of J code into the set of C
routines that underlies the relevant primitives.  The hard part seems to be
doing this automatically.  Part of the difficulty is that J will handle any
number of edge conditions, particularly those requiring type promotion and
dynamically-sized arrays, that must be accounted for in the compiled
solution - which complicates the code.
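
To make that concrete, here is a hypothetical sketch (not the actual J
source) of what faithfully compiling even the integer case of +/ would
entail - the overflow check and promotion to floating point roughly
double the code for this one primitive.  (The jo, jump-on-overflow,
instruction in the asmplusr loop quoted below is the interpreter paying
the same tax.)

/* Hypothetical sketch: summing a J integer list the way a faithful
   compiled version would have to - detecting overflow and promoting
   the result to floating point, as J's + does. */
typedef struct { int is_float; long long i; double d; } jnum;

jnum sum_list(const long long *a, long long n) {
    jnum z = {0, 0, 0.0};
    for (long long k = 0; k < n; k++) {
        if (!z.is_float) {
            long long t;
            if (__builtin_add_overflow(z.i, a[k], &t)) {  /* gcc/clang */
                z.is_float = 1;          /* promote on integer overflow */
                z.d = (double)z.i + (double)a[k];
            } else
                z.i = t;
        } else
            z.d += (double)a[k];         /* already promoted */
    }
    return z;
}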

One possible shortcut I've thought of is to incorporate information from
running the J code with relevant test cases.  If the code could possibly
need to handle floating point but really only ever deals with integers,
this would simplify the compiled code you'd have to generate.  So, instead
of only looking at the J code, you would have information on some typical
use cases to help you prune the code tree.  This would sacrifice generality
in order to simplify code generation.
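
A hypothetical sketch of what that pruning buys (all names made up): if
the test runs only ever saw 64-bit integers, you emit the bare integer
loop and merely guard the assumption at run time, punting back to the
interpreter if it ever fails.

enum jtype { JT_INT, JT_FLOAT };

extern long long fallback_to_interpreter(void);  /* hypothetical */

long long sum_int_only(const long long *a, long long n, enum jtype t) {
    if (t != JT_INT)
        return fallback_to_interpreter();  /* guard failed: punt */
    long long s = 0;
    for (long long k = 0; k < n; k++)
        s += a[k];                         /* no promotion branch */
    return s;
}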



On Fri, Feb 28, 2014 at 3:50 PM, Joe Bogner <[email protected]> wrote:

> Raul, your draft outline matches what I was thinking as well.
>
> I was thinking about the benefits. The first benefit I thought of,
> naively, was performance.
>
> The others were:
> - standalone deployment
> - simplify integration with C code/libraries(?) where J is just a small
> part[1]
> - for sandbox / security
> - for learning
>
> There are probably others that aren't jumping to mind at the moment.
>
> In terms of performance, I am not sure how much gain there will be.
> On my machine, I can sum a large array very quickly: +/ +/"1 (1e8 4 $
> 1 2 3 4) in less than a second or so.  When would extreme performance
> for simple operations be needed? J's execution speed is very good. We
> could maybe save a tiny fraction of time in the parsing of the
> statements.
>
> For these simple operations, I did some exploration on what gains
> could be possible.
>
> I ran jconsole under jdb and did some tracing of the assembly.
>
> My baseline number of instructions to execute a statement is 87,314.
> Don't take this number literally; it's more meaningful as a relative
> figure.
>
> I put a breakpoint on jtdivide and counted instructions from there
> until Jdo ends.
>
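> The script isn't shown here; a hypothetical reconstruction along these
> lines would log one disassembled instruction per line, which is why
> wc -l on the log approximates the instruction count:
>
> break jtdivide
> run
> while 1
>   x/i $pc
>   stepi
> end
>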
> $ gdb --batch --command=script --args ./jconsole -js "smoutput 'hi
> bye'[b=:5%5" > log.txt
> $ wc -l log.txt
> 261944 log.txt
>
>
> To sum up 10,000 items in an array, it's roughly 115,000 instructions.
>
> $ gdb --batch --command=script --args ./jconsole -js "(smoutput
> +/c)[c=:i. 10000[smoutput 'hi bye'[b=:5%5" > log.txt
> $ wc -l log.txt
> 607407 log.txt
>
>
> To sum up 100,000 items in an array, it's roughly 1,015,238
> instructions.
>
> $ gdb --batch --command=script --args ./jconsole -js "(smoutput
> +/c)[c=:i. 100000[smoutput 'hi bye'[b=:5%5" > log.txt
> $ wc -l log.txt
> 3307659 log.txt
>
> I don't know if it's a valid comparison, but an i7 supposedly does
> roughly 82,700 million instructions per second [2]. I'm sure not all
> instructions are equal, because summing 1e7 items takes 0.089s, when
> the roughly 101M instructions involved should take only about a
> millisecond at that rate:
>
> $ time ./jconsole -js "exit''[(smoutput +/c)[c=:i. 1e7[smoutput 'hi
> bye'[b=:5%5"
> 49999995000000
>
> real    0m0.089s
> user    0m0.028s
> sys     0m0.056s
>
> What are these instructions? They come from jtiota, which calls jtapv,
> and from asmplusr in vasm.h.
>
> To populate the array - 5 instructions per item
>
> => 0x7ffff7166323 <jtapv+195>:  inc    %rbx
> => 0x7ffff7166326 <jtapv+198>:  mov    %rbx,(%rdx)
> => 0x7ffff7166329 <jtapv+201>:  add    $0x8,%rdx
> => 0x7ffff716632d <jtapv+205>:  cmp    %r14,%rbx
> => 0x7ffff7166330 <jtapv+208>:  jne    0x7ffff7166323 <jtapv+195>
>
> To sum up the array - 5 instructions per item
>
> 308601 => 0x7ffff72a64e2 <asmplusr+2>: dec    %rdi
> 308604 => 0x7ffff72a64e5 <asmplusr+5>: add    (%rdx,%rdi,8),%rax
> 308607 => 0x7ffff72a64e9 <asmplusr+9>: jo     0x7ffff72a64f6 <asmplusr+22>
> 308610 => 0x7ffff72a64eb <asmplusr+11>:        test   %rdi,%rdi
> 308613 => 0x7ffff72a64ee <asmplusr+14>:        jg     0x7ffff72a64e2
> <asmplusr+2>
>
> This matches this tight loop:
>
> #define PLUSR(n,z,x)                    \
> ...
> __asm  pr20: add  eax,[esi+ecx*4]       \
> __asm        jo   pr30                  \
> __asm        loop pr20                  \
>
>
> clang/LLVM uses SSE extensions on my CPU to generate more optimized
> code [3]:
>
> I wrote a little C program to mimic iota and generated the assembly output:
>
>     #include <stdlib.h>
>     int main(void) {
>       int *c = malloc(sizeof(int)*1000000);
>       for(int i = 0;i<1000000;i++) {
>         c[i] = i;
>       }
>       return c[999999]; /* use c so the loop isn't optimized away */
>     }
>
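> To get the assembly, something along these lines works (filename
> hypothetical; -O3 is what enables LLVM's loop vectorizer by default
> as of 3.3 [3]):
>
> $ clang -O3 -S mimic_iota.c -o mimic_iota.s
>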
> To populate the array - it looks like more instructions, but they
> process blocks of 8 elements.
>
> So that's roughly 9 instructions per block of 8 elements (9 * 1e6 / 8
> total) instead of 5 per element in the assembly profiled from J. I
> don't really know how to compare here, and I wasn't able to easily
> instrument it because clang/LLVM on Linux generated non-vector
> (non-SSE) instructions.
>
> LBB1_1:
> movd %esi, %xmm2
> pshufd $0, %xmm2, %xmm2
> movdqa %xmm2, %xmm3
> paddd %xmm0, %xmm3
> paddd %xmm1, %xmm2
> movdqu %xmm3, (%eax,%esi,4)
> movdqu %xmm2, 16(%eax,%esi,4)
> addl $8, %esi
> cmpl $1000000, %esi
> jne LBB1_1
>
>
> To sum up the array - also blocks of 8 elements
>
> int sum_elements(int *A, int n) {
>   int sum = 0;
>   for (int i = 0; i < n; ++i)
>     sum += A[i];
>   return sum;
> }
>
>
> LBB0_4:
> movdqa %xmm1, %xmm2
> movdqa %xmm0, %xmm3
> movdqu (%edx,%eax,4), %xmm0
> movdqu 16(%edx,%eax,4), %xmm1
> paddd %xmm3, %xmm0
> paddd %xmm2, %xmm1
> addl $8, %eax
> cmpl %eax, %esi
> jne LBB0_4
>
>
> So in terms of performance for simple array operations, I think there
> may be a 1-8x improvement from the vector operations. I'm curious:
> when would this matter? I started toying with audio processing, which
> is roughly 65,000 elements a second. Based on my estimates above, it
> seems like J would be able to keep up with that fine, depending on
> what's being done with it. Again, execution speed seems great with J.
> Maybe it would matter for real-time image processing (from a webcam or
> something)...
>
> Still, I think the other benefits may justify putting more time into
> it, even if performance isn't a huge improvement.
>
>
>
> [1] - The J library already supports this, but maybe not at high
> enough performance. Unsure.
> [2] - http://en.wikipedia.org/wiki/Instructions_per_second
> [3] - http://blog.llvm.org/2013/05/llvm-33-vectorization-improvements.html
>
> On Fri, Feb 28, 2014 at 1:38 PM, Raul Miller <[email protected]>
> wrote:
> > I had been thinking gcc rtl - I've built gcc a number of times, and
> > while it can be complicated, most of the complications are irrelevant
> > during the early development of a project. In particular, there are a
> > number of subtleties which matter for cross compilation - where the
> > compiler cannot know what it is compiling for just by looking at
> > itself - which should not matter for the first few drafts of a
> > compiler. Those abstractions can come later.
> >
> > Still, that might mean that llvm is better for us than rtl, simply
> > for its focus on native compilation (as opposed to cross
> > compilation).
> >
> > Of course, a third path involves something like CUDA (which, in its
> > current state of development, can only be supported through cross
> > compilation, and which also needs a certain amount of native support
> > to get anything useful done).
> >
> > I imagine that an LLVM subset would focus on rank zero operations
> > with a small amount of rank 1 and rank _ operations. CUDA would
> > probably focus on rank 1 and rank 2 (I'm not certain whether CUDA
> > could efficiently implement dyadic transpose, or boxed arrays - there
> > are better things to start on).
> >
> > Actually... probably the best place to start with a language subset would
> > be:
> >
> > First draft:
> > rank 0 only
> > + - * dyads, - monad, only one data type (floating point, probably)
> >
> > Second draft:
> > rank 1 arrays and indexing
> >
> > Third draft:
> > whatever you need to implement ;: (mostly this means arbitrarily
> > ranked arrays - basically a pair of rank 1 arrays plus integer
> > division and modulo; see the struct sketch after this list).
> >
> > Fourth draft:
> > First pass at J parsing (but no names yet, no adverbs yet, no
> > conjunctions yet)
> > http://www.jsoftware.com/help/dictionary/dicte.htm
> >
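> > A minimal sketch of that pair-of-rank-1-arrays representation
> > (struct and field names hypothetical):
> >
> > /* An arbitrarily-ranked array as two rank-1 arrays: the shape
> >    vector and the ravelled data.  Offsets into the ravel map back to
> >    index vectors via integer division and modulo by shape entries. */
> > typedef struct {
> >     long long  rank;
> >     long long *shape;   /* rank entries */
> >     double    *ravel;   /* product-of-shape entries, row-major */
> > } jarray;
> >
> > /* index vector -> ravel offset (row-major) */
> > long long ravel_offset(const jarray *a, const long long *idx) {
> >     long long off = 0;
> >     for (long long k = 0; k < a->rank; k++)
> >         off = off * a->shape[k] + idx[k];
> >     return off;
> > }
> >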
> > Mostly these passes would be practice. But once you had the parser,
> > you could start using the tests which come packaged in the J sources.
> > You would not so much use those in order (and some would never be
> > done successfully - for a J subset you'd identify the fraction of the
> > tests that your subset could handle), but you would use them to help
> > you find things which can be done (relatively) easily and which fit
> > what you already have.
> >
> > Basically, to manage a large body of code, you wind up using errors
> > of some sort (compilation errors, testing errors, or whatever else)
> > to help you focus on your next steps.
> >
> > Thanks,
> >
> > --
> > Raul
> >
> >
> >
> >
> > On Thu, Feb 27, 2014 at 10:19 PM, Joe Bogner <[email protected]> wrote:
> >
> >> I started to toy with the idea of working on a subset of J that can be
> >> compiled. The idea has been brought up before in the past. Yes, I
> >> realize it's a path fraught with peril and a likely premature demise.
> >> Still, as a toy, I figure it would be fun.
> >>
> >> Options for intermediate language:
> >>
> >> 1. GCC RTL - It looks very challenging and, as far as I can tell,
> >> would require recompiling GCC, which is something I don't want to
> >> do.
> >>
> >> 2. LLVM IR - I cobbled together a small example that emits IR which
> >> can then be compiled. Hypothetically, we could use the LLVM C
> >> bindings to generate IR from J words directly from J. For my proof
> >> of concept, I used the F# bindings, since that seemed to be the
> >> easiest path on Windows. Julia, Rust, and Haskell have LLVM targets.
> >>
> >> 3. C - Either clang, gcc, or tinycc. Generating C doesn't seem as
> >> stable and professional-grade as the earlier options. However, I
> >> have come across decent languages that do it (Chicken Scheme and
> >> others). ELI does it, http://fastarray.appspot.com/compile.html.
> >> APEX also does it, http://www.snakeisland.com/apexup.htm. This path
> >> could also be used to JIT code from J for small functions.
> >>
> >> I'm currently edging towards the C path. Maybe that makes more
> >> sense as a little proof of concept, since it seems like it'd be
> >> quicker to get up and running. My vision here is to implement a
> >> subset of verbs in C ({ i. # $) - really basic - and then have a
> >> parser that takes a series of J expressions and translates them into
> >> the corresponding running list of C function calls, all in a single
> >> main().
> >>
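> >> As a hypothetical sketch (every name below is made up), the
> >> expression +/ i. 10 might come out of that translator as:
> >>
> >> #include <stdio.h>
> >> #include <stdlib.h>
> >>
> >> /* Hypothetical generated C for the J expression:  +/ i. 10 */
> >> typedef struct { long long n; long long *v; } array;
> >>
> >> static array *j_iota(long long n) {            /* i. */
> >>     array *a = malloc(sizeof *a);
> >>     a->n = n;
> >>     a->v = malloc(n * sizeof *a->v);
> >>     for (long long k = 0; k < n; k++) a->v[k] = k;
> >>     return a;
> >> }
> >>
> >> static array *j_sumreduce(const array *a) {    /* +/ */
> >>     array *z = malloc(sizeof *z);
> >>     z->n = 1;
> >>     z->v = malloc(sizeof *z->v);
> >>     z->v[0] = 0;
> >>     for (long long k = 0; k < a->n; k++) z->v[0] += a->v[k];
> >>     return z;
> >> }
> >>
> >> int main(void) {                    /* the single generated main() */
> >>     array *t0 = j_iota(10);         /* J evaluates right to left */
> >>     array *t1 = j_sumreduce(t0);
> >>     printf("%lld\n", t1->v[0]);     /* prints 45 */
> >>     return 0;
> >> }
> >>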
> >> Some things I'm excited about trying out:
> >>
> >> 1. clang/llvm vectorization -
> >> http://blog.llvm.org/2013/05/llvm-33-vectorization-improvements.html
> >>
> >> 2. parallel computation and concurrency -
> >> http://opensource.mlba-team.de/xdispatch/docs/current/index.html
>



-- 
Devon McCormick, CFA