The creator has descended from the shining heavens and responded to my post. Cool.
Here are some stats, naively gathered by taking time() before and after the run and subtracting (the harness is sketched below, after the quoted message). The second run is with gc_disable(); a 1.26x speedup is noted.

Run 1 (GC enabled):
  elapsed (s): 0.0916
  Throughput, KB/s: 1705.97
  Average time (μs) per iteration: 0.0
  Estimated cycles / iteration @ 4.0 GHz: 36636.0

Run 2 (gc_disable()):
  elapsed (s): 0.0727
  Throughput, KB/s: 2149.6
  Average time (μs) per iteration: 7.2688
  Estimated cycles / iteration @ 4.0 GHz: 29075.0

I'd like to know how to minimize the effect of the garbage collector and allocations. The algorithm is a tight loop over a handful of tiny functions, none of which ought to require much allocation. A few variables for placeholder data are inevitable, but I have read the warnings discouraging the use of global state. What is the Julia way to allocate a few bytes of scratch memory that can be accessed within the scope of, say, apply_ECB_mode!() without being reallocated each time gf_mult() or mix_columns!() is called? (There's a rough sketch of what I mean below the quoted message.)

Also, can a function be forced to mutate variables passed to it, eliminating a redundant assignment? Julia seems happy to do this with Arrays, but not with primitives like a single UInt8. Pass by reference (a pointer)? Does LLVM appreciate this? (Also sketched below.)

On Friday, September 11, 2015 at 6:15:52 PM UTC-4, Stefan Karpinski wrote:
>
> There's nothing obviously glaring here. I would definitely recommend using the built-in profiler to see where it's spending its time. There may be subtle type instabilities or some other non-obvious issue. You definitely ought to be able to get within striking distance of similar C code, which should be in the expected 4-10x slower than hand-coded assembly.
>
> On Fri, Sep 11, 2015 at 5:10 PM, Corey Moncure <[email protected]> wrote:
>
>> https://github.com/cmoncure/crypto/blob/master/aes.jl
>>
>> In the process of learning Julia (and crypto) I implemented the Rijndael block cipher and inverse cipher. I tried to write idiomatic yet concise code, but the performance is not very desirable. On my machine (i5-2500k @ 4.0 Ghz) the throughput is piddling, on the order of 10e6 bytes/sec, and memory allocation is at 3056 bytes / block, which I have not been able to cut down any further.
>>
>> Obviously I do not intend to compete with hand-tuned assembler routines that heavily exploit SIMD and pre-computed tables, but my completely unfounded gut feeling is that given the right input, Julia should be able to approach within a factor of 4-10 without such optimizations. Currently this routine is within a factor of 1000.
>>
>> Any Julia experts out there willing to take a peek at the code and offer some tips for idiomatic (i.e. within the framework of Julia syntax and style) optimizations?
>>
>> In the course of doing this I have run into several gripes with Julia, particularly some of the built-in functions which are often confusing or contradictory by virtue of the type declarations of certain methods (or lack of needed ones). For instance, Julia does not support negative indexing of arrays... so then why do so many functions on arrays take only signed integer types for dimensions? To the noobie it seems like an obvious choice to type data holding the calculation of matrix dimensions or indices as unsigned integers, given that the language does not support negative indexing. Yet this fails unexpectedly in many built-ins such as sub().
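For context, the numbers above came from a harness along these lines. This is only a sketch: encrypt_blocks!, the block count, and the data size are stand-ins, not the actual driver in aes.jl.

# Sketch of the timing harness; encrypt_blocks! and the sizes are
# stand-ins, not the real driver from aes.jl.
const NBLOCKS = 10000

encrypt_blocks!(data) = data        # placeholder for the real cipher loop

data = rand(UInt8, 16 * NBLOCKS)    # 16-byte blocks

t0 = time()
encrypt_blocks!(data)
elapsed = time() - t0

println("elapsed (s): ", elapsed)
println("Throughput, KB/s: ", length(data) / 1024 / elapsed)
println("Average time (μs) per iteration: ", elapsed / NBLOCKS * 1e6)
println("Estimated cycles / iteration @ 4.0 GHz: ", elapsed / NBLOCKS * 4.0e9)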
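To make the scratch-memory question concrete, the pattern I'm wondering about is allocating the buffer once in the outer routine and threading it through as an argument. This is only a sketch with placeholder bodies, not the real code from aes.jl:

# Sketch only: the state layout and loop bodies are placeholders,
# not the actual AES math.

# Inner routine takes its scratch space as an argument instead of
# allocating a temporary on every call.
function mix_columns!(state::Matrix{UInt8}, scratch::Vector{UInt8})
    for c in 1:4
        for r in 1:4
            scratch[r] = state[r, c]     # snapshot the column
        end
        for r in 1:4
            # the real code would combine scratch[1:4] via gf_mult here
            state[r, c] = scratch[r]
        end
    end
    return state
end

# Outer routine owns the buffer, so it is allocated exactly once no
# matter how many blocks are processed.
function apply_ECB_mode!(blocks::Vector{Matrix{UInt8}})
    scratch = zeros(UInt8, 4)
    for state in blocks
        mix_columns!(state, scratch)
    end
    return blocks
end

Is that the preferred idiom, or is a const global (or a let-captured buffer) considered acceptable for this kind of thing?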
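And for the mutation question, these are the two shapes I can see. The function names are made up, and the doubling is a toy stand-in for gf_mult:

# Sketch of the two approaches; double/double! are made-up names and the
# doubling is a toy operation, not the real GF(2^8) multiply.

# (a) Return the new value and rebind at the call site -- this is the
#     "redundant assignment" I'd like to avoid.
double(x::UInt8) = x * 0x02

b = 0x03
b = double(b)

# (b) Simulate pass-by-reference with a Ref cell (a one-element
#     Vector{UInt8} would behave the same way).
function double!(x::Ref{UInt8})
    x[] *= 0x02
    return x
end

r = Ref{UInt8}(0x03)
double!(r)
r[]                     # 0x06

Does the rebinding in (a) actually cost anything once everything is inlined, or does LLVM make the two equivalent anyway?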
