I think this code is much quicker on 0.4 simply because it leaves UInt8s 
as UInt8s - far less type conversion.
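
For instance, a GF(2^8) doubling step (the building block of gf_mult) can be 
written so that every operation stays in UInt8 - a hypothetical sketch, not 
the code from the repo:

```julia
# Multiply by x (0x02) in GF(2^8) with the AES reduction polynomial 0x1b.
# <<, >>, * and $ (bitwise xor in 0.4; xor/⊻ from 0.5 on) all return
# UInt8 for UInt8 arguments, so nothing gets promoted to Int.
xtime(b::UInt8) = (b << 1) $ ((b >> 7) * 0x1b)
```

Chaining that through the bits of the multiplier gives a full gf_mult with no 
intermediate Ints and no allocation.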

Adding:
function bench()
  in_ = hex2bytes("00112233445566778899aabbccddeeff")
  key = hex2bytes("000102030405060708090a0b0c0d0e0f")
  prm = aes_get_cipher_params(key)
  keys = gen_key_schedule(key, prm)
  keyr = reshape(keys, 4, 4, 11)

  inr = reshape(in_, 4, 4)
  @time for i in 1:100
    rijndael(inr, keyr, prm)
  end
end

bench()
bench()
bench()

On 0.4-RC1:
0.000644 seconds (4.10 k allocations: 285.938 KB)
0.000638 seconds (4.10 k allocations: 285.938 KB)

On 0.3.11
elapsed time: 0.001574148 seconds (297600 bytes allocated)
elapsed time: 0.001740806 seconds (297600 bytes allocated)

The `key_block[:,:,i]` lines are probably responsible for a lot of the 
allocation - indexing with ranges like that copies the slice into a brand 
new array every time it runs, instead of returning a view.
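
A minimal sketch of the difference, using a zeroed stand-in for the real 
4x4x11 key schedule (`sub` returns a SubArray view in 0.4; from 0.5 on this 
is spelled `view`):

```julia
key_block = zeros(UInt8, 4, 4, 11)   # stand-in for the real key schedule

# Allocates: range indexing copies the 4x4 slice into a fresh array
# on every iteration of the round loop.
rk_copy = key_block[:, :, 1]

# No copy: sub returns a SubArray that references key_block's memory.
rk_view = sub(key_block, :, :, 1)
rk_view[1, 1] = 0x99                 # writes through to the parent array
```

Hoisting the `sub` calls out of the round loop (or reusing one preallocated 
scratch array) should cut most of those 4.10 k allocations.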

On Saturday, September 12, 2015 at 1:41:41 AM UTC-4, Jeffrey Sarnoff wrote:
>
> (corrected post)
>
> Wrapping a value in a type gets you a passable, mutable value via the 
> field name:
>
>  julia> type PassMe
>           b::UInt8
>        end
> julia> useMe = PassMe(0x03); useMe.b
> 0x03
> julia> function changeMe!(pass::PassMe, newval::UInt8)
>           pass.b = newval
>           true # pass is not returned
>        end;
> julia> changeMe!(useMe, 0xff); useMe.b
> 0xff
>
>
> On Saturday, September 12, 2015 at 12:16:24 AM UTC-4, Yichao Yu wrote:
>>
>> On Fri, Sep 11, 2015 at 11:56 PM, Jeffrey Sarnoff 
>> <[email protected]> wrote: 
>> > The way to get  a non-global, global-like value is to wrap the whole 
>> thing 
>> > as a module: 
>> > module aes 
>> > export rijndael 
>> > 
>> > not_a_global = true 
>> > # .. functions use not_a_global ... 
>> > end # module 
>>
>> Just to be clear, `not_a_global` is a global. 
>>
>> > 
>> > Wrapping a value in a type gets you a passable, mutable value via the 
>> field 
>> > name: 
>> > 
>> >  julia> type PassMe 
>> >           b::UInt8 
>> >        end 
>> > julia> useMe = PassMe(0x03); useMe.b 
>> > 0x03 
>> > julia> function changeMe!(pass::PassMe, newval::UInt8) 
>> >           pass.b = newval 
>> >           true # pass is not returned 
>> >        end; 
>> > julia> changeMe!(useMe, 0xff); useMe.b 
>> > 0xff 
>> > 
>> > 
>> > On Friday, September 11, 2015 at 11:34:58 PM UTC-4, Stefan Karpinski 
>> wrote: 
>> >> 
>> >> Oh yeah, I didn't notice that – having those be abstractly typed is 
>> going 
>> >> to be a trainwreck for performance. It should be possible to eliminate 
>> all 
>> >> allocation from this kind of code. 
>> >> 
>> >> On Fri, Sep 11, 2015 at 10:46 PM, Tom Breloff <[email protected]> 
>> wrote: 
>> >>> 
>> >>> I looked at your code for a grand total of 20 seconds, but the one 
>> thing 
>> >>> I would check is whether using concrete types (Int) in your 
>> immutable helps, or 
>> >>> maybe making it parametric: 
>> >>> 
>> >>> immutable AES_cipher_params{T<:Unsigned} 
>> >>>   bits::T # Cipher key length, bits 
>> >>>   nk::T # Number of 32-bit words, cipher key 
>> >>>   nb::T # Number of columns in State 
>> >>>   nr::T # Number of rounds 
>> >>>   block_size::T # byte length 
>> >>>   block_x::T # block dimensions X 
>> >>>   block_y::T # block dimensions Y 
>> >>> end 
>> >>> 
>> >>> It's possible the compiler can't infer enough types because of this? 
>> >>> Just a guess... 
>> >>> 
>> >>> On Fri, Sep 11, 2015 at 10:38 PM, Corey Moncure <[email protected]> 
>>
>> >>> wrote: 
>> >>>> 
>> >>>> The creator has descended from the shining heavens and responded to 
>> my 
>> >>>> post.  Cool. 
>> >>>> 
>> >>>> Here are some stats naively gathered from taking time() before and 
>> after 
>> >>>> the run and subtracting.  Second run is with gc_disable().  A 1.26x 
>> speedup 
>> >>>> is noted. 
>> >>>> 
>> >>>> elapsed (s): 0.0916 
>> >>>> Throughput, KB/s: 1705.97 
>> >>>> Average time (μs) per iteration: 0.0 
>> >>>> Estimated cycles / iteration @ 4.0 GHz: 36636.0 
>> >>>> 
>> >>>> elapsed (s): 0.0727 
>> >>>> Throughput, KB/s: 2149.6 
>> >>>> Average time (μs) per iteration: 7.2688 
>> >>>> Estimated cycles / iteration @ 4.0 GHz: 29075.0 
>> >>>> 
>> >>>> 
>> >>>> I'd like to know how to minimize the effect of the garbage collector 
>> and 
>> >>>> allocations.  The algorithm is a tight loop with a handful of tiny 
>> >>>> functions, none of which ought to require much allocation.  A few 
>> variables 
>> >>>> for placeholder data are inevitable.  But I have read the warnings 
>> >>>> discouraging the use of global state.  What is the Julia way to 
>> allocate a 
>> >>>> few bytes of scratch memory that can be accessed within the scope 
>> of, say, 
>> >>>> apply_ECB_mode!() without having to be reallocated each time 
>> gf_mult() or 
>> >>>> mix_columns! are called? 
>> >>>> 
>> >>>> Also, can a function be forced to mutate variables passed to it, 
>> >>>> eliminating a redundant assignment?  Julia seems happy to do this 
>> with 
>> >>>> Arrays but not unit primitives.  Pass by reference (pointer)?  Does 
>> the LLVM 
>> >>>> appreciate this? 
>> >>>> 
>> >>>> On Friday, September 11, 2015 at 6:15:52 PM UTC-4, Stefan Karpinski 
>> >>>> wrote: 
>> >>>>> 
>> >>>>> There's nothing obviously glaring here. I would definitely 
>> recommend 
>> >>>>> using the built-in profiler to see where it's spending its time. 
>> There may 
>> >>>>> be subtle type instabilities or some other non-obvious issue. You 
>> definitely 
>> >>>>> ought to be able to get within striking distance of similar C code, 
>> which 
>> >>>>> should be within the expected 4-10x of hand-coded assembly. 
>> >>>>> 
>> >>>>> On Fri, Sep 11, 2015 at 5:10 PM, Corey Moncure <[email protected]> 
>>
>> >>>>> wrote: 
>> >>>>>> 
>> >>>>>> https://github.com/cmoncure/crypto/blob/master/aes.jl 
>> >>>>>> 
>> >>>>>> In the process of learning Julia (and crypto) I implemented the 
>> >>>>>> Rijndael block cipher and inverse cipher.  I tried to write 
>> idiomatic yet 
>> >>>>>> concise code, but the performance is not very desirable.  On my 
>> machine 
>> >>>>>> (i5-2500k @ 4.0 Ghz) the throughput is piddling, on the order of 
>> 10e6 
>> >>>>>> bytes/sec, and memory allocation is at 3056 bytes / block, which I 
>> have not 
>> >>>>>> been able to cut down any further. 
>> >>>>>> 
>> >>>>>> Obviously I do not intend to compete with hand-tuned assembler 
>> >>>>>> routines that heavily exploit SIMD and pre-computed tables, but my 
>> >>>>>> completely unfounded gut feeling is that given the right input, 
>> Julia should 
>> >>>>>> be able to approach within a factor of 4-10 without such 
>> optimizations. 
>> >>>>>> Currently this routine is within a factor of 1000. 
>> >>>>>> 
>> >>>>>> Any Julia experts out there willing to take a peek at the code and 
>> >>>>>> offer some tips for idiomatic (i.e. within the framework of Julia 
>> syntax and 
>> >>>>>> style) optimizations? 
>> >>>>>> 
>> >>>>>> In the course of doing this I have run into several gripes with 
>> Julia, 
>> >>>>>> particularly some of the built-in functions which are often 
>> confusing or 
>> >>>>>> contradictory by virtue of the type declarations of certain 
>> methods (or lack 
>> >>>>>> of needed ones).  For instance, Julia does not support negative 
>> indexing of 
>> >>>>>> arrays... so then why do so many functions on arrays take only 
>> signed 
>> >>>>>> integer types for dimensions?  To the noobie it seems like an 
>> obvious choice 
>> >>>>>> to type data holding the calculation of matrix dimensions or 
>> indices as 
>> >>>>>> unsigned integers, given that the language does not support 
>> negative 
>> >>>>>> indexing.  Yet this fails unexpectedly in many built-ins such as 
>> sub(). 
>> >>>>>> 
>> >>>>>> 
>> >>>>> 
>> >>> 
>> >> 
>> > 
>>
>
