I think this code is much quicker on 0.4, largely because it leaves UInt8s
as UInt8s, so there is far less type conversion.
Adding:
function bench()
    in_ = hex2bytes("00112233445566778899aabbccddeeff")
    key = hex2bytes("000102030405060708090a0b0c0d0e0f")
    prm = aes_get_cipher_params(key)
    keys = gen_key_schedule(key, prm)
    keyr = reshape(keys, 4, 4, 11)
    inr = reshape(in_, 4, 4)
    @time for i in 1:100
        rijndael(inr, keyr, prm)
    end
end

bench()
bench()
bench()
On 0.4-RC1:
0.000644 seconds (4.10 k allocations: 285.938 KB)
0.000638 seconds (4.10 k allocations: 285.938 KB)
On 0.3.11:
elapsed time: 0.001574148 seconds (297600 bytes allocated)
elapsed time: 0.001740806 seconds (297600 bytes allocated)
The `key_block[:,:,i]` lines are probably responsible for a lot of the
allocation - a slice like that copies into a fresh array every time it runs.
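One way to see (and remove) that cost is to index the third dimension directly instead of materializing the slice. A minimal sketch; `key_block` here is a hypothetical stand-in for the reshaped key schedule, and the summing loops are only there to exercise the indexing:

```julia
# key_block stands in for the reshaped 4x4x11 key schedule from the post.
key_block = reshape(collect(UInt8, 0:175), 4, 4, 11)

# Slicing copies 16 bytes into a fresh 4x4 Array on every iteration.
function walk_with_copies(key_block)
    s = 0
    for i in 1:size(key_block, 3)
        block = key_block[:, :, i]   # allocates a temporary each time
        s += sum(block)
    end
    s
end

# Direct indexing touches the same bytes without any temporary.
function walk_direct(key_block)
    s = 0
    for i in 1:size(key_block, 3), c in 1:4, r in 1:4
        s += key_block[r, c, i]
    end
    s
end
```

Both walks visit the same bytes and return the same total, but only the first one allocates per round. When a whole slab really is needed as an array, `sub(key_block, :, :, i)` gives a lightweight view instead of a copy on 0.3/0.4.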
On Saturday, September 12, 2015 at 1:41:41 AM UTC-4, Jeffrey Sarnoff wrote:
>
> (corrected post)
>
> Wrapping a value in a type gets you a passable, mutable value via the
> field name:
>
> julia> type PassMe
>            b::UInt8
>        end
> julia> useMe = PassMe(0x03); useMe.b
> 0x03
> julia> function changeMe!(pass::PassMe, newval::UInt8)
>            pass.b = newval
>            true # pass is not returned
>        end;
> julia> changeMe!(useMe, 0xff); useMe.b
> 0xff
>
>
> On Saturday, September 12, 2015 at 12:16:24 AM UTC-4, Yichao Yu wrote:
>>
>> On Fri, Sep 11, 2015 at 11:56 PM, Jeffrey Sarnoff
>> <[email protected]> wrote:
>> > The way to get a non-global, global-like value is to wrap the whole
>> > thing as a module:
>> > module aes
>> > export rijndael
>> >
>> > not_a_global = true
>> > # .. functions use not_a_global ...
>> > end # module
>>
>> Just to be clear, `not_a_global` is a global.
>>
>> >
>> > Wrapping a value in a type gets you a passable, mutable value via the
>> > field name:
>> >
>> > julia> type PassMe
>> > b::UInt8
>> > end
>> > julia> useMe = PassMe(0x03); useMe.b
>> > 0x03
>> > julia> function changeMe!(pass::PassMe, newval::UInt8)
>> > pass.b = newval
>> > true # pass is not returned
>> > end;
>> > julia> changeMe!(useMe, 0xff); useMe.b
>> > 0xff
>> >
>> >
>> > On Friday, September 11, 2015 at 11:34:58 PM UTC-4, Stefan Karpinski
>> > wrote:
>> >>
>> >> Oh yeah, I didn't notice that – having those be abstractly typed is
>> >> going to be a trainwreck for performance. It should be possible to
>> >> eliminate all allocation from this kind of code.
>> >>
>> >>> On Fri, Sep 11, 2015 at 10:46 PM, Tom Breloff <[email protected]>
>> >>> wrote:
>> >>>
>> >>> I looked at your code for a grand total of 20 seconds, but the one
>> >>> thing I would check is whether using concrete types (Int) in your
>> >>> immutable helps, or maybe making it parametric:
>> >>>
>> >>> immutable AES_cipher_params{T<:Unsigned}
>> >>>     bits::T       # Cipher key length, bits
>> >>>     nk::T         # Number of 32-bit words, cipher key
>> >>>     nb::T         # Number of columns in State
>> >>>     nr::T         # Number of rounds
>> >>>     block_size::T # byte length
>> >>>     block_x::T    # block dimensions X
>> >>>     block_y::T    # block dimensions Y
>> >>> end
>> >>>
>> >>> It's possible the compiler can't infer enough types because of this?
>> >>> Just a guess...
>> >>>
>> >>> On Fri, Sep 11, 2015 at 10:38 PM, Corey Moncure <[email protected]>
>> >>> wrote:
>> >>>>
>> >>>> The creator has descended from the shining heavens and responded
>> >>>> to my post. Cool.
>> >>>>
>> >>>> Here are some stats naively gathered from taking time() before and
>> >>>> after the run and subtracting. The second run is with gc_disable().
>> >>>> A 1.26x speedup is noted.
>> >>>>
>> >>>> elapsed (s): 0.0916
>> >>>> Throughput, KB/s: 1705.97
>> >>>> Average time (μs) per iteration: 0.0
>> >>>> Estimated cycles / iteration @ 4.0 GHz: 36636.0
>> >>>>
>> >>>> elapsed (s): 0.0727
>> >>>> Throughput, KB/s: 2149.6
>> >>>> Average time (μs) per iteration: 7.2688
>> >>>> Estimated cycles / iteration @ 4.0 GHz: 29075.0
>> >>>>
>> >>>>
>> >>>> I'd like to know how to minimize the effect of the garbage collector
>> >>>> and allocations. The algorithm is a tight loop with a handful of tiny
>> >>>> functions, none of which ought to require much allocation. A few
>> >>>> variables for placeholder data are inevitable. But I have read the
>> >>>> warnings discouraging the use of global state. What is the Julia way
>> >>>> to allocate a few bytes of scratch memory that can be accessed within
>> >>>> the scope of, say, apply_ECB_mode!() without having to be reallocated
>> >>>> each time gf_mult() or mix_columns!() are called?
>> >>>>
>> >>>> Also, can a function be forced to mutate variables passed to it,
>> >>>> eliminating a redundant assignment? Julia seems happy to do this
>> >>>> with Arrays but not with primitive values. Pass by reference
>> >>>> (pointer)? Does the LLVM appreciate this?
>> >>>>
>> >>>> On Friday, September 11, 2015 at 6:15:52 PM UTC-4, Stefan Karpinski
>> >>>> wrote:
>> >>>>>
>> >>>>> There's nothing obviously glaring here. I would definitely
>> >>>>> recommend using the built-in profiler to see where it's spending
>> >>>>> its time. There may be subtle type instabilities or some other
>> >>>>> non-obvious issue. You definitely ought to be able to get within
>> >>>>> striking distance of similar C code, which should land in the
>> >>>>> expected range of 4-10x slower than hand-coded assembly.
>> >>>>>
>> >>>>> On Fri, Sep 11, 2015 at 5:10 PM, Corey Moncure <[email protected]>
>> >>>>> wrote:
>> >>>>>>
>> >>>>>> https://github.com/cmoncure/crypto/blob/master/aes.jl
>> >>>>>>
>> >>>>>> In the process of learning Julia (and crypto) I implemented the
>> >>>>>> Rijndael block cipher and inverse cipher. I tried to write
>> >>>>>> idiomatic yet concise code, but the performance is not very
>> >>>>>> desirable. On my machine (i5-2500k @ 4.0 GHz) the throughput is
>> >>>>>> piddling, on the order of 10e6 bytes/sec, and memory allocation is
>> >>>>>> at 3056 bytes / block, which I have not been able to cut down any
>> >>>>>> further.
>> >>>>>>
>> >>>>>> Obviously I do not intend to compete with hand-tuned assembler
>> >>>>>> routines that heavily exploit SIMD and pre-computed tables, but my
>> >>>>>> completely unfounded gut feeling is that given the right input,
>> >>>>>> Julia should be able to approach within a factor of 4-10 without
>> >>>>>> such optimizations. Currently this routine is within a factor of
>> >>>>>> 1000.
>> >>>>>>
>> >>>>>> Any Julia experts out there willing to take a peek at the code and
>> >>>>>> offer some tips for idiomatic (i.e. within the framework of Julia
>> >>>>>> syntax and style) optimizations?
>> >>>>>>
>> >>>>>> In the course of doing this I have run into several gripes with
>> >>>>>> Julia, particularly some of the built-in functions, which are often
>> >>>>>> confusing or contradictory by virtue of the type declarations of
>> >>>>>> certain methods (or the lack of needed ones). For instance, Julia
>> >>>>>> does not support negative indexing of arrays... so why do so many
>> >>>>>> functions on arrays take only signed integer types for dimensions?
>> >>>>>> To the noobie it seems like an obvious choice to type data holding
>> >>>>>> matrix dimensions or indices as unsigned integers, given that the
>> >>>>>> language does not support negative indexing. Yet this fails
>> >>>>>> unexpectedly in many built-ins such as sub().