On Monday, 22 May 2017 at 08:58:24 UTC, biocyberman wrote:
@Nicolas Wilson: Your explanation of the enum is clear and very
helpful. I can recall to the same technique used in kh_hash in
samtools and the associated. With that said, the chars enum is
only to 'T' (85) elements.
The reason for having only 85 elements in the array was pure
laziness - the problem description seems to forbid non-ACGT
letters, so I saw no reason to write more code to handle them. :p
My code will crash if the input string contains lower-case
letters, Zs, or any other letter beyond 'T' (or read random bits
of memory if you're lucky).
@ag0aep6g
You fell into a trap there. The value is calculated at compile
time, but it has >copy/paste-like behavior. That is, whenever
you use `chars`, the code behaves as if you >typed out the
array literal. That means, the whole array is re-created on
every iteration.
Use `static immutable` instead. It still forces compile-time
calculation, but it doesn't > have copy/paste behavior. Speeds
up revComp3 a lot.
With 'iteration' here you mean running lifetime of the
function, or in other words, each one of the 10_000 cycles in
the benchmark?
Could you provide some more reading for what you are telling
here? I can only guess it is intrinsic behavior of an 'enum'.
ag0aep6g is absolutely correct in his observation, and the
resulting code is basically this:
string revComp3(string bps) {
const N = bps.length;
static immutable chars_saved = [Repeat!('A'-'\0', '\0'), 'T',
Repeat!('C'-'A'-1, '\0'), 'G',
Repeat!('G'-'C'-1, '\0'), 'C',
Repeat!('T'-'G'-1, '\0'), 'A'];
char[] result = new char[N];
for (int i = 0; i < N; ++i) {
auto chars = chars_saved.dup; // Bad stuff happens here
result[i] = chars[bps[N-i-1]];
}
return result.assumeUnique;
}
As we can see, it copies the entire array for every character in
the input string. That's basically an allocation and a memcpy in
the innermost, hottest loop. Roughly as bad as it gets. (yup,
that's 20 million allocations)
As for why this happens, enum can be thought of as the analog of
C's #define - the compiler precalculates the data to fill the
array, and then copies that into the source code every time it's
used.