On Monday, 22 May 2017 at 08:58:24 UTC, biocyberman wrote:
@Nicolas Wilson: Your explanation of the enum is clear and very helpful. I recall the same technique being used in kh_hash in samtools and associated tools. That said, the chars enum only goes up to 'T' (85 elements).

The reason for having only 85 elements in the array was pure laziness - the problem description seems to forbid non-ACGT letters, so I saw no reason to write more code to handle them. :p My code will crash if the input string contains lower-case letters, Zs, or any other letter beyond 'T' (or read random bits of memory if you're lucky).
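For completeness, here is a rough sketch of what a less lazy version might look like, assuming a full 256-entry table is acceptable. The names makeComplementTable, complement and revCompFull are made up for this illustration and are not part of the benchmark code; any byte other than A, C, G or T simply maps to '\0' instead of indexing past the end of the array.

```d
import std.exception : assumeUnique;

// Hypothetical helper (not from the original code): builds a table
// covering every possible char value, evaluated at compile time.
char[256] makeComplementTable()
{
    char[256] t = '\0';   // every entry starts out as '\0'
    t['A'] = 'T';
    t['C'] = 'G';
    t['G'] = 'C';
    t['T'] = 'A';
    return t;
}

// static immutable: computed once at compile time, stored once.
static immutable char[256] complement = makeComplementTable();

// Hypothetical variant of revComp3 that cannot index out of bounds,
// because a char is always in the range 0 .. 255.
string revCompFull(string bps)
{
    auto result = new char[bps.length];
    foreach (i, c; bps)
        result[$ - 1 - i] = complement[c];
    return result.assumeUnique;
}
```

Whether silently emitting '\0' for bad input is better than crashing is a design choice; the point is only that the table size decides which inputs are safe.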

@ag0aep6g
You fell into a trap there. The value is calculated at compile time, but it has copy/paste-like behavior. That is, whenever you use `chars`, the code behaves as if you typed out the array literal. That means, the whole array is re-created on every iteration.

Use `static immutable` instead. It still forces compile-time calculation, but it doesn't have copy/paste behavior. Speeds up revComp3 a lot.

By 'iteration' here, do you mean the running lifetime of the function, or in other words, each one of the 10_000 cycles in the benchmark?

Could you point me to some more reading on what you describe here? I can only guess it is intrinsic behavior of an 'enum'.

ag0aep6g is absolutely correct in his observation, and the resulting code is basically this:

import std.exception : assumeUnique;
import std.meta : Repeat;

string revComp3(string bps) {
    const N = bps.length;
    static immutable chars_saved = [Repeat!('A'-'\0', '\0'), 'T',
                Repeat!('C'-'A'-1, '\0'), 'G',
                Repeat!('G'-'C'-1, '\0'), 'C',
                Repeat!('T'-'G'-1, '\0'), 'A'];
    char[] result = new char[N];
    for (int i = 0; i < N; ++i) {
        auto chars = chars_saved.dup; // Bad stuff happens here
        result[i] = chars[bps[N-i-1]];
    }
    return result.assumeUnique;
}

As we can see, it copies the entire array for every character in the input string. That's basically an allocation and a memcpy in the innermost, hottest loop. Roughly as bad as it gets. (yup, that's 20 million allocations)

As for why this happens, enum can be thought of as the analog of C's #define - the compiler precalculates the data to fill the array, and then copies that into the source code every time it's used.
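A small sketch that should make the difference visible, assuming nothing beyond the language itself. The names viaEnum, viaImmutable and the two functions are invented for this example. Each use of the enum is expected to produce a freshly allocated array, while every use of the static immutable refers to the same storage.

```d
// Manifest constant: has no storage of its own, each use behaves
// like writing the array literal [1, 2, 3] at that spot.
enum int[] viaEnum = [1, 2, 3];

// One array, built at compile time and stored once in static data.
static immutable int[] viaImmutable = [1, 2, 3];

// Hypothetical check: do two uses of the enum share storage?
// Expected to return false, since each use allocates a new array.
bool enumUsesShareStorage()
{
    auto a = viaEnum;
    auto b = viaEnum;
    return a.ptr is b.ptr;
}

// Hypothetical check: do two uses of the static immutable share storage?
// Returns true, since both are slices of the same static array.
bool immutableUsesShareStorage()
{
    auto c = viaImmutable;
    auto d = viaImmutable;
    return c.ptr is d.ptr;
}
```

Put the enum use inside a loop and you get one allocation per iteration, which is exactly the trap in revComp3.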
