On Friday, 19 May 2017 at 12:21:10 UTC, biocyberman wrote:
On Friday, 19 May 2017 at 09:17:04 UTC, Biotronic wrote:
On Friday, 19 May 2017 at 07:29:44 UTC, biocyberman wrote:
[...]
Question about your implementation: you assume the input may
contain newlines, but don't handle any other non-ACGT
characters. The problem definition states 'DNA string' and the
sample dataset contains no non-ACGT chars. Is this an
oversight on my part or yours, or did you just decide to support
more than the problem requires?
[...]
Firstly, thank you for showing me various solutions, and even
cool benchmark code. To answer your questions: Yes, I assume the
input file would realistically contain newlines, even though
the problem does not care about them. I also thought about
non-CATG bases, but haven't taken care of those cases. In
reality we should deal with at least ambiguous bases (N).
I ran your code and also see that switch is faster than AA
(i.e. revComp0 is the fastest). And Stefan is right about this.
Some follow-up questions:
1. Why do we need to use assumeUnique in 'revComp0' and
'revComp3'?
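Because `char[] result = new char[N];` is a mutable `char[]`, not
a string (a.k.a. immutable(char)[]), and a `char[]` does not
implicitly convert to a string.
But because it was freshly allocated from the GC in this function
and no other reference to it exists, we know that it is safe to
assume that it is a string.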
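To make the build-then-freeze pattern concrete, here is a minimal sketch. `revCompSketch` is a hypothetical name chosen for illustration, not the `revComp0` from the benchmark, and passing non-ACGT characters through unchanged is my assumption.

```d
import std.exception : assumeUnique;

// Hypothetical helper: illustrates filling a mutable buffer and then
// handing it out as a string without copying.
string revCompSketch(string bases)
{
    // Freshly allocated from the GC; nothing else references it.
    auto result = new char[bases.length];

    foreach (i, c; bases)
    {
        char comp;
        switch (c)
        {
            case 'A': comp = 'T'; break;
            case 'C': comp = 'G'; break;
            case 'G': comp = 'C'; break;
            case 'T': comp = 'A'; break;
            default:  comp = c;   break; // assumption: pass others through
        }
        // Write the complement at the mirrored position.
        result[$ - 1 - i] = comp;
    }

    // char[] does not implicitly convert to string. assumeUnique casts
    // it to immutable without copying, which is only safe because no
    // mutable alias to the buffer survives this function.
    return assumeUnique(result);
}
```

If a mutable reference to `result` escaped somewhere else, the returned string could change behind the caller's back, which is why the "created in this function" condition matters.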
2. What is going on with the trick of making chars enum like
that in 'revComp3'?
What revComp3 is doing is effectively creating a table for each
possible value of char that matches the behaviour of the switch.
It could also be written as:
```
char[256] chars = '\0'; // zero explicitly: char.init is '\xFF', not '\0'
chars['A'] = 'T';
chars['C'] = 'G';
chars['G'] = 'C';
chars['T'] = 'A';
```
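As a rough sketch of how such a table might be used for the whole reverse complement: the names `makeComplementTable`, `complementTable` and `revCompTable` are made up here, and this is not necessarily how `revComp3` is written.

```d
import std.exception : assumeUnique;

// Build the 256-entry complement table. Entries other than A, C, G, T
// stay '\0' (assumption: callers only pass unambiguous bases).
char[256] makeComplementTable()
{
    char[256] table = '\0'; // explicit, because char.init is '\xFF'
    table['A'] = 'T';
    table['C'] = 'G';
    table['G'] = 'C';
    table['T'] = 'A';
    return table;
}

// Initialised at compile time via CTFE, so there is no runtime setup.
static immutable char[256] complementTable = makeComplementTable();

string revCompTable(string bases)
{
    auto result = new char[bases.length];
    foreach (i, c; bases)
        result[$ - 1 - i] = complementTable[c]; // one indexed load per base
    return assumeUnique(result);
}
```

Compared with the switch, the lookup has no branches per base; whether that is actually faster is something the benchmark has to decide.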
Other miscellaneous comments:
If you haven't already, check out
[BioD](https://github.com/biod/BioD) for most (all?) of your
bioinformatics needs.
If you're trying to be fast you probably don't want to use string
for internal calculations, as it wastes most of each byte (only 2
bits out of 8 carry information for ACGT, 4 out of 8 for an
ambiguous encoding).
I would have at least 2 "Dictionaries": one for the standard
nucleotides (ACGT) and another for your ambiguous representations
(UNRYBDHVMKSW-) and the standard nucleotides, to get a better
information density. If you're doing anything with protein
sequences then you should use a translation table anyway as the
DNA -> amino acid mapping changes between species/organelle
(mt|cp|n)DNA.
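For the 2-bit case, a minimal sketch of what the denser representation could look like. The code assignment (A=0, C=1, G=2, T=3) and the function names are my own choices for illustration, not anything taken from BioD.

```d
import std.exception : assumeUnique;

// Hypothetical 2-bit codes. With this assignment the complement of a
// code is simply 3 - code (A<->T, C<->G).
ubyte encodeBase(char c)
{
    switch (c)
    {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default: throw new Exception("not an unambiguous base");
    }
}

char decodeBase(ubyte code)
{
    return "ACGT"[code & 3];
}

// Pack four bases per byte, first base in the lowest two bits.
ubyte[] pack2bit(string bases)
{
    auto packed = new ubyte[(bases.length + 3) / 4]; // zero-initialised
    foreach (i, c; bases)
        packed[i / 4] |= cast(ubyte)(encodeBase(c) << ((i % 4) * 2));
    return packed;
}

// Unpacking needs the base count, since the last byte may be partly unused.
string unpack2bit(const(ubyte)[] packed, size_t length)
{
    auto result = new char[length];
    foreach (i; 0 .. length)
        result[i] = decodeBase(cast(ubyte)((packed[i / 4] >> ((i % 4) * 2)) & 3));
    return assumeUnique(result);
}
```

With this layout "GATTACA" takes 2 bytes instead of 7. The ambiguous alphabet would need a second, 4-bit table along the same lines.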