On Friday, 19 May 2017 at 12:21:10 UTC, biocyberman wrote:
On Friday, 19 May 2017 at 09:17:04 UTC, Biotronic wrote:
On Friday, 19 May 2017 at 07:29:44 UTC, biocyberman wrote:
[...]

Question about your implementation: you assume the input may contain newlines, but don't handle any other non-ACGT characters. The problem definition states 'DNA string' and the sample dataset contains no non-ACGT chars. Is this an oversight on my part or yours, or did you just decide to support more than the problem requires?

[...]

Firstly, thank you for showing me various solutions, and even cool benchmark code. To answer your questions: yes, I assume the input file would realistically contain newlines, even though the problem does not care about them. I also thought about non-ACGT bases, but haven't taken care of those cases. In reality we should deal with at least ambiguous bases (N).

I ran your code and also see that switch is faster than AA (i.e. revComp0 is the fastest). And Stefan is right about this.

Some follow-up questions:

1. Why do we need to use assumeUnique in 'revComp0' and 'revComp3'?


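Because `char[] result = new char[N];` is a mutable `char[]`, not a `string` (a.k.a. `immutable(char)[]`). But because the array was freshly allocated from the GC inside this function and no other reference to it exists, we know it is safe to treat it as a string, and `assumeUnique` performs that conversion.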
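
To make that concrete, here is a minimal sketch of the pattern. `revCompSketch` is a hypothetical stand-in, not the actual `revComp0` from the benchmark, and passing non-ACGT characters through unchanged is my own assumption:

```d
import std.exception : assumeUnique;

// Hypothetical stand-in for revComp0; the benchmarked function may differ.
string revCompSketch(string bps)
{
    // Mutable buffer, freshly allocated by the GC inside this function.
    auto result = new char[bps.length];
    foreach (i, c; bps)
    {
        char comp;
        switch (c)
        {
            case 'A': comp = 'T'; break;
            case 'C': comp = 'G'; break;
            case 'G': comp = 'C'; break;
            case 'T': comp = 'A'; break;
            default:  comp = c;   break; // assumption: pass other chars through
        }
        result[$ - 1 - i] = comp; // write from the back to reverse
    }
    // Nothing else holds a reference to `result`, so treating it as
    // immutable is safe. assumeUnique also nulls out `result`.
    return assumeUnique(result);
}
```

Without the `assumeUnique` call the `return` would not compile, since `char[]` does not implicitly convert to `string`.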

2. What is going on with the trick of making chars enum like that in 'revComp3'?

What revComp3 is doing is effectively creating a lookup table with one entry for each possible value of char, which matches the behaviour of the switch.
It could also be rewritten as
```
char[256] chars = '\0'; // zero explicitly: char.init is 0xFF, not '\0'
chars['A'] = 'T';
chars['C'] = 'G';
chars['G'] = 'C';
chars['T'] = 'A';
```
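
As a rough sketch of how such a table could be used end to end: the names `makeComplementTable`, `complement` and `revCompTable` below are made up for illustration and are not from the benchmark code. The table is built at compile time via CTFE, and any character other than ACGT maps to `'\0'`:

```d
import std.exception : assumeUnique;

// Hypothetical helper: builds the 256-entry table; usable at compile time.
char[256] makeComplementTable()
{
    char[256] table = '\0'; // explicit, because char.init is 0xFF
    table['A'] = 'T';
    table['C'] = 'G';
    table['G'] = 'C';
    table['T'] = 'A';
    return table;
}

// The initializer of a module-level immutable is evaluated at compile time.
static immutable char[256] complement = makeComplementTable();

// Hypothetical table-driven reverse complement.
string revCompTable(string bps)
{
    auto result = new char[bps.length];
    foreach (i, c; bps)
        result[$ - 1 - i] = complement[c]; // one indexed load per base
    return assumeUnique(result);
}
```

The lookup replaces the branches of the switch with a single array index, at the cost of silently producing `'\0'` for unexpected input.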

Other miscellaneous comments:

If you haven't already, check out [BioD](https://github.com/biod/BioD) for most (all?) of your bioinformatics needs.

If you're trying to be fast you probably don't want to use string for internal calculations, as it wastes most of each byte: ACGT needs only 2 bits out of 8, and an encoding with ambiguity codes needs only 4 out of 8. I would have at least 2 "dictionaries" to get a better information density: one for the standard nucleotides (ACGT), and another for your ambiguous representations (UNRYBDHVMKSW-) plus the standard nucleotides. If you're doing anything with protein sequences then you should use a translation table anyway, as the DNA -> amino acid mapping changes between species/organelle (mt|cp|n)DNA.
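
For the ACGT-only case, a 2-bit packing could look something like the sketch below. The code assignment (A=0, C=1, G=2, T=3) and the function names are my own choices for illustration, not anything BioD prescribes. A nice side effect of this particular assignment is that the complement of a code is simply `3 - code`:

```d
// Hypothetical 2-bit code: A=0, C=1, G=2, T=3.
ubyte encodeBase(char c)
{
    switch (c)
    {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  throw new Exception("only ACGT is supported in this sketch");
    }
}

// With the assignment above, A<->T and C<->G are both `3 - code`.
ubyte complementCode(ubyte code)
{
    return cast(ubyte)(3 - code);
}

// Packs four bases per byte, lowest bits first.
ubyte[] pack(string bps)
{
    auto packed = new ubyte[(bps.length + 3) / 4]; // zero-initialized
    foreach (i, c; bps)
    {
        immutable shift = cast(uint)(2 * (i % 4));
        packed[i / 4] |= cast(ubyte)(encodeBase(c) << shift);
    }
    return packed;
}

// Reads base number i back out of the packed form.
char baseAt(const(ubyte)[] packed, size_t i)
{
    immutable shift = cast(uint)(2 * (i % 4));
    return "ACGT"[(packed[i / 4] >> shift) & 3];
}
```

The packed form does not record the sequence length, so that has to be stored alongside it. The ambiguous alphabet would need 4 bits per base (two per byte) with its own table.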
