Re: Code improvement for DNA reverse complement?

Nicholas Wilson via Digitalmars-d-learn Fri, 19 May 2017 06:41:05 -0700

On Friday, 19 May 2017 at 12:21:10 UTC, biocyberman wrote:

On Friday, 19 May 2017 at 09:17:04 UTC, Biotronic wrote:
On Friday, 19 May 2017 at 07:29:44 UTC, biocyberman wrote:
[...]
Question about your implementation: you assume the input may contain newlines, but don't handle any other non-ACGT characters. The problem definition states 'DNA string' and the sample dataset contains no non-ACGT chars. Is this an oversight on my part or yours, or did you just decide to support more than the problem requires?
[...]
Firstly, thank you for showing me various solutions, and even cool benchmark code. To answer your questions: Yes, I assume the input file would realistically contain newlines, even though the problem does not care about them. I also thought about non-CATG bases, but haven't taken care of those cases. In reality we should deal with at least ambiguous bases (N).
I ran your code and also see that switch is faster than AA (i.e. revComp0 is the fastest). And Stefan is right about this.
Some follow up questions:
1. Why do we need to use assumeUnique in 'revComp0' and 'revComp3'?

Because `char[] result = new char[N];` is not a string (a.k.a. immutable(char)[]). But because it was freshly allocated from the GC in this function, and no other reference to it exists, we know that it is safe to assume that it is a string.
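To make that concrete, here is a minimal sketch of the pattern. `revCompSwitch` is a hypothetical name, not the `revComp0` from the thread, and passing unknown characters through unchanged is my assumption:

```d
import std.exception : assumeUnique;

// Hypothetical helper illustrating the pattern; not the thread's revComp0.
string revCompSwitch(string seq)
{
    // Mutable buffer, freshly allocated; nothing else holds a reference to it.
    char[] result = new char[seq.length];
    foreach (i, c; seq)
    {
        char comp;
        switch (c)
        {
            case 'A': comp = 'T'; break;
            case 'C': comp = 'G'; break;
            case 'G': comp = 'C'; break;
            case 'T': comp = 'A'; break;
            default:  comp = c;   break; // assumption: pass unknown chars through
        }
        // Write from the back so the output is reversed.
        result[seq.length - 1 - i] = comp;
    }
    // char[] -> string without copying; valid only because `result`
    // never escaped this function.
    return assumeUnique(result);
}
```

If `result` had been stored somewhere else before the call, a later write through that other reference would silently mutate the "immutable" string, which is why the conversion is not implicit.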

2. What is going on with the trick of making chars enum like that in 'revComp3'?

What revComp3 is doing is effectively creating a table for each possible value of char that matches the behaviour of the switch.

It could also be rewritten as:
```
char[256] chars = '\0'; // explicit: char.init is 0xFF in D, not '\0'
chars['A'] = 'T';
chars['C'] = 'G';
chars['G'] = 'C';
chars['T'] = 'A';
```
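A rough sketch of how such a table might be used for the whole reverse complement follows. `makeComplementTable`, `complement` and `revCompTable` are names I made up for illustration; the original `revComp3` may be structured differently:

```d
import std.exception : assumeUnique;

// Hypothetical helper: builds the 256-entry lookup table.
char[256] makeComplementTable()
{
    char[256] t = '\0'; // explicit zero, since char.init is 0xFF in D
    t['A'] = 'T';
    t['C'] = 'G';
    t['G'] = 'C';
    t['T'] = 'A';
    return t;
}

// A module-level immutable initialiser is evaluated at compile time (CTFE).
immutable char[256] complement = makeComplementTable();

string revCompTable(string seq)
{
    char[] result = new char[seq.length];
    foreach (i, c; seq)
        result[$ - 1 - i] = complement[c]; // one array lookup per base, no branching
    return assumeUnique(result);
}
```

Note that in this sketch any character other than A, C, G or T maps to '\0', so the input has to be cleaned (newlines stripped, etc.) beforehand.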

Other miscellaneous comments:

If you haven't already, check out [BioD](https://github.com/biod/BioD) for most (all?) of your bioinformatics needs.

If you're trying to be fast you probably don't want to use string for internal calculations, as it wastes most of each byte (only 2 bits of information out of 8 for ACGT, 4 out of 8 for an ambiguous encoding). I would have at least 2 "dictionaries": one for the standard nucleotides (ACGT) and another for your ambiguous representations (UNRYBDHVMKSW-) plus the standard nucleotides, to get a better information density. If you're doing anything with protein sequences then you should use a translation table anyway, as the DNA -> amino acid mapping changes between species/organelle (mt|cp|n)DNA.
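As a rough illustration of the 2-bit idea for the ACGT-only dictionary, something like the following would do. The code assignment A=0, C=1, G=2, T=3 and all function names are my own assumptions, not anything from BioD:

```d
// Assumed encoding: A=0, C=1, G=2, T=3, chosen so that the
// complement of a code is simply 3 - code.
ubyte encodeBase(char c)
{
    switch (c)
    {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default: assert(0, "only ACGT is supported in this sketch");
    }
}

ubyte complementCode(ubyte code)
{
    return cast(ubyte)(3 - code);
}

// Packs four bases per byte, lowest bits first.
ubyte[] pack(string seq)
{
    auto packed = new ubyte[(seq.length + 3) / 4]; // ubyte.init is 0
    foreach (i, c; seq)
    {
        immutable shift = cast(uint)(i % 4) * 2;
        packed[i / 4] |= cast(ubyte)(encodeBase(c) << shift);
    }
    return packed;
}

// Reads base number i back out of a packed buffer.
char baseAt(const(ubyte)[] packed, size_t i)
{
    immutable shift = cast(uint)(i % 4) * 2;
    return "ACGT"[(packed[i / 4] >> shift) & 3];
}
```

The packed buffer does not record the sequence length, so that would have to be stored alongside it. An ambiguous alphabet would need 4 bits per base (two bases per byte) instead.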
