> Also, if you happen to work with color space data, do you know a way to
> convert color-space contigs into nucleotide-space contigs ?

As mentioned by others on the seqanswers thread, you need to either record
the first base, or convert to a normalised colour space format. Because
you're ignoring quality values in assembly, there's no need to throw out the
first base or the first colour-space read.

Storing reads together with their first base would be a good idea:

struct csRead {
    char firstBase;          // 'N' if not known
    std::string csSequence;
};

or as a kmer:
struct csKmer {
    char firstBase;
    char csSequence[kmer_size];  // kmer_size must be a compile-time constant
};

You mentioned yourself that there's no need to store the reverse complement
when in colour-space. To get the reverse complement in base space, you
complement the first base, convert to base space, then reverse the
sequence.
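
Those three steps can be sketched in C++ roughly as follows. This is only an illustration, not code from Ray: the function names are made up, and it assumes the usual two-bit trick (A=0, C=1, G=2, T=3) where a colour is the XOR of the two neighbouring base codes, with the first base counted as part of the decoded sequence.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Two-bit base codes, chosen so that colour == code(a) XOR code(b).
inline int baseCode(char b) {
    switch (b) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return -1;  // 'N' or anything else
    }
}

// With the codes above, A<->T and C<->G are simply code <-> 3 - code.
inline char complementBase(char b) {
    int code = baseCode(b);
    return code < 0 ? 'N' : "ACGT"[3 - code];
}

// Hypothetical helper: decode a first base plus colours ('0'..'3') into
// base space. The returned string starts with the first base itself.
std::string decodeToBases(char firstBase, const std::string& colours) {
    int code = baseCode(firstBase);
    assert(code >= 0 && "the first base must be known to decode");
    std::string bases(1, firstBase);
    for (char c : colours) {
        assert(c >= '0' && c <= '3');
        code ^= (c - '0');             // each colour moves us to the next base
        bases.push_back("ACGT"[code]);
    }
    return bases;
}

// Reverse complement in base space: complement the first base, decode,
// then reverse the decoded sequence.
std::string reverseComplementBases(char firstBase, const std::string& colours) {
    std::string bases = decodeToBases(complementBase(firstBase), colours);
    std::reverse(bases.begin(), bases.end());
    return bases;
}
```

For example, decodeToBases('A', "3200233") gives ATCCCTAT and reverseComplementBases('A', "3200233") gives ATAGGGAT.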

The matching of sequences should be done in colour-space. Unfortunately (or
fortunately), every colour-space sequence can match four possible base-space
sequences, one per choice of first base (8 if you include reverse
complements). For long enough high-entropy sequences, this shouldn't matter,
but it's worth considering when choosing k-mer size:

A3200233 -> ATCCCTAT (ATAGGGAT)
C3200233 -> CGAAAGCG (CGCTTTCG)
G3200233 -> GCTTTCGC (GCGAAAGC)
T3200233 -> TAGGGATA (TATCCCTA)

[apologies if I got those wrong... I did a manual conversion]
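
A small encoder makes the ambiguity easy to check by machine rather than by hand. The sketch below is only an illustration: encodeToColours is a made-up name, it assumes the XOR relationship between two-bit base codes (A=0, C=1, G=2, T=3), and it handles only A/C/G/T input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Two-bit base codes, chosen so that colour == code(a) XOR code(b).
inline int baseCode(char b) {
    switch (b) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return -1;
    }
}

// Hypothetical helper: encode a base-space sequence as colours.
// A sequence of n bases yields n - 1 colours; the first base is not kept
// here, so it would need to be stored separately.
std::string encodeToColours(const std::string& bases) {
    std::string colours;
    for (std::size_t i = 1; i < bases.size(); ++i) {
        int a = baseCode(bases[i - 1]);
        int b = baseCode(bases[i]);
        // Assumption: this sketch does not try to handle 'N' or other codes.
        assert(a >= 0 && b >= 0 && "sketch handles only A/C/G/T");
        colours.push_back(static_cast<char>('0' + (a ^ b)));
    }
    return colours;
}
```

Encoding ATCCCTAT, CGAAAGCG, GCTTTCGC or TAGGGATA gives 3200233 in every case, and encoding any of the four reverse complements (ATAGGGAT, for instance) gives the same colours reversed, 3320023.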

If there is a good chance of a match between two reads, and one read has an
unknown first base, then you can infer that base from the other read.
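
As a rough sketch of that inference (illustrative only): CsRead mirrors the struct above, while baseAt, inferFirstBase and the offset parameter are made-up names. It assumes both reads are on the same strand, that the overlap position is already known, and that colours are valid '0'..'3' characters.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

struct CsRead {
    char firstBase;          // 'N' if not known
    std::string csSequence;  // colours '0'..'3'
};

// Two-bit base codes, chosen so that colour == code(a) XOR code(b).
inline int baseCode(char b) {
    switch (b) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return -1;
    }
}

// Base at position pos of a read whose first base is known
// (position 0 is the first base itself).
char baseAt(const CsRead& read, std::size_t pos) {
    int code = baseCode(read.firstBase);
    assert(code >= 0 && pos <= read.csSequence.size());
    for (std::size_t i = 0; i < pos; ++i) {
        code ^= (read.csSequence[i] - '0');
    }
    return "ACGT"[code];
}

// If `unknown` starts at base position `offset` of `known` and the
// overlapping colours agree exactly, copy the implied base into
// unknown.firstBase. Returns false (and changes nothing) otherwise.
bool inferFirstBase(const CsRead& known, std::size_t offset, CsRead& unknown) {
    if (baseCode(known.firstBase) < 0 || offset > known.csSequence.size()) {
        return false;
    }
    std::size_t overlap = std::min(unknown.csSequence.size(),
                                   known.csSequence.size() - offset);
    if (overlap == 0) {
        return false;  // no shared colours, so nothing supports the match
    }
    if (known.csSequence.compare(offset, overlap,
                                 unknown.csSequence, 0, overlap) != 0) {
        return false;
    }
    unknown.firstBase = baseAt(known, offset);
    return true;
}
```

With known = A3200233 and an unknown-first-base read 0233 placed at offset 3, the inferred first base is C. A real implementation would presumably tolerate some colour mismatches instead of requiring an exact overlap.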

If you want to mix colour-space and non-cs reads (as would fit in with the
"combine every technology" spirit of Ray), it makes sense to store everything
in this format (i.e. the internal representation of all sequences is in
colour-space). The 'unknown first base' problem can be alleviated somewhat by
storing the non-colour space reads into the graph first. Using an internal
representation of colour-space would also halve memory requirements, because
there's no longer a need to explicitly store the reverse complement sequence
(just make sure the match function matches in both forward and reverse
directions in colour space).
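
A match function along those lines could look something like the sketch below. The names are made up, and the "canonical k-mer" (the lexicographically smaller of the forward and reversed colour strings) is my own assumption about how one might key a table so both strands share one entry; it is not something Ray is known to do.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// In colour space the reverse complement is just the colours reversed.
std::string reversedColours(const std::string& colours) {
    return std::string(colours.rbegin(), colours.rend());
}

// True if the two colour strings agree in the forward or reverse direction.
bool matchesEitherStrand(const std::string& a, const std::string& b) {
    return a == b || a == reversedColours(b);
}

// Hypothetical table key: the smaller of a colour k-mer and its reversal,
// so that both strands map to the same stored entry.
std::string canonicalColourKmer(const std::string& colours,
                                std::size_t pos, std::size_t k) {
    assert(k > 0 && pos + k <= colours.size());
    std::string forward = colours.substr(pos, k);
    std::string backward = reversedColours(forward);
    return std::min(forward, backward);
}
```

Here matchesEitherStrand("3200233", "3320023") is true, and both of those strings give the canonical key 3200233.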

For reading out into contigs, if output is desired in base-space, only
produce a contig if the base-space sequence is known. As you read through the
graph that makes up each contig, you'll get a better idea of what the bases
should be (i.e. high coverage with a particular initial base, or many
different nodes that are consistent with a particular base sequence).
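
One very simple way to picture the "high coverage with a particular initial base" idea is a majority vote, sketched below. This is only my guess at a workable scheme, not a description of any existing assembler: PlacedRead, voteContigFirstBase and the offset field are made-up names, all reads are assumed to lie on the contig's strand, and ties or empty votes are reported as 'N' so that no base-space contig would be emitted.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct PlacedRead {
    char firstBase;      // 'N' if not known
    std::size_t offset;  // contig position of the read's first base
};

// Two-bit base codes, chosen so that colour == code(a) XOR code(b).
inline int baseCode(char b) {
    switch (b) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return -1;
    }
}

// Each read with a known first base votes for the contig's first base.
// Returns the unique winner, or 'N' if there are no votes or a tie.
char voteContigFirstBase(const std::string& contigColours,
                         const std::vector<PlacedRead>& reads) {
    // prefix[i] is the XOR of colours 0..i-1, so base(i) = base(0) ^ prefix[i].
    std::vector<int> prefix(contigColours.size() + 1, 0);
    for (std::size_t i = 0; i < contigColours.size(); ++i) {
        char c = contigColours[i];
        assert(c >= '0' && c <= '3');
        prefix[i + 1] = prefix[i] ^ (c - '0');
    }
    std::array<int, 4> votes{};  // zero-initialised counts for A, C, G, T
    for (const PlacedRead& r : reads) {
        int code = baseCode(r.firstBase);
        if (code < 0 || r.offset >= prefix.size()) {
            continue;  // unknown first base or placed outside the contig
        }
        ++votes[code ^ prefix[r.offset]];
    }
    auto top = std::max_element(votes.begin(), votes.end());
    if (*top == 0 || std::count(votes.begin(), votes.end(), *top) != 1) {
        return 'N';
    }
    return "ACGT"[top - votes.begin()];
}
```

For a contig with colours 3200233, reads starting with C at offset 3 and T at offset 5 both vote for A, which outvotes a single erroneous read claiming G at offset 0.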

It may also be useful to have an option to output in colour-space, just in
case someone wants to do further processing and error correction.

-- David Eccles (gringer)

_______________________________________________
Denovoassembler-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users