On Thu, 2011-06-23 at 04:53 -0400, Eccles, David wrote:
> > Also, if you happen to work with color space data, do you know a way to
> > convert color-space contigs into nucleotide-space contigs ?
>
> As mentioned by others on the seqanswers thread, you need to either record
> the first base, or convert to a normalised colour space format. Because
> you're ignoring quality values in assembly, there's no need to throw out the
> first base or the first colour-space read.
>
> Storing reads together with their first base would be a good idea:
>
> struct csRead {
>     char firstBase;         // 'N' if not known
>     std::string csSequence;
> };
>
> or as a kmer:
> struct csKmer {
>     char firstBase;
>     char csSequence[kmer_size]; // kmer_size must be a compile-time constant
> };
>
> You mentioned yourself that there's no need to store the reverse complement
> when in colour-space. To get the reverse complement in base space, you
> complement the first base, convert to base space, then reverse the
> sequence.
>
Complement the first base and reverse the colors -- this is the recipe to
"reverse-complement" a color-space read.
I think I am starting to get it.
> The matching of sequences should be done in colour-space. Unfortunately (or
> fortunately), for every colour-space sequence there are four possible
> sequences that can match to it (8 if you include reverse complement). For
> long enough high-entropy sequences, this shouldn't matter, but it's worth
> considering when choosing k-mer size:
>
Ray can currently assemble color-space reads in color space, using the
alphabet {0,1,2,3}.
> A3200233 -> ATCCCTAT (ATAGGGAT); C3200233 -> CGAAAGCG (CGCTTTCG); G3200233 ->
> GCTTTCGC (GCGAAAGC); T3200233 -> TAGGGATA (TATCCCTA)
>
> [apologies if I got those wrong... I did a manual conversion]
>
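For reference, here is a minimal sketch of the decoding and of the
reverse-complement recipe quoted above. It assumes the standard SOLiD two-base
encoding (A=0, C=1, G=2, T=3, color = XOR of the two base codes). The names
decodeColors, reverseComplementFromColors and checkThreadExamples are made up
for illustration and are not part of Ray.

```cpp
#include <algorithm>
#include <cassert>
#include <stdexcept>
#include <string>

// Standard SOLiD two-base encoding: with A=0, C=1, G=2, T=3,
// color = code(previous base) XOR code(next base).
const std::string kBases = "ACGT";

int baseCode(char base) {
    std::string::size_type pos = kBases.find(base);
    if (pos == std::string::npos) throw std::invalid_argument("unknown base");
    return static_cast<int>(pos);
}

// Decode a color string into bases; the result includes firstBase.
std::string decodeColors(char firstBase, const std::string& colors) {
    int code = baseCode(firstBase);
    std::string bases(1, firstBase);
    for (char c : colors) {
        if (c < '0' || c > '3') throw std::invalid_argument("unknown color");
        code ^= (c - '0');
        bases += kBases[code];
    }
    return bases;
}

// Base-space reverse complement, following the recipe in the thread:
// complement the first base, decode, then reverse the decoded string.
std::string reverseComplementFromColors(char firstBase, const std::string& colors) {
    char complemented = kBases[3 - baseCode(firstBase)];  // A<->T, C<->G
    std::string bases = decodeColors(complemented, colors);
    std::reverse(bases.begin(), bases.end());
    return bases;
}

// Usage: checks the four manual conversions quoted in the thread.
void checkThreadExamples() {
    assert(decodeColors('A', "3200233") == "ATCCCTAT");
    assert(decodeColors('C', "3200233") == "CGAAAGCG");
    assert(decodeColors('G', "3200233") == "GCTTTCGC");
    assert(decodeColors('T', "3200233") == "TAGGGATA");
    assert(reverseComplementFromColors('A', "3200233") == "ATAGGGAT");
}
```

Under that encoding, the four manual conversions above check out.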
> If there is a good chance of a match between two reads, and one read has an
> unknown first base, then you can infer that base from the other read.
>
Yes, but keep in mind that Ray never computes pairwise similarity.
> If you want to mix colour-space and non-cs reads (as would fit in with the
> "combine every technology" spirit of Ray), it makes sense to store everything
> in this format (i.e. the internal representation of all sequences is in
> colour-space).
I agree.
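A rough sketch of what converting a base-space read to this internal format
could look like, again assuming the standard SOLiD two-base encoding (A=0,
C=1, G=2, T=3, color = XOR of adjacent base codes). CsRead mirrors the csRead
idea quoted earlier; encodeBases and checkEncodeExample are hypothetical names,
not Ray functions.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical container mirroring the csRead idea from the thread.
struct CsRead {
    char firstBase;          // first base of the read, kept in base space
    std::string csSequence;  // colors as characters '0'..'3'
};

int baseCode(char base) {
    static const std::string bases = "ACGT";
    std::string::size_type pos = bases.find(base);
    if (pos == std::string::npos) throw std::invalid_argument("unknown base");
    return static_cast<int>(pos);
}

// Encode a base-space read: keep the first base, then emit one color
// per pair of adjacent bases.
CsRead encodeBases(const std::string& bases) {
    if (bases.empty()) throw std::invalid_argument("empty read");
    CsRead read{bases[0], ""};
    int previous = baseCode(bases[0]);
    for (std::size_t i = 1; i < bases.size(); ++i) {
        int current = baseCode(bases[i]);
        read.csSequence += static_cast<char>('0' + (previous ^ current));
        previous = current;
    }
    return read;
}

// Usage: a read and its reverse complement give the same colors, reversed.
void checkEncodeExample() {
    CsRead forward = encodeBases("ATCCCTAT");
    assert(forward.firstBase == 'A' && forward.csSequence == "3200233");
    CsRead reverse = encodeBases("ATAGGGAT");  // reverse complement of the above
    assert(reverse.csSequence == "3320023");
}
```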
> The 'unknown first base' problem can be alleviated somewhat by
> storing the non-colour space reads into the graph first. Using an internal
> representation of colour-space would also halve memory requirements, because
> there's no longer a need to explicitly store the reverse complement sequence
> (just make sure the match function matches in both forward and reverse
> directions in colour space).
In Ray, the reverse-complement of any read is never stored.
Furthermore, for any pair of reverse-complement k-mers, only one (the
lowest) is stored.
Like in Velvet, Ray uses 2 bits per symbol.
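To illustrate the idea (this is not Ray's or Velvet's actual code), here is a
sketch of 2-bit packing for a color k-mer and of keeping only the lower member
of a pair. It assumes k <= 32, that "lowest" means the numerically smaller
packed value, and that the other strand of a color k-mer is simply the same
colors in reverse order. All function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Pack a color k-mer (k <= 32) into 64 bits, 2 bits per symbol.
// The first symbol ends up in the most significant used bits, so comparing
// packed values of equal k agrees with comparing the strings.
std::uint64_t packColors(const std::string& colors) {
    if (colors.size() > 32) throw std::invalid_argument("k-mer longer than 32");
    std::uint64_t packed = 0;
    for (char c : colors) {
        if (c < '0' || c > '3') throw std::invalid_argument("unknown color");
        packed = (packed << 2) | static_cast<std::uint64_t>(c - '0');
    }
    return packed;
}

// Reverse the order of the k two-bit symbols in a packed k-mer.
std::uint64_t reversePacked(std::uint64_t packed, int k) {
    std::uint64_t reversed = 0;
    for (int i = 0; i < k; ++i) {
        reversed = (reversed << 2) | (packed & 3u);
        packed >>= 2;
    }
    return reversed;
}

// Keep only the smaller of a color k-mer and its reversal.
std::uint64_t canonicalColorKmer(const std::string& colors) {
    std::uint64_t forward = packColors(colors);
    std::uint64_t backward = reversePacked(forward, static_cast<int>(colors.size()));
    return std::min(forward, backward);
}

// Usage: both orientations map to the same stored value.
void checkCanonicalExample() {
    assert(canonicalColorKmer("3200233") == canonicalColorKmer("3320023"));
    assert(canonicalColorKmer("3320023") == packColors("3200233"));
}
```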
>
> For reading out into contigs, if output is desired in base-space, only
> produce a contig if the base-space sequence is known. As you read through the
> graph that makes up each contig, you'll get a better idea of what the bases
> should be (i.e. high coverage with a particular initial base, or many
> different nodes that are consistent with a particular base sequence).
>
Well, this is where I think I don't get it at all.
Contigs *are* paths in the k-mer graph (with alphabet {0,1,2,3}). To
decode these paths, the first base-space symbol must be known. However,
a path can obviously start in the middle of a read -- thus in that case
the first base would remain unknown. (right?)
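One possible way to look at the mid-read case, as a sketch only: if the start
of a path can be traced back to a read whose first base is known, the base at
any offset in that read follows from the first base and the colors before that
offset. The name baseAtOffset and the idea of tracking a (read, offset) pair
for the start of a path are assumptions for illustration; nothing here
describes what Ray actually does.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

const std::string kBases = "ACGT";

// Base reached after consuming the first `offset` colors of a read whose
// first base is known. With A=0, C=1, G=2, T=3, XOR-ing the colors onto the
// first base code accumulates the net transition.
char baseAtOffset(char firstBase, const std::string& colors, std::size_t offset) {
    std::string::size_type code = kBases.find(firstBase);
    if (code == std::string::npos) throw std::invalid_argument("unknown first base");
    if (offset > colors.size()) throw std::out_of_range("offset beyond read");
    for (std::size_t i = 0; i < offset; ++i) {
        if (colors[i] < '0' || colors[i] > '3') throw std::invalid_argument("unknown color");
        code ^= static_cast<std::string::size_type>(colors[i] - '0');
    }
    return kBases[code];
}

// Usage: A3200233 decodes to ATCCCTAT, so offset 3 gives 'C'.
void checkOffsetExample() {
    assert(baseAtOffset('A', "3200233", 0) == 'A');
    assert(baseAtOffset('A', "3200233", 3) == 'C');
    assert(baseAtOffset('A', "3200233", 7) == 'T');
}
```

A path that starts at that offset could then be decoded from the recovered
base onward.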
> It may also be useful for an option to output as colour-space, just in case
> someone wants to do further processing and error correction.
>
Ray presently assembles csfasta files into color-space contigs.
I used the open dataset
http://solidsoftwaretools.com/gf/project/ecoli50x50/
(1 year old according to the web site -- 2010-05-19)
However, this dataset has a very high error rate!
Sébastien
> -- David Eccles (gringer)
>
> _______________________________________________
> Denovoassembler-users mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/denovoassembler-users