Re: [Denovoassembler-users] Denovoassembler-users Digest, Vol 10, Issue 8

Sébastien Boisvert Mon, 27 Jun 2011 05:43:47 -0700

On Thu, 2011-06-23 at 04:53 -0400, Eccles, David wrote:
> > Also, if you happen to work with color-space data, do you know a way to
> > convert color-space contigs into nucleotide-space contigs?
> 
> As mentioned by others on the seqanswers thread, you need to either record
> the first base, or convert to a normalised colour space format. Because
> you're ignoring quality values in assembly, there's no need to throw out the
> first base or the first colour-space read.
>


> Storing reads together with their first base would be a good idea:
> 
> struct csRead {
>     char firstBase;          // 'N' if not known
>     std::string csSequence;  // colours only, without the first base
> };
> 
> or as a k-mer:
> struct csKmer {
>     char firstBase;
>     char csSequence[kmer_size];  // kmer_size must be a compile-time constant
> };
> 
> You mentioned yourself that there's no need to store the reverse complement
> when in colour-space. To get the reverse complement in base space, you
> reverse complement the first base, convert to base space, then reverse the
> sequence.
> 

Complement the first base and reverse the colors -- this is the recipe to
"reverse-complement" a color-space read.
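
To make the recipe concrete, here is a minimal sketch, not Ray's code. It assumes the usual 2-bit coding A=0, C=1, G=2, T=3, under which a color is the XOR of the two adjacent base codes. The names CsRead, baseCode, codeBase and reverseComplement are made up for illustration. One subtlety, as far as I can tell: once the colors are reversed, the base that leads the new read is the complement of the original read's *last* base, which can be obtained by XOR-ing all the colors onto the first base.

```cpp
#include <string>

// Hypothetical read type: a leading base followed by colors '0'..'3'.
struct CsRead {
    char firstBase;          // 'A', 'C', 'G', 'T', or 'N' if not known
    std::string csSequence;  // colors only
};

// 2-bit codes chosen so that color == code(x) XOR code(y).
int baseCode(char b) {
    switch (b) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return -1;  // unknown base
    }
}

char codeBase(int c) { return "ACGT"[c & 3]; }

// Reverse-complement a color-space read while staying in color space.
// Assumes every color is in '0'..'3' (no '.' placeholders).
CsRead reverseComplement(const CsRead& read) {
    CsRead rc;
    // The color string of the reverse complement is the reversed colors.
    rc.csSequence.assign(read.csSequence.rbegin(), read.csSequence.rend());

    int code = baseCode(read.firstBase);
    if (code < 0) {  // first base unknown: the result's first base is unknown too
        rc.firstBase = 'N';
        return rc;
    }
    // Walk the colors to find the last base of the original read.
    for (char color : read.csSequence) code ^= (color - '0');
    // Complementing a base is an XOR with 3 (A<->T, C<->G).
    rc.firstBase = codeBase(code ^ 3);
    return rc;
}
```

For example, A3200233 decodes to ATCCCTAT, whose reverse complement ATAGGGAT is A3320023 in color space.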

I think I am starting to get it.

> The matching of sequences should be done in colour-space. Unfortunately (or
> fortunately), for every colour-space sequence there are four possible
> sequences that can match to it (8 if you include reverse complement). For
> long enough high-entropy sequences, this shouldn't matter, but it's worth
> considering when choosing k-mer size:
> 

Ray presently can do assemblies of color-space reads in color-space with
alphabet {0,1,2,3}.

> A3200233 -> ATCCCTAT (ATAGGGAT); C3200233 -> CGAAAGCG (CGCTTTCG); G3200233 ->
> GCTTTCGC (GCGAAAGC); T3200233 -> TAGGGATA (TATCCCTA)
> 
> [apologies if I got those wrong... I did a manual conversion]
> 
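
The manual conversions above can be checked with a small decoder. This is only a sketch under the same assumption as before (A=0, C=1, G=2, T=3, color = XOR of adjacent base codes); decode and allDecodings are hypothetical helper names, and the input is assumed to contain only the colors '0' to '3'.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decode a color string into bases, given the base that precedes the first color.
// Returns an empty string if firstBase is not one of A, C, G, T.
std::string decode(char firstBase, const std::string& colors) {
    const std::string alphabet = "ACGT";
    std::size_t code = alphabet.find(firstBase);
    if (code == std::string::npos) return "";

    std::string bases(1, firstBase);
    for (char color : colors) {
        // Each color tells how to move from the current base to the next one.
        code ^= static_cast<std::size_t>(color - '0');
        bases.push_back(alphabet[code]);
    }
    return bases;
}

// The four base-space sequences compatible with one color string,
// in the order of starting base A, C, G, T.
std::vector<std::string> allDecodings(const std::string& colors) {
    std::vector<std::string> result;
    for (char first : std::string("ACGT")) result.push_back(decode(first, colors));
    return result;
}
```

With "3200233" this gives ATCCCTAT, CGAAAGCG, GCTTTCGC and TAGGGATA, which agrees with the four forward sequences listed above.
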
> If there is a good chance of a match between two reads, and one read has an
> unknown first base, then you can infer that base from the other read.
> 

Yes, but keep in mind that Ray never computes pairwise similarity.



> If you want to mix colour-space and non-cs reads (as would fit in with the
> "combine every technology" spirit of Ray), it makes sense to store everything
> in this format (i.e. the internal representation of all sequences is in
> colour-space). 

I agree.
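
For the record, converting a base-space read to such a representation is cheap. The sketch below is an illustration only, not something Ray does today; CsRead and toColorSpace are hypothetical names, and the use of '.' for a transition that touches an unknown base is an assumption of this sketch.

```cpp
#include <cstddef>
#include <string>

// Hypothetical read type: a leading base followed by colors.
struct CsRead {
    char firstBase;          // first base of the original read, 'N' if empty
    std::string csSequence;  // one color per pair of adjacent bases
};

// Encode a base-space read in color space, with A=0, C=1, G=2, T=3 and
// color = code(previous base) XOR code(next base).
CsRead toColorSpace(const std::string& bases) {
    const std::string alphabet = "ACGT";
    CsRead read{'N', ""};
    if (bases.empty()) return read;

    read.firstBase = bases[0];
    for (std::size_t i = 1; i < bases.size(); ++i) {
        std::size_t a = alphabet.find(bases[i - 1]);
        std::size_t b = alphabet.find(bases[i]);
        if (a == std::string::npos || b == std::string::npos) {
            read.csSequence.push_back('.');  // assumption: '.' marks an unknown color
        } else {
            read.csSequence.push_back(static_cast<char>('0' + (a ^ b)));
        }
    }
    return read;
}
```

A read of n bases becomes one leading base plus n-1 colors, so ATCCCTAT turns into A followed by 3200233.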

> The 'unknown first base' problem can be alleviated somewhat by
> storing the non-colour space reads into the graph first. Using an internal
> representation of colour-space would also halve memory requirements, because
> there's no longer a need to explicitly store the reverse complement sequence
> (just make sure the match function matches in both forward and reverse
> directions in colour space).

In Ray, the reverse-complement of any read is never stored.

Furthermore, for any pair of reverse-complement k-mers, only one (the
lowest) is stored.

Like in Velvet, Ray uses 2 bits per symbol.
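
As an illustration of these two points in color space, where the "reverse complement" of a k-mer is simply its reversed color string, here is a sketch. It is not Ray's actual code: pack, reversePacked and canonical are hypothetical names, k is assumed to be at most 32 so a k-mer fits in one 64-bit word, and comparing the packed integers is only one possible definition of "lowest".

```cpp
#include <algorithm>
#include <cstdint>
#include <string>

// Pack a color string ('0'..'3', at most 32 symbols) using 2 bits per symbol.
// The first color ends up in the most significant used bits.
std::uint64_t pack(const std::string& colors) {
    std::uint64_t word = 0;
    for (char c : colors) {
        word = (word << 2) | static_cast<std::uint64_t>(c - '0');
    }
    return word;
}

// Reverse the order of the k 2-bit symbols in a packed k-mer.
std::uint64_t reversePacked(std::uint64_t word, int k) {
    std::uint64_t out = 0;
    for (int i = 0; i < k; ++i) {
        out = (out << 2) | (word & 3u);  // move the lowest symbol to the other end
        word >>= 2;
    }
    return out;
}

// Keep only one representative of a k-mer and its reverse: the smaller word.
std::uint64_t canonical(std::uint64_t word, int k) {
    return std::min(word, reversePacked(word, k));
}
```

Both 3200233 and its reverse 3320023 map to the same canonical word, so only one entry is needed in the graph for the pair.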

> 
> For reading out into contigs, if output is desired in base-space, only
> produce a contig if the base-space sequence is known. As you read through the
> graph that makes up each contig, you'll get a better idea of what the bases
> should be (i.e. high coverage with a particular initial base, or many
> different nodes that are consistent with a particular base sequence).
> 

Well, this is where I think I don't get it at all.

Contigs *are* paths in the k-mer graph (with alphabet {0,1,2,3}). To
decode these paths, the first base-space symbol must be known. However,
a path can obviously start in the middle of a read -- thus in that case
the first base would remain unknown. (right?)
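
One possible way around this, sketched here under assumptions rather than as something Ray implements: a single known base at *any* position of the path is enough to decode the whole path, because the colors can be followed in both directions from that position. So if some read with a known first base can be placed on the path, its first base could serve as the anchor. decodeFromAnchor is a hypothetical name; the coding is again A=0, C=1, G=2, T=3 with color = XOR of adjacent base codes, and colors are assumed to be in '0'..'3'.

```cpp
#include <cstddef>
#include <string>

// Decode a color-space path from one known base.
// colors[i] is the transition between base i and base i+1, so the decoded
// path has colors.size() + 1 bases. anchorPos is the index of the known base.
// Returns an empty string if the anchor is not A, C, G or T or is out of range.
std::string decodeFromAnchor(const std::string& colors,
                             std::size_t anchorPos, char anchorBase) {
    const std::string alphabet = "ACGT";
    std::size_t anchor = alphabet.find(anchorBase);
    if (anchor == std::string::npos || anchorPos > colors.size()) return "";

    std::string bases(colors.size() + 1, 'N');
    bases[anchorPos] = anchorBase;

    // Walk to the right of the anchor.
    std::size_t code = anchor;
    for (std::size_t i = anchorPos; i < colors.size(); ++i) {
        code ^= static_cast<std::size_t>(colors[i] - '0');
        bases[i + 1] = alphabet[code];
    }
    // Walk to the left of the anchor; XOR is its own inverse.
    code = anchor;
    for (std::size_t i = anchorPos; i > 0; --i) {
        code ^= static_cast<std::size_t>(colors[i - 1] - '0');
        bases[i - 1] = alphabet[code];
    }
    return bases;
}
```

For instance, knowing only that base 4 of the path 3200233 is a C recovers ATCCCTAT in full.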

> It may also be useful to have an option to output as colour-space, just in
> case someone wants to do further processing and error correction.
> 

Ray presently assembles csfasta files into color-space contigs.

I used the open dataset

http://solidsoftwaretools.com/gf/project/ecoli50x50/

(1 year old according to the web site -- 2010-05-19)


However, this dataset has a very high error rate!



                                   Sébastien


> -- David Eccles (gringer)
> 




_______________________________________________
Denovoassembler-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
