[Denovoassembler-users] RE : colour space reads

Sébastien Boisvert Sun, 03 Jul 2011 06:18:57 -0700

Hello


> ________________________________________
> From: Eccles, David [david.ecc...@mpi-muenster.mpg.de]

Sorry for the delay, I moved to Toronto 3 days ago.


> Sent: 27 June 2011 17:50
> To: Sébastien Boisvert
> Cc: denovoassembler-users@lists.sourceforge.net
> Subject: AW: colour space reads
> 
> From: Sébastien Boisvert [mailto:sebastien.boisver...@ulaval.ca]
>>> You mentioned yourself that there's no need to store the reverse complement
>>> when in colour-space. To get the reverse complement in base space, you
>>> reverse complement the first base, convert to base space, then reverse the
>>> sequence.
>> Complement the first base and reverse the color -- this is the recipe to
>> "reverse-complement" a color-space read.
>> I think I am starting to get it.
> 
> Be careful with this. Order of processing matters a lot with colour space.
> You need to reverse the resulting *base-space* sequence, rather than
> reversing the colour space sequence then working it out in base space. (e.g.
> the reverse complement of A3200233 is reverse(T3200233), not T3320023)
>

Using the color code described in this figure:
http://www.ploscompbiol.org/article/slideshow.action?uri=info:doi/10.1371/journal.pcbi.1000386&imageURI=info:doi/10.1371/journal.pcbi.1000386.g002


decode(A3200233) = ATCCCTAT
 
decode(reverse(T3200233)) = decode(3320023T) = ATAGGGAT

Now I understand.
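
For anyone following along, here is a minimal Python sketch of the two
operations above. It assumes the usual SOLiD two-base code (colour 0 keeps the
base, 1 swaps A<->C and G<->T, 2 swaps A<->G and C<->T, 3 swaps A<->T and
C<->G). The function names are illustrative only and are not taken from Ray.

```python
BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def next_base(base, colour):
    # With A, C, G, T numbered 0..3, applying a colour is an XOR of the two numbers.
    return BASES[BASES.index(base) ^ int(colour)]

def decode(read):
    """Decode a read written as <first base><colours>, e.g. 'A3200233'."""
    bases = [read[0]]
    for colour in read[1:]:
        bases.append(next_base(bases[-1], colour))
    return "".join(bases)

def reverse_complement_from_colours(read):
    """Complement the first base, decode, then reverse the base-space result."""
    flipped = COMPLEMENT[read[0]] + read[1:]
    return decode(flipped)[::-1]

print(decode("A3200233"))                           # ATCCCTAT
print(reverse_complement_from_colours("A3200233"))  # ATAGGGAT
```

Note that the reversal is applied to the decoded bases, not to the colour
string, which is the ordering David warns about.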

>>> If there is a good chance of a match between two reads, and one read has an
>>> unknown first base, then you can infer that base from the other read.
>> Yes, but keep in mind that Ray never computes pairwise similarity.
> 
> Sure. In the scenario I described, both sequences would have exactly the same
> colour-space representation (excluding first base) -- no pairwise differences
> necessary. The only difference is that one can be converted unambiguously to
> a base-space sequence (known first base), and the other has up to 4
> base-space representations (unknown first base).
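
A small Python sketch of that scenario, under the assumption of the usual
SOLiD two-base code: a colour string with an unknown first base has four
candidate decodings, and a read with identical colours and a known first base
selects one of them. The helper names are hypothetical.

```python
BASES = "ACGT"

def decode(first_base, colours):
    """Decode colours (digits 0-3) starting from a known first base."""
    bases = [first_base]
    for colour in colours:
        bases.append(BASES[BASES.index(bases[-1]) ^ int(colour)])
    return "".join(bases)

def candidate_decodings(colours):
    """All four base-space sequences compatible with an unknown first base."""
    return {base: decode(base, colours) for base in BASES}

def infer_first_base(colours, known_read):
    """Borrow the first base of known_read ('<base><colours>') if the colours match exactly."""
    return known_read[0] if known_read[1:] == colours else None

for base, sequence in candidate_decodings("3200233").items():
    print(base, sequence)
print(infer_first_base("3200233", "A3200233"))
```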
> 
>> Like in Velvet, Ray uses 2 bits per symbol.
> 
> And also a flag for whether or not the kmer is in colour-space (or all kmers
> in colour space), I presume. For each kmer (assuming you want to be able to
> output in base-space), Ray will also need to record a first base, preferably
> in a separate structure, but it could just be the first 2-bit symbol in the
> sequence.
> 

Actually, m_parameters->getColorSpaceMode() does that.
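
To make the storage idea concrete, here is a rough Python sketch of packing
colours at 2 bits per symbol. It is not Ray's actual code: pack_colours and
unpack_colours are hypothetical names, and the first base would have to be
kept in a separate field, as David suggests.

```python
def pack_colours(colours):
    """Pack a string of colours (digits 0-3) into an integer, 2 bits per colour."""
    value = 0
    for colour in colours:
        value = (value << 2) | int(colour)
    return value

def unpack_colours(value, k):
    """Recover the k colours from the packed integer, most significant pair first."""
    return "".join(str((value >> (2 * (k - 1 - i))) & 3) for i in range(k))

kmer = "2112322311"
packed = pack_colours(kmer)
print(packed, unpack_colours(packed, len(kmer)))
```

Because colours and nucleotides both need 2 bits per symbol, the same packed
integer could hold either, which is why a separate mode flag is needed to say
how to interpret it.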

>> a path can obviously start in the middle of a read -- thus in that case
>> the first base would remain unknown. (right?)
> 
> From each read, you can generate putative first bases for any subsequence of
> an uninterrupted <first base>[0123]+ sequence. This requires converting the
> sequence to base space, and inserting the converted base at the appropriate
> position. I'll try to demonstrate this starting with a colour-space sequence:
> 
> A2112322311010133121320003202203201302321
> 

Quite interesting, but this assumes that reads are relatively error-free, right?

> This has starting base A, complementary transitions have colour 3,
> non-complementary are 1,2 depending on how far away they are in the alphabet
> [just FYI, that's how I remember it]:
> 
> AGTGATCTACAACCATACTGCTTTTAGGAGGCTTGCCTAGT [or something like that --
> hopefully I converted it correctly]
> 
> If I start with the colour-space sequence, I can work out the 'starting base'
> at any position by converting to base-space. For example, before the string
> of 3 0s, you can insert a T:
> <A>211232231101013312132<T>0003202203201302321
> 
> I'll try working through a scenario. Let's say I want the sequence split up
> into groups of 10-mers:
> 
> 2112322311  0101331213  2000320220  3201302321
> 
> I know the first base for the first group:
> 
> <A>2112322311  0101331213  2000320220  3201302321
> 
> I can convert that first group to base space, and the last base of that
> converted group is the first base for the next group:
> 
> (<A>2112322311 / AGTGATCTACA) <A>0101331213  2000320220  3201302321
> 
> and so on:
> 
> <A>2112322311 <A>0101331213 <C>2000320220 <G>3201302321
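
The grouping can be sketched in Python: decode the whole read once, then take
the decoded base at each group's offset as that group's starting base. This
assumes the usual SOLiD two-base code, and the helper names are illustrative
only.

```python
BASES = "ACGT"

def decode(first_base, colours):
    """Decode colours (digits 0-3) starting from a known first base."""
    bases = [first_base]
    for colour in colours:
        bases.append(BASES[BASES.index(bases[-1]) ^ int(colour)])
    return "".join(bases)

def group_start_bases(first_base, colours, size):
    """Return (starting base, colour group) pairs for consecutive groups of `size` colours."""
    decoded = decode(first_base, colours)
    # decoded[i] is the base just before colour i, so it starts the group at offset i.
    return [(decoded[i], colours[i:i + size])
            for i in range(0, len(colours), size)]

colours = "2112322311010133121320003202203201302321"
for base, group in group_start_bases("A", colours, 10):
    print(f"<{base}>{group}")
```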
> 
> If there's a misread somewhere, any sequences past the misread will have
> ambiguous colour-space -> base-space translations:
> 

Yes, and this is bad.

> <A>2112322311 <A>01013X1213 <N>2000320220 <N>3201302321
> 
> The problem is that for a sufficiently large dataset (or error-containing
> dataset), you'll get disagreements about the starting base for a given
> sequence. If Ray were to record the counts for each observed starting base,
> it might be possible to reduce this error (e.g. pick the most frequently
> occurring starting base), bearing in mind that the starting base for sequence
> closer to the start of a read will be more reliable than the calculated
> starting bases at the end of a read.
> 

Yes, but there can be multiple correct nucleotides before any color-space k-mer
if the latter is a repeated k-mer in the color-space genome.
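
A rough Python sketch of the counting idea, under stated assumptions: each
read is written as <first base><colours>, any symbol outside 0-3 is treated as
a misread, and the base is considered unknown from that point to the end of
the read. The function name is hypothetical and this is not how Ray works
today.

```python
from collections import Counter, defaultdict

BASES = "ACGT"

def start_base_votes(reads, k):
    """Tally, for each colour k-mer, the bases observed just before it."""
    votes = defaultdict(Counter)
    for read in reads:
        base, colours = read[0], read[1:]
        for i in range(len(colours) - k + 1):
            kmer = colours[i:i + k]
            # Only vote when the base is still known and the k-mer has no misread.
            if base is not None and all(c in "0123" for c in kmer):
                votes[kmer][base] += 1
            # Advance the known base by one colour, or lose it at a misread.
            c = colours[i]
            if base is not None and c in "0123":
                base = BASES[BASES.index(base) ^ int(c)]
            else:
                base = None
    return votes

reads = ["A2112322311", "A2112322311", "T2112322311"]
votes = start_base_votes(reads, 10)
print(votes["2112322311"].most_common())
```

A split vote is not necessarily an error: for a k-mer that is repeated in the
color-space genome, several starting bases can all be correct, so the full
tally is worth keeping rather than only the winner.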

> Hope this helps,
> 

It pretty much did!

Especially for 

decode(reverse(T3200233)) = decode(3320023T) = ATAGGGAT

> David Eccles (gringer)
> 

P.S.: To decode color-space manually, I use the automaton in panel B of 
http://www.ploscompbiol.org/article/slideshow.action?uri=info:doi/10.1371/journal.pcbi.1000386&imageURI=info:doi/10.1371/journal.pcbi.1000386.g002


                             Sébastien
_______________________________________________
Denovoassembler-users mailing list
Denovoassembler-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
