Hello

Sorry for the delay, I moved to Toronto 3 days ago.

> ________________________________________
> From: Eccles, David [david.ecc...@mpi-muenster.mpg.de]
> Sent: 27 June 2011 17:50
> To: Sébastien Boisvert
> Cc: denovoassembler-users@lists.sourceforge.net
> Subject: AW: colour space reads
>
> From: Sébastien Boisvert [mailto:sebastien.boisver...@ulaval.ca]
>>> You mentioned yourself that there's no need to store the reverse complement
>>> when in colour-space. To get the reverse complement in base space, you
>>> reverse complement the first base, convert to base space, then reverse the
>>> sequence.
>> Complement the first base and reverse the color -- this is the recipe to
>> "reverse-complement" a color-space read.
>> I think I am starting to get it.
>
> Be careful with this. Order of processing matters a lot with colour space.
> You need to reverse the resulting *base-space* sequence, rather than
> reversing the colour space sequence then working it out in base space. (e.g.
> the reverse complement of A3200233 is reverse(T3200233), not T3320023)
>

Using the code described in this figure:
http://www.ploscompbiol.org/article/slideshow.action?uri=info:doi/10.1371/journal.pcbi.1000386&imageURI=info:doi/10.1371/journal.pcbi.1000386.g002

decode(A3200233) = ATCCCTAT
decode(reverse(T3200233)) = decode(3320023T) = ATAGGGAT

Now I understand.

>>> If there is a good chance of a match between two reads, and one read has an
>>> unknown first base, then you can infer that base from the other read.
>> Yes, but keep in mind that Ray never computes pairwise similarity.
>
> Sure. In the scenario I described, both sequences would have exactly the same
> colour-space representation (excluding first base) -- no pairwise differences
> necessary. The only difference is that one can be converted unambiguously to
> a base-space sequence (known first base), and the other has up to 4
> base-space representations (unknown first base).
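
For reference, the decode and reverse-complement recipe discussed above can be sketched in a few lines of Python. This is only an illustration, not code from Ray: the function names decode and reverse_complement are made up here, and the XOR trick assumes the 2-bit numbering A=0, C=1, G=2, T=3, which reproduces the two worked examples in this thread.

```python
# Assumed numbering: A=0, C=1, G=2, T=3. With it, the colour between two
# adjacent bases is the XOR of their codes (0 = same base, 3 = complement).
BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def decode(read):
    """Decode a colour-space read such as 'A3200233' into base space."""
    bases = [read[0]]
    for colour in read[1:]:
        previous = BASES.index(bases[-1])
        bases.append(BASES[previous ^ int(colour)])
    return "".join(bases)


def reverse_complement(read):
    """Base-space reverse complement of a colour-space read.

    Order matters: complement the first base, decode, and only then
    reverse the resulting base-space sequence.
    """
    flipped = COMPLEMENT[read[0]] + read[1:]
    return decode(flipped)[::-1]


print(decode("A3200233"))              # ATCCCTAT
print(reverse_complement("A3200233"))  # ATAGGGAT
print(decode("T3320023"))              # TATCCCTA, the wrong order of steps
```

The last line shows why reversing the colours first does not work: T3320023 decodes to TATCCCTA, not to ATAGGGAT.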
>
>> Like in Velvet, Ray uses 2 bits per symbol.
>
> And also a flag for whether or not the kmer is in colour-space (or all kmers
> in colour space), I presume. For each kmer (assuming you want to be able to
> output in base-space), Ray will also need to record a first base, preferably
> in a separate structure, but it could just be the first 2-bit symbol in the
> sequence.
>

Actually, m_parameters->getColorSpaceMode() does that.

>> a path can obviously start in the middle of a read -- thus in that case
>> the first base would remain unknown. (right?)
>
> From each read, you can generate putative first bases for any subsequence of
> an uninterrupted <first base>[0123]+ sequence. This requires converting the
> sequence to base space, and inserting the converted base at the appropriate
> position. I'll try to demonstrate this starting with a colour-space sequence:
>
> A2112322311010133121320003202203201302321
>

Quite interesting, but this assumes that reads are relatively error-free, right?

> This has starting base A, complementary transitions have colour 3,
> non-complementary are 1,2 depending on how far away they are in the alphabet
> [just FYI, that's how I remember it]:
>
> AGTGATCTACAACCATACTGCTTTTAGGAGGCTTGCCTAGT [or something like that --
> hopefully I converted it correctly]
>
> If I start with the colour-space sequence, I can work out the 'starting base'
> at any position by converting to base-space. For example, before the string
> of 3 0s, you can insert a T:
> <A>211232231101013312132<T>0003202203201302321
>
> I'll try working through a scenario.
> Let's say I want the sequence split up
> into groups of 10-mers:
>
> 2112322311 0101331213 2000320220 3201302321
>
> I know the first base for the first group:
>
> <A>2112322311 0101331213 2000320220 3201302321
>
> I can convert that first group to base space, and the last base of that
> converted group is the first base for the next group (10 colours after a
> known base decode to 11 bases):
>
> (<A>2112322311 / AGTGATCTACA) <A>0101331213 2000320220 3201302321
>
> and so on:
>
> <A>2112322311 <A>0101331213 <C>2000320220 <G>3201302321
>
> If there's a misread somewhere, any sequences past the misread will have
> ambiguous colour-space -> base-space translations:
>

Yes, and this is bad.

> <A>2112322311 <A>01013X1213 <N>2000320220 <N>3201302321
>
> The problem is that for a sufficiently large dataset (or error-containing
> dataset), you'll get disagreements about the starting base for a given
> sequence. If Ray were to record the counts for each observed starting base,
> it might be possible to reduce this error (e.g. pick the most frequently
> occurring starting base), bearing in mind that the starting base for sequence
> closer to the start of a read will be more reliable than the calculated
> starting bases at the end of a read.
>

Yes, but there can be multiple correct nucleotides before any color-space k-mer if the latter is a repeated k-mer in the color-space genome.

> Hope this helps,
>

It pretty much did! Especially decode(reverse(T3200233)) = decode(3320023T) = ATAGGGAT

> David Eccles (gringer)
>

p.s.: To decode color-space manually, I use the automaton in panel B of
http://www.ploscompbiol.org/article/slideshow.action?uri=info:doi/10.1371/journal.pcbi.1000386&imageURI=info:doi/10.1371/journal.pcbi.1000386.g002

Sébastien
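
The 10-mer walk-through above can be sketched as follows. This is a rough illustration under stated assumptions, not how Ray handles it: decode_colours and first_bases are hypothetical names, the colour code is assumed to be the XOR of base codes with A=0, C=1, G=2, T=3, and any character outside 0123 is treated as a misread. Since 10 colours after a known base decode to 11 bases, each group's first base is the last base of the previous group's decoding.

```python
BASES = "ACGT"  # assumed numbering: A=0, C=1, G=2, T=3


def decode_colours(first_base, colours):
    """Decode colours into base space, starting from a known first base."""
    bases = [first_base]
    for colour in colours:
        previous = BASES.index(bases[-1])
        bases.append(BASES[previous ^ int(colour)])
    return "".join(bases)


def first_bases(first_base, colours, size):
    """Split colours into groups of `size`, pairing each with its first base.

    Once a group contains a misread, every later group gets 'N' because
    its first base can no longer be worked out.
    """
    groups = []
    current = first_base
    for start in range(0, len(colours), size):
        group = colours[start:start + size]
        groups.append((current, group))
        if current != "N" and all(c in "0123" for c in group):
            # The last decoded base of this group starts the next group.
            current = decode_colours(current, group)[-1]
        else:
            current = "N"
    return groups


clean = "2112322311010133121320003202203201302321"
misread = "211232231101013X121320003202203201302321"

for colours in (clean, misread):
    print(" ".join(f"<{base}>{group}"
                   for base, group in first_bases("A", colours, 10)))
# <A>2112322311 <A>0101331213 <C>2000320220 <G>3201302321
# <A>2112322311 <A>01013X1213 <N>2000320220 <N>3201302321
```

Counting the observed first bases per k-mer, as suggested above, would need an extra table keyed by the colour-space k-mer; that part is left out of the sketch.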