________________________________________
> From: David Eccles (gringer) [david.ecc...@mpi-muenster.mpg.de]
> Sent: 29 June 2011 05:48
> To: Sébastien Boisvert
> Cc: denovoassembler-users@lists.sourceforge.net
> Subject: Re: Kmer format -- storing colour-space
> 
> On Tue, 28 Jun 2011 19:49:42 David Eccles (gringer) wrote:
>> I need to go a bit deeper and replicate this kind of format for Kmers
>> for it to be really useful, but I was able to fit things in this far
>> without changing too much other code.
> 
> This will probably be quite an over-reaching change. To conserve
> memory/space, I think it would be reasonable to put the colour-space and
> first base flags into the m_u64 of the Kmers, something like this:
>

I don't think a color-space flag is necessary;

m_parameters->getColorSpaceMode does that already.

 
> 01 23 456789012345678901...
> CK FB  1 2 3 4 5 6 7 8 9
> 
> I suggest keeping things at a 2-bit boundary because it makes working
> out base-space / colour-space positions a little easier.
> 
> Bit 0: Colour-space flag (1 -- kmer in colour space, 0 otherwise)
> Bit 1: First-base known flag (1 -- first base is known, 0 otherwise)
> [always set to 1 if bit 0 is 0 -- i.e. in base-space]
> 
> Bit 2-3:
> bit 1 == 1:
> First-base  00/01/10/11 -> A/C/G/T [as usual]
> otherwise:
> ??? possibly checksum (modulo 4 sum of 2-bit chunks)
>

In my opinion, the color-space flag is not necessary.

So we need 3 bits of information, or 4 bits with padding:

1 bit: is the first base known?
1 bit: padding
2 bits: if the first base is known, what is it?

I agree with you overall, but I think it would be easier to
put this information in the last 4 bits of a Kmer (a k-mer being an array of
uint64_t).
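
As a rough sketch of that layout (not the actual Ray code): the struct below assumes the k-mer is a plain array of uint64_t and that "the last 4 bits" means the top 4 bits of the last word. KmerSketch, WORDS, and the accessor names are hypothetical, and the order of the bits inside the 4-bit field is an assumption as well.

```cpp
#include <cstdint>

// Hypothetical stand-in for a k-mer stored as an array of 64-bit words.
const int WORDS = 2;

struct KmerSketch {
    uint64_t m_u64[WORDS] = {};

    // Assumed layout of the top 4 bits of the last word:
    //   bit 63     : first base known (1) or not (0)
    //   bit 62     : padding, always 0
    //   bits 61-60 : first base, 00/01/10/11 -> A/C/G/T
    void setFirstBase(uint8_t base) {
        uint64_t &last = m_u64[WORDS - 1];
        last &= ~(0xFULL << 60);              // clear the 4 metadata bits
        last |= (1ULL << 63);                 // mark the first base as known
        last |= (uint64_t(base & 3) << 60);   // store the 2-bit base code
    }

    void clearFirstBase() { m_u64[WORDS - 1] &= ~(0xFULL << 60); }

    bool firstBaseKnown() const { return (m_u64[WORDS - 1] >> 63) & 1; }

    uint8_t firstBase() const { return (m_u64[WORDS - 1] >> 60) & 3; }
};
```

The sequence bits below bit 60 of the last word are never touched by these accessors, so code that only packs and unpacks the sequence would not need to know about the extra field.
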

I like your idea because it does not change Read.h


> Bit 4 onwards:
> sequence in colour-space, or remaining sequence in base-space
> 
> Adding this will use 2 extra bits, making the max kmer length for one
> 64-bit value 31 bases:

Shouldn't that be 4 bits, not 2?

> #define KMER_REQUIRED_BITS (2*MAXKMERLENGTH+2)
> 
> Note that the first base is stored in bits 2-3, so that increases the
> kmer size by 1 for base-space sequences, making the effective length for
> a given Kmer the same in both base-space and colour-space.
> 

Yes, if these bits are at the beginning.

But you need to change some code before claiming victory.

> When hashing, bits 1,2,3 should not be considered, because they could
> change over the course of the search / assembly process.
>

Again, storing these 4 bits at the end and increasing the required bits by 4
does the job, I think, and is easier to implement.
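
A minimal sketch of the "4 bits at the end" variant, under stated assumptions: MAXKMERLENGTH is given an arbitrary value here, KMER_METADATA_BITS and KMER_U64_WORDS are hypothetical names, and the hash is a simple FNV-1a-style mix applied per word, not the hash Ray actually uses. The point it illustrates is that the metadata bits are masked out before hashing, so the hash does not change when the first base becomes known.

```cpp
#include <cstdint>

// Hypothetical values; in the real code MAXKMERLENGTH is a build-time setting.
#define MAXKMERLENGTH 32
#define KMER_METADATA_BITS 4
#define KMER_REQUIRED_BITS (2 * MAXKMERLENGTH + KMER_METADATA_BITS)
// Number of 64-bit words needed to hold the required bits, rounded up.
#define KMER_U64_WORDS ((KMER_REQUIRED_BITS + 63) / 64)

// Hashes only the sequence bits. The 4 metadata bits are assumed to be the
// top 4 bits of the last word and are cleared before mixing.
inline uint64_t hashSequenceBits(const uint64_t *words) {
    uint64_t h = 14695981039346656037ULL;      // FNV-1a 64-bit offset basis
    for (int i = 0; i < KMER_U64_WORDS; i++) {
        uint64_t w = words[i];
        if (i == KMER_U64_WORDS - 1)
            w &= ~(0xFULL << 60);              // ignore the metadata bits
        h = (h ^ w) * 1099511628211ULL;        // FNV prime, one step per word
    }
    return h;
}
```

With MAXKMERLENGTH set to 32 this needs 68 bits, hence two words; with 30 it fits exactly in one.
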
 
> For storing/unpacking the kmer in base-space, just increment the
> initial bit location by 1 (i.e. i=0 -> i=1).
> 

Well, I am not working on color-space right now. 



> For checking forward sequences, equality for two sequences in base-space
> compares bits 2 onwards. Equality for two sequences in colour-space
> compares bits 4 onwards, then (if a match is found), check bit 1. If bit
> 1 is the same in both sequences, declare mismatch if bits 2-3 differ,
> otherwise match. If bit 1 is different, declare a match, then copy over
> the first base from the sequence which has bit 1 set.
> 
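
The colour-space comparison rule described above can be sketched as follows. This uses an unpacked, hypothetical ColourKmer struct instead of the real bit layout, and it assumes that two k-mers whose first base is unknown on both sides count as a match when their colours agree (the original text leaves that case open, since bits 2-3 might then hold a checksum).

```cpp
#include <cstdint>

// Hypothetical unpacked view of a colour-space k-mer.
struct ColourKmer {
    uint64_t colours;      // packed 2-bit colours (the sequence part)
    bool firstBaseKnown;   // the "first base known" flag
    uint8_t firstBase;     // 0..3 -> A/C/G/T, meaningful only when known
};

// Returns true when a and b match. When exactly one side knows its first
// base, that base is copied to the other side.
inline bool matchColourSpace(ColourKmer &a, ColourKmer &b) {
    if (a.colours != b.colours)
        return false;                          // sequences differ
    if (a.firstBaseKnown && b.firstBaseKnown)
        return a.firstBase == b.firstBase;     // both known: bases must agree
    if (a.firstBaseKnown != b.firstBaseKnown) {
        ColourKmer &known = a.firstBaseKnown ? a : b;
        ColourKmer &unknown = a.firstBaseKnown ? b : a;
        unknown.firstBase = known.firstBase;   // propagate the known base
        unknown.firstBaseKnown = true;
    }
    return true;                               // colours agree, no conflict
}
```

Because a match can modify one of its arguments, a comparison like this is not a pure equality test, which is one more reason to keep the flag and first base out of the hash.
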
> For comparing base-space against colour-space, first check to see if the
> colour-space sequence has a known first-base, report a mismatch if the
> first base is known and different from the base-space sequence.
> Otherwise, the base-space sequence (including first base) needs to be
> converted to colour-space, then compared. If you're doing that anyway,
> it might make sense to store the converted sequence to make subsequent
> comparisons a bit quicker (either as well as the original, or by
> deleting the original base-space sequence). There could be a compare()
> function that returns the converted (and matching) colour-space
> sequence, otherwise returns an invalid packed sequence (e.g. 00...00, or
> some other number with a checksum mismatch).
> 
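
For the base-space to colour-space conversion mentioned above, here is a small sketch that works on strings for readability. It relies on the standard di-base colour encoding: with A/C/G/T coded as 0/1/2/3, the colour of two adjacent bases is the XOR of their codes. The function names baseCode and toColourSpace are hypothetical, and a packed version would apply the same XOR to 2-bit chunks.

```cpp
#include <cstddef>
#include <string>

// 2-bit code for a base, A/C/G/T -> 0/1/2/3; -1 for anything else.
inline int baseCode(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return -1;
    }
}

// Converts a base-space string to a colour string ('0'..'3'), one colour per
// adjacent pair of bases. Returns an empty string on invalid input.
inline std::string toColourSpace(const std::string &bases) {
    std::string colours;
    for (std::size_t i = 1; i < bases.size(); i++) {
        int a = baseCode(bases[i - 1]);
        int b = baseCode(bases[i]);
        if (a < 0 || b < 0)
            return "";
        colours += char('0' + (a ^ b));   // colour = XOR of the two base codes
    }
    return colours;
}
```

A sequence of n bases yields n-1 colours, which is why the first base has to be kept separately to recover the original base-space sequence.
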
> Given that kmer.cpp is fairly small, I should be able to manage
> implementing these changes somewhat quickly. However, if any other
> classes assume a particular format for the Kmers (e.g. the Read class),
> then those will need to be changed as well (ideally so that they no
> longer expect a particular format). If a checksum is used, and it is
> enforced that ^00 can't happen in a true structure, it should be
> possible to identify if any Kmer-modifying classes have this assumption.
> 
> Hope this helps,
>


 
> - David Eccles (gringer)
> 

                                                     Sébastien
