________________________________________
> From: David Eccles (gringer) [david.ecc...@mpi-muenster.mpg.de]
> Sent: 29 June 2011 05:48
> To: Sébastien Boisvert
> Cc: denovoassembler-users@lists.sourceforge.net
> Subject: Re: Kmer format -- storing colour-space
>
> On Tue, 28 Jun 2011 19:49:42 David Eccles (gringer) wrote:
>> I need to go a bit deeper and replicate this kind of format for Kmers
>> for it to be really useful, but I was able to fit things in this far
>> without changing too much other code.
>
> This will probably be quite an over-reaching change. To conserve
> memory/space, I think it would be reasonable to put the colour-space and
> first base flags into the m_u64 of the Kmers, something like this:
>

I don't think a color-space flag is necessary; m_parameters->getColorSpaceMode
already provides that.

> 01 23 456789012345678901...
> CK FB 1 2 3 4 5 6 7 8 9
>
> I suggest keeping things at a 2-bit boundary because it makes working
> out base-space / colour-space positions a little easier.
>
> Bit 0: Colour-space flag (1 -- kmer in colour space, 0 otherwise)
> Bit 1: First-base known flag (1 -- first base is known, 0 otherwise)
>        [always set to 1 if bit 0 is 0 -- i.e. in base-space]
>
> Bit 2-3:
>   bit 1 == 1:
>     First-base 00/01/10/11 -> A/C/G/T [as usual]
>   otherwise:
>     ??? possibly checksum (modulo 4 sum of 2-bit chunks)
>

In my opinion, the color-space flag is not necessary. So we need 3 bits,
padded to 4:

  1 bit: is the first base known?
  1 bit for padding
  2 bits: if so, what is the first base?

I agree with you overall, but I think it would be easier to put this
information in the last 4 bits of a Kmer (a k-mer being an array of
uint64_t).

I like your idea because it does not change Read.h.

> Bit 4 onwards:
>   sequence in colour-space, or remaining sequence in base-space
>
> Adding this will use 2 extra bits, making the max kmer length for one
> 64-bit value 31 bases:

4 bits, not 2 bits?

> #define KMER_REQUIRED_BITS (2*MAXKMERLENGTH+2)
>
> Note that the first base is stored in bits 2-3, so that increases the
> kmer size by 1 for base-space sequences, making the effective length for
> a given Kmer the same in both base-space and colour-space.
>

Yes, if these bits are at the beginning. But you need to change some code
before claiming victory.

> When hashing, bits 1,2,3 should not be considered, because they could
> change over the course of the search / assembly process.
>

Again, I think storing these 4 bits at the end and increasing the required
bits by 4 does the job, and is easier to implement.

> for storing/unpacking the kmer in base-space, just increment the
> initial bit location by 1 (i.e. i=0 -> i=1).
>

Well, I am not working on color-space right now.

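A minimal sketch of the "4 bits at the end" layout suggested above, for a
single 64-bit word. Everything here is an assumption made for illustration:
the constant and function names are hypothetical, the exact bit positions are
one possible choice, and none of this is taken from Ray's actual Kmer class.
With 4 metadata bits, one word holds (64 - 4) / 2 = 30 two-bit symbols.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout for one 64-bit word (NOT Ray's actual Kmer class):
//   bits 0..59  : up to 30 two-bit symbols (bases or colours)
//   bit  60     : first-base-known flag
//   bit  61     : padding (unused)
//   bits 62..63 : first base, 00/01/10/11 -> A/C/G/T
constexpr int kMetaShift = 60;
constexpr uint64_t kSequenceMask = (uint64_t{1} << kMetaShift) - 1;
constexpr uint64_t kKnownFlag = uint64_t{1} << 60;
constexpr int kFirstBaseShift = 62;

// Record a known first base (0..3) in the trailing metadata bits.
inline uint64_t setFirstBase(uint64_t word, unsigned base) {
    assert(base < 4);
    word &= kSequenceMask;                                  // clear old metadata
    word |= kKnownFlag;                                     // mark first base as known
    word |= static_cast<uint64_t>(base) << kFirstBaseShift; // store the base itself
    return word;
}

inline bool firstBaseKnown(uint64_t word) {
    return (word & kKnownFlag) != 0;
}

// Meaningful only when firstBaseKnown(word) is true.
inline unsigned firstBase(uint64_t word) {
    return static_cast<unsigned>(word >> kFirstBaseShift);
}

// Bits that should feed the hash: the metadata is masked out because it
// may change during the search / assembly process.
inline uint64_t hashableBits(uint64_t word) {
    return word & kSequenceMask;
}
```

Keeping the metadata in the trailing bits means the sequence symbols stay at
their existing positions, so only hashing and comparison need to apply the
mask.
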
> For checking forward sequences, equality for two sequences in base-space
> compares bits 2 onwards. Equality for two sequences in colour-space
> compares bits 4 onwards, then (if a match is found), check bit 1. If bit
> 1 is the same in both sequences, declare mismatch if bits 2-3 differ,
> otherwise match. If bit 1 is different, declare a match, then copy over
> the first base from the sequence which has bit 1 set.
>
> For comparing base-space against colour-space, first check to see if the
> colour-space sequence has a known first-base, report a mismatch if the
> first base is known and different from the base-space sequence.
> Otherwise, the base-space sequence (including first base) needs to be
> converted to colour-space, then compared. If you're doing that anyway,
> it might make sense to store the converted sequence to make subsequent
> comparisons a bit quicker (either as well as the original, or by
> deleting the original base-space sequence). There could be a compare()
> function that returns the converted (and matching) colour-space
> sequence, otherwise returns an invalid packed sequence (e.g. 00...00, or
> some other number with a checksum mismatch).
>
> Given that kmer.cpp is fairly small, I should be able to manage
> implementing these changes somewhat quickly. However, if any other
> classes assume a particular format for the Kmers (e.g. the Read class),
> then those will need to be changed as well (ideally so that they no
> longer expect a particular format). If a checksum is used, and it is
> enforced that ^00 can't happen in a true structure, it should be
> possible to identify if any Kmer-modifying classes have this assumption.
>
> Hope this helps,
>
> - David Eccles (gringer)
>

Sébastien

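The colour-space equality rule quoted above (compare the colours first, then
reconcile the first-base flag) could be sketched roughly as follows. The
struct and function names are hypothetical and use an unpacked view purely
for readability; treating two k-mers with unknown first bases as a match is
an assumption, since the quoted text leaves the contents of those bits open.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical unpacked view of a colour-space k-mer (illustration only,
// not the packed format discussed in the thread).
struct ColourKmer {
    uint64_t colours;     // packed 2-bit colours
    bool firstBaseKnown;  // the "first-base known" flag
    uint8_t firstBase;    // 0..3 -> A/C/G/T, meaningful only if firstBaseKnown
};

// Returns true on a match. When exactly one k-mer knows its first base,
// that base is copied into the other one, as the quoted text describes.
inline bool matchColourSpace(ColourKmer& a, ColourKmer& b) {
    if (a.colours != b.colours) {
        return false;  // colour sequences differ
    }
    if (a.firstBaseKnown == b.firstBaseKnown) {
        // Both known: the first bases must agree.
        // Both unknown: assumed to match (nothing to compare).
        return !a.firstBaseKnown || a.firstBase == b.firstBase;
    }
    // Flags differ: declare a match and propagate the known first base.
    ColourKmer& known = a.firstBaseKnown ? a : b;
    ColourKmer& unknown = a.firstBaseKnown ? b : a;
    unknown.firstBase = known.firstBase;
    unknown.firstBaseKnown = true;
    return true;
}
```

Because a successful comparison can modify one of its arguments, both are
taken by non-const reference; a variant that returns the merged k-mer instead
would avoid that side effect.
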
_______________________________________________
Denovoassembler-users mailing list
Denovoassembler-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users