On Tue, 28 Jun 2011 19:49:42 David Eccles (gringer) wrote:
> I need to go a bit deeper and replicate this kind of format for Kmers
> for it to be really useful, but I was able to fit things in this far
> without changing too much other code.

This will probably be quite a far-reaching change. To conserve
memory/space, I think it would be reasonable to put the colour-space
and first-base flags into the m_u64 of the Kmers, something like this:

    01 23 456789012345678901...
    CK FB 1 2 3 4 5 6 7 8 9

I suggest keeping things at a 2-bit boundary because it makes working
out base-space / colour-space positions a little easier.

Bit 0:         Colour-space flag (1 -- kmer in colour space, 0 otherwise)
Bit 1:         First-base known flag (1 -- first base is known, 0 otherwise)
               [always set to 1 if bit 0 is 0 -- i.e. in base-space]
Bit 2-3:       bit 1 == 1: First-base 00/01/10/11 -> A/C/G/T [as usual]
               otherwise: ??? possibly checksum (modulo 4 sum of 2-bit chunks)
Bit 4 onwards: sequence in colour-space, or remaining sequence in base-space

Adding this will use 2 extra bits, making the max kmer length for one
64-bit value 31 bases:

    #define KMER_REQUIRED_BITS (2*MAXKMERLENGTH+2)

Note that the first base is stored in bits 2-3, so that increases the
kmer size by 1 for base-space sequences, making the effective length
for a given Kmer the same in both base-space and colour-space.

When hashing, bits 1, 2 and 3 should not be considered, because they
could change over the course of the search / assembly process.

For storing/unpacking the kmer in base-space, just increment the
initial bit location by 1 (i.e. i=0 -> i=1).

For checking forward sequences, equality for two sequences in
base-space compares bits 2 onwards.

Equality for two sequences in colour-space compares bits 4 onwards,
then (if a match is found) checks bit 1. If bit 1 is the same in both
sequences, declare a mismatch if bits 2-3 differ, otherwise a match.
If bit 1 is different, declare a match, then copy over the first base
from the sequence which has bit 1 set.

For comparing base-space against colour-space, first check to see if
the colour-space sequence has a known first base, and report a mismatch
if the first base is known and different from the base-space sequence.
Otherwise, the base-space sequence (including first base) needs to be
converted to colour-space, then compared. If you're doing that anyway,
it might make sense to store the converted sequence to make subsequent
comparisons a bit quicker (either as well as the original, or by
deleting the original base-space sequence). There could be a compare()
function that returns the converted (and matching) colour-space
sequence, otherwise returns an invalid packed sequence (e.g. 00...00,
or some other number with a checksum mismatch).

Given that kmer.cpp is fairly small, I should be able to manage
implementing these changes somewhat quickly. However, if any other
classes assume a particular format for the Kmers (e.g. the Read class),
then those will need to be changed as well (ideally so that they no
longer expect a particular format). If a checksum is used, and it is
enforced that ^00 can't happen in a true structure, it should be
possible to identify if any Kmer-modifying classes have this
assumption.

Hope this helps,

- David Eccles (gringer)

_______________________________________________
Denovoassembler-users mailing list
Denovoassembler-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
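
For reference, here is a minimal sketch of the proposed header bits, the
hashing mask and the colour-space equality rule from the message above.
It is only an illustration under stated assumptions: bit 0 is taken to
be the least significant bit of the 64-bit word (the message does not
say which end is meant), and every name below (packKmer, hashInput,
matchColourSpace and the constants) is hypothetical rather than taken
from kmer.cpp.

```cpp
#include <cassert>
#include <cstdint>

// ASSUMPTION: bit 0 is the least significant bit of the 64-bit word.
// All names here are hypothetical; none come from kmer.cpp.
constexpr uint64_t COLOUR_SPACE_FLAG = 1ULL << 0;  // bit 0
constexpr uint64_t FIRST_BASE_KNOWN  = 1ULL << 1;  // bit 1
constexpr uint64_t FIRST_BASE_MASK   = 3ULL << 2;  // bits 2-3
// Bits 1-3 may change during assembly, so they are left out of the hash.
constexpr uint64_t HASH_IGNORE_MASK  = FIRST_BASE_KNOWN | FIRST_BASE_MASK;

inline bool isColourSpace(uint64_t k)  { return (k & COLOUR_SPACE_FLAG) != 0; }
inline bool firstBaseKnown(uint64_t k) { return (k & FIRST_BASE_KNOWN) != 0; }
inline unsigned firstBase(uint64_t k)  { return static_cast<unsigned>((k >> 2) & 3); }

// Pack the flags, the first base (0..3 -> A/C/G/T) and the remaining
// 2-bit symbols (payload, stored from bit 4 onwards) into one word.
inline uint64_t packKmer(bool colour, bool known, unsigned base, uint64_t payload) {
    assert((payload >> 60) == 0);  // payload must fit in the upper 60 bits
    return (payload << 4)
         | (static_cast<uint64_t>(base & 3) << 2)
         | (known ? FIRST_BASE_KNOWN : 0)
         | (colour ? COLOUR_SPACE_FLAG : 0);
}

// Value to feed into the hash function: bits 1, 2 and 3 are cleared.
inline uint64_t hashInput(uint64_t k) { return k & ~HASH_IGNORE_MASK; }

// Equality for two colour-space k-mers, following the rule in the message:
// compare bits 4 onwards, then bit 1, then bits 2-3. If only one side knows
// its first base, declare a match and copy that base to the other side.
inline bool matchColourSpace(uint64_t &a, uint64_t &b) {
    if ((a >> 4) != (b >> 4)) {
        return false;
    }
    if (firstBaseKnown(a) == firstBaseKnown(b)) {
        return firstBase(a) == firstBase(b);
    }
    if (firstBaseKnown(a)) {
        b = (b & ~HASH_IGNORE_MASK) | (a & HASH_IGNORE_MASK);
    } else {
        a = (a & ~HASH_IGNORE_MASK) | (b & HASH_IGNORE_MASK);
    }
    return true;
}
```

With this layout, two colour-space k-mers that differ only in whether the
first base is known give the same hashInput() value, so filling in the
first base later does not move an entry to a different hash bucket. The
base-space versus colour-space comparison is left out because it depends
on the colour-space conversion, which the message does not specify.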