On Tue, 28 Jun 2011 19:49:42 David Eccles (gringer) wrote:
> I need to go a bit deeper and replicate this kind of format for Kmers
> for it to be really useful, but I was able to fit things in this far
> without changing too much other code.

This will probably be quite a far-reaching change. To conserve 
memory/space, I think it would be reasonable to put the colour-space and 
first base flags into the m_u64 of the Kmers, something like this:

01 23 456789012345678901...
CK FB  1 2 3 4 5 6 7 8 9

I suggest keeping things at a 2-bit boundary because it makes working 
out base-space / colour-space positions a little easier.

Bit 0: Colour-space flag (1 -- kmer in colour space, 0 otherwise)
Bit 1: First-base known flag (1 -- first base is known, 0 otherwise)
        [always set to 1 if bit 0 is 0 -- i.e. in base-space]

Bit 2-3:
  bit 1 == 1:
    First-base  00/01/10/11 -> A/C/G/T [as usual]
  otherwise:
    ??? possibly checksum (modulo 4 sum of 2-bit chunks)

Bit 4 onwards:
  sequence in colour-space, or remaining sequence in base-space
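
To make the layout above concrete, here is a rough sketch of the masks and
accessors it implies. All names here are made up for illustration (they are
not from kmer.cpp), and I am assuming "bit 0" means the least significant bit
of the 64-bit word; the diagram could equally be read the other way round.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: bit 0 is the least significant bit of the packed word.
constexpr std::uint64_t COLOUR_SPACE_FLAG     = 1ULL << 0; // bit 0
constexpr std::uint64_t FIRST_BASE_KNOWN_FLAG = 1ULL << 1; // bit 1
constexpr int FIRST_BASE_SHIFT = 2;                        // bits 2-3
constexpr int SEQUENCE_SHIFT   = 4;                        // bit 4 onwards

inline bool isColourSpace(std::uint64_t word) {
    return (word & COLOUR_SPACE_FLAG) != 0;
}

inline bool isFirstBaseKnown(std::uint64_t word) {
    return (word & FIRST_BASE_KNOWN_FLAG) != 0;
}

// Bits 2-3: the first base (A/C/G/T as 0/1/2/3) when bit 1 is set.
inline unsigned firstBaseCode(std::uint64_t word) {
    return static_cast<unsigned>((word >> FIRST_BASE_SHIFT) & 3);
}

// Builds the low four bits of a packed word (hypothetical helper).
inline std::uint64_t makeHeader(bool colourSpace, bool firstBaseKnown,
                                unsigned bits2to3) {
    assert(bits2to3 < 4);
    // Base-space kmers always know their first base.
    assert(colourSpace || firstBaseKnown);
    std::uint64_t word = 0;
    if (colourSpace)    word |= COLOUR_SPACE_FLAG;
    if (firstBaseKnown) word |= FIRST_BASE_KNOWN_FLAG;
    word |= static_cast<std::uint64_t>(bits2to3) << FIRST_BASE_SHIFT;
    return word;
}
```

The sequence itself would then be OR-ed in from SEQUENCE_SHIFT upwards.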

Adding this will use 2 extra bits, making the max kmer length for one 
64-bit value 31 bases:
  #define KMER_REQUIRED_BITS (2*MAXKMERLENGTH+2)
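
As a quick check of that arithmetic, something like the following could be
used. The function names are hypothetical; only MAXKMERLENGTH and the
2*MAXKMERLENGTH+2 formula come from the define above.

```cpp
#include <cstdint>

// Same formula as the proposed KMER_REQUIRED_BITS, as a function of the
// maximum kmer length so that several values can be checked.
constexpr int requiredBits(int maxKmerLength) {
    return 2 * maxKmerLength + 2;
}

// Number of 64-bit words needed to hold that many bits (rounded up).
constexpr int requiredWords(int maxKmerLength) {
    return (requiredBits(maxKmerLength) + 63) / 64;
}

// 31 bases plus the 2 flag bits fill one 64-bit value exactly ...
static_assert(requiredBits(31) == 64, "31 bases should need 64 bits");
static_assert(requiredWords(31) == 1, "31 bases should fit in one word");
// ... and one more base spills over into a second word.
static_assert(requiredWords(32) == 2, "32 bases should need two words");
```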

Note that the first base is stored in bits 2-3, so that increases the 
kmer size by 1 for base-space sequences, making the effective length for 
a given Kmer the same in both base-space and colour-space.

When hashing, bits 1,2,3 should not be considered, because they could 
change over the course of the search / assembly process.
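
A minimal sketch of what that masking might look like, again assuming bit 0 is
the least significant bit. The mixing function is only a placeholder standing
in for whatever hash is actually used, and the names are made up.

```cpp
#include <cstdint>

// Bits 1, 2 and 3 (first-base-known flag and first base / checksum)
// may change during assembly, so they are excluded from the hash input.
constexpr std::uint64_t MUTABLE_BITS_MASK = 0xEULL;

constexpr std::uint64_t hashableBits(std::uint64_t word) {
    return word & ~MUTABLE_BITS_MASK;
}

// Placeholder mixer, not the real hash function.
constexpr std::uint64_t mixOnce(std::uint64_t x) {
    return (x ^ (x >> 33)) * 0xff51afd7ed558ccdULL;
}

// Two kmers that differ only in bits 1-3 end up in the same bucket.
constexpr std::uint64_t hashKmerWord(std::uint64_t word) {
    return mixOnce(mixOnce(hashableBits(word)));
}
```

Bit 0 (the colour-space flag) stays in the hash input here, so a base-space
and a colour-space kmer with the same remaining bits are still hashed from
different inputs.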

For storing/unpacking the kmer in base-space, just increment the 
initial bit location by 1 (i.e. i=0 -> i=1).
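
Roughly, packing and unpacking in base-space could then look like this. It is
a sketch only: the function names are invented, it handles a single 64-bit
word, and it assumes chunk i lives at bits 2i and 2i+1 with bit 0 as the least
significant bit.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

constexpr std::uint64_t FIRST_BASE_KNOWN_FLAG = 1ULL << 1; // bit 1

inline unsigned baseToCode(char base) {
    switch (base) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
    }
    assert(false && "unexpected base");
    return 0;
}

// Packs up to 31 bases; chunk 0 is reserved for the two flag bits.
inline std::uint64_t packBaseSpace(const std::string& bases) {
    assert(bases.size() <= 31);
    // Base-space: colour-space flag 0, first-base-known flag 1.
    std::uint64_t word = FIRST_BASE_KNOWN_FLAG;
    for (std::size_t j = 0; j < bases.size(); ++j) {
        std::size_t i = j + 1; // start at i=1 instead of i=0
        word |= static_cast<std::uint64_t>(baseToCode(bases[j])) << (2 * i);
    }
    return word;
}

// Reads k bases back out, again starting at chunk 1.
inline std::string unpackBaseSpace(std::uint64_t word, std::size_t k) {
    assert(k <= 31);
    static const char CODE_TO_BASE[] = "ACGT";
    std::string bases;
    for (std::size_t i = 1; i <= k; ++i) {
        bases += CODE_TO_BASE[(word >> (2 * i)) & 3];
    }
    return bases;
}
```

With this layout the first base automatically lands in bits 2-3, which is
where the colour-space format expects it.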

For checking forward sequences, equality for two sequences in base-space 
compares bits 2 onwards. Equality for two sequences in colour-space 
compares bits 4 onwards, then (if a match is found) checks bit 1. If bit 
1 is the same in both sequences, declare a mismatch if bits 2-3 differ, 
otherwise a match. If bit 1 differs, declare a match, then copy over 
the first base from the sequence which has bit 1 set.
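
The comparison rules above could be sketched along these lines. Names are
hypothetical, and I am assuming that copying over the first base also means
setting bit 1 on the sequence that receives it (which seems implied, but is
not spelled out above).

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t COLOUR_SPACE_FLAG     = 1ULL << 0; // bit 0
constexpr std::uint64_t FIRST_BASE_KNOWN_FLAG = 1ULL << 1; // bit 1
constexpr std::uint64_t FIRST_BASE_MASK       = 3ULL << 2; // bits 2-3

// Base-space: compare bits 2 onwards.
inline bool equalBaseSpace(std::uint64_t a, std::uint64_t b) {
    return (a >> 2) == (b >> 2);
}

// Colour-space: compare bits 4 onwards, then reconcile the first base.
// Takes references because a match may fill in a missing first base.
inline bool matchColourSpace(std::uint64_t& a, std::uint64_t& b) {
    assert((a & COLOUR_SPACE_FLAG) && (b & COLOUR_SPACE_FLAG));
    if ((a >> 4) != (b >> 4)) {
        return false;
    }
    const bool aKnown = (a & FIRST_BASE_KNOWN_FLAG) != 0;
    const bool bKnown = (b & FIRST_BASE_KNOWN_FLAG) != 0;
    if (aKnown == bKnown) {
        // Same bit 1: mismatch if bits 2-3 differ, otherwise match.
        return (a & FIRST_BASE_MASK) == (b & FIRST_BASE_MASK);
    }
    // Exactly one first base is known: copy it (and bit 1) across.
    const std::uint64_t copied = FIRST_BASE_MASK | FIRST_BASE_KNOWN_FLAG;
    if (aKnown) {
        b = (b & ~copied) | (a & copied);
    } else {
        a = (a & ~copied) | (b & copied);
    }
    return true;
}
```

After a successful match where only one side knew the first base, both words
are identical.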

For comparing base-space against colour-space, first check to see if the 
colour-space sequence has a known first-base, report a mismatch if the 
first base is known and different from the base-space sequence. 
Otherwise, the base-space sequence (including first base) needs to be 
converted to colour-space, then compared. If you're doing that anyway, 
it might make sense to store the converted sequence to make subsequent 
comparisons a bit quicker (either as well as the original, or by 
deleting the original base-space sequence). There could be a compare() 
function that returns the converted (and matching) colour-space 
sequence, otherwise returns an invalid packed sequence (e.g. 00...00, or 
some other number with a checksum mismatch).
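
Here is one way such a compare() might be sketched. The function names and the
explicit kmer length parameter are my own additions. It uses the usual
colour-space encoding, where a colour is the XOR of the 2-bit codes of two
adjacent bases (with A/C/G/T as 00/01/10/11), and returns 00...00 as the
invalid value, which can never be a valid converted kmer because bits 0 and 1
are always set in one.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t COLOUR_SPACE_FLAG     = 1ULL << 0; // bit 0
constexpr std::uint64_t FIRST_BASE_KNOWN_FLAG = 1ULL << 1; // bit 1
constexpr std::uint64_t FIRST_BASE_MASK       = 3ULL << 2; // bits 2-3
constexpr std::uint64_t INVALID_KMER          = 0;         // 00...00

// Converts a packed base-space kmer of k bases into colour-space,
// keeping the first base in bits 2-3.
inline std::uint64_t toColourSpace(std::uint64_t baseWord, int k) {
    assert(k >= 1 && k <= 31);
    assert((baseWord & COLOUR_SPACE_FLAG) == 0);
    std::uint64_t out = COLOUR_SPACE_FLAG | FIRST_BASE_KNOWN_FLAG
                      | (baseWord & FIRST_BASE_MASK);
    for (int i = 2; i <= k; ++i) {
        const std::uint64_t prev = (baseWord >> (2 * (i - 1))) & 3;
        const std::uint64_t cur  = (baseWord >> (2 * i)) & 3;
        out |= (prev ^ cur) << (2 * i); // colour = XOR of adjacent bases
    }
    return out;
}

// Returns the converted colour-space kmer on a match, else INVALID_KMER.
inline std::uint64_t compareBaseToColour(std::uint64_t baseWord,
                                         std::uint64_t colourWord, int k) {
    // Cheap rejection: known first base that differs.
    if ((colourWord & FIRST_BASE_KNOWN_FLAG) &&
        (colourWord & FIRST_BASE_MASK) != (baseWord & FIRST_BASE_MASK)) {
        return INVALID_KMER;
    }
    const std::uint64_t converted = toColourSpace(baseWord, k);
    if ((converted >> 4) != (colourWord >> 4)) {
        return INVALID_KMER;
    }
    return converted;
}
```

The caller can keep the returned value in place of (or alongside) the original
base-space kmer, so the conversion is only done once.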

Given that kmer.cpp is fairly small, I should be able to manage 
implementing these changes somewhat quickly. However, if any other 
classes assume a particular format for the Kmers (e.g. the Read class), 
then those will need to be changed as well (ideally so that they no 
longer expect a particular format). If a checksum is used, and it is 
enforced that ^00 can't happen in a true structure, it should be 
possible to identify if any Kmer-modifying classes have this assumption.

Hope this helps,

- David Eccles (gringer)

_______________________________________________
Denovoassembler-users mailing list
Denovoassembler-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users
