Re: [Denovoassembler-users] Kmer formats and colour-space / bit logic

Eccles, David Tue, 05 Jul 2011 04:01:06 -0700

> p.s.: To decode color-space manually, I use the automate in panel B
> of
>
> http://www.ploscompbiol.org/article/slideshow.action?uri=info:doi/10.1371/journal.pcbi.1000386&imageURI=info:doi/10.1371/journal.pcbi.1000386.g002


Yes, thanks for that. I saw this link in the code, and it's now also
what I refer back to when thinking about colour-space transforms.

> Color-space is not necessary I think,
> m_parameters->getColorSpaceMode does that already.

But then you can't do nifty tricks like matching colour-space to 
base-space, which *can* be done by using a different k-mer format.

> So we need 3 bits:
> 1 bit: is the first base known ?
> 1 bit for padding
> 2 bits: if so, what is the first base ?

I realised that only 1 extra bit would be necessary, but given that
it's much easier to have 2-bit alignments, it made sense to have every
bit used (i.e. no padding bits). Hence the first base known +
colour-space flag. I also realised that there's still unused bits (for
an unknown first base in colour-space), so I've used those bits for a
checksum -- it makes the code a bit more complex, but has the benefit
of being able to identify data corruption.
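
To make the idea concrete, here is a minimal sketch of such a header. The exact
layout is an assumption for illustration only (flags in the lowest four bits of
the first 64-bit word, a plain XOR as the checksum), and the names packHeader,
checksum2 and looksCorrupt are hypothetical, not functions in Ray:

```cpp
#include <cassert>
#include <stdint.h>

// Assumed layout of the lowest 4 bits of the first 64-bit word:
//   bit 0     : 1 if the first base is known
//   bit 1     : 1 if the remaining symbols are colours rather than bases
//   bits 2..3 : the first base if known, otherwise a 2-bit checksum
const uint64_t FLAG_FIRST_BASE_KNOWN = 0x1;
const uint64_t FLAG_COLOUR_SPACE = 0x2;
const int FIRST_BASE_SHIFT = 2;
const int PAYLOAD_SHIFT = 4;

// XOR of the thirty 2-bit symbols in the payload (a deliberately simple checksum).
uint64_t checksum2(uint64_t payload) {
    uint64_t sum = 0;
    for (int i = 0; i < 30; i++) {
        sum ^= (payload >> (2 * i)) & 0x3;
    }
    return sum;
}

// Build one 64-bit word from a 60-bit payload and the header fields.
uint64_t packHeader(uint64_t payload, bool colourSpace,
                    bool firstBaseKnown, uint64_t firstBase) {
    assert(payload < (1ULL << 60));
    assert(firstBase < 4);
    uint64_t word = payload << PAYLOAD_SHIFT;
    if (colourSpace) {
        word |= FLAG_COLOUR_SPACE;
    }
    if (firstBaseKnown) {
        word |= FLAG_FIRST_BASE_KNOWN;
        word |= firstBase << FIRST_BASE_SHIFT;
    } else {
        // The first-base slot is free, so reuse it for the checksum.
        word |= checksum2(payload) << FIRST_BASE_SHIFT;
    }
    return word;
}

// Only words without a known first base carry a checksum to verify.
bool looksCorrupt(uint64_t word) {
    if (word & FLAG_FIRST_BASE_KNOWN) {
        return false;
    }
    uint64_t payload = word >> PAYLOAD_SHIFT;
    uint64_t stored = (word >> FIRST_BASE_SHIFT) & 0x3;
    return stored != checksum2(payload);
}
```

A 2-bit XOR will miss many multi-bit errors, so this only shows where a
checksum could live, not how strong it should be.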

> I agree with you overall, but I think it would be easier to put these
> informations in the last 4 bits of a Kmer (a k-mer being an array of
> uint64_t).

So in positions 60-63 when using 1 64-bit number, positions 124-127
when using 2, etc.? That means the location of the flags is less easy
to determine. I suppose you could put them always in positions 60-63
(i.e. at the end of the first array entry), but that's pretty much the
same as positions 0-3.

>> Adding this will use 2 extra bits, making the max kmer length for one
>> 64-bit value 31 bases
> 4 bits, not 2 bits

2 additional bits. Consider that a colour-space sequence with starting
base is the same length as the equivalent base-space sequence [with 
starting base]:

T 0122123
T TGAGTCG

The first base from both colour-space and base-space is stored in the
'first base' location in the modified k-mer. The k-mer length is 1 +
the number of bases stored in the remainder of the array.
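
The T 0122123 example can be checked with a short sketch. It relies on the
usual property of colour-space that, with A=0, C=1, G=2, T=3, each colour is
the XOR of the two adjacent base codes. The helper names below are
hypothetical and not taken from Ray:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Map a base to its 2-bit code (A=0, C=1, G=2, T=3); -1 for anything else.
int baseToCode(char base) {
    switch (base) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
    }
    return -1;
}

// Map a 2-bit code back to its base letter.
char codeToBase(int code) {
    return "ACGT"[code & 3];
}

// Decode a string of colours ('0'..'3') given the known first base.
// Each colour is XORed onto the previous base code to get the next base.
std::string decodeColours(char firstBase, const std::string& colours) {
    int previous = baseToCode(firstBase);
    assert(previous >= 0);
    std::string bases;
    for (std::size_t i = 0; i < colours.size(); i++) {
        assert(colours[i] >= '0' && colours[i] <= '3');
        previous ^= (colours[i] - '0');
        bases += codeToBase(previous);
    }
    return bases;
}
```

Here decodeColours('T', "0122123") returns "TGAGTCG": seven colours give seven
bases, and the starting T sits in the 'first base' slot in both forms.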

> Routines for k-mers are in core/common_functions.h and
> structures/Kmer.h -- I would not call that 'all over the place' given
> the ~20k lines of Ray.

This is very much a personal style issue, but code is more readable
and easier for me to work with if class-specific things go into the
particular class they are working on -- this is why I've been changing
code locations as I work through the code. Certainly, if these k-mer
routines were only in common_functions/Kmer, it would be almost no
problem to change the k-mer format. However, just doing a search in
the code for getU64/setU64 demonstrates that assumptions about a
particular k-mer format are made in a few other places in the code:

$ grep -rl '\(getU64\|setU64\)' code/* unit-tests/*
code/assembler/KmerAcademyBuilder.cpp
code/assembler/VerticesExtractor.cpp
code/assembler/FusionData.cpp
code/communication/MessageProcessor.cpp
code/core/common_functions.cpp
code/core/common_functions.h
code/structures/Kmer.cpp
code/structures/Kmer.h
unit-tests/test_uniform.cpp

> #define RAY_NUCLEOTIDE_A 0 /* ~00 == 11 */
> #define RAY_NUCLEOTIDE_C 1 /* ~01 == 10 */
> #define RAY_NUCLEOTIDE_G 2 /* ~10 == 01 */
> #define RAY_NUCLEOTIDE_T 3 /* ~11 == 00 */

Just as another warning, I tripped myself up when I was adjusting the
k-mer format because when I think arrays, I think lower number at the
left, but our number system has the lower number on the right. This
also applies to declaring binary numbers in C++ (i.e. 0bXXXXX), so
0b1100 is 12 in decimal, rather than 3.
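
A small sketch of both points, the complement-by-NOT trick in the defines
above and the right-to-left slot order. The function names are hypothetical,
and I've written the constant in hex so that it does not depend on compiler
support for 0b literals:

```cpp
#include <cassert>
#include <stdint.h>

// Complement of a 2-bit nucleotide code: A(00)<->T(11), C(01)<->G(10).
// The mask keeps only the low 2 bits after the bitwise NOT.
uint64_t complementCode(uint64_t code) {
    assert(code < 4);
    return ~code & 0x3;
}

// Read the 2-bit symbol in a given slot, where slot 0 is the least
// significant pair, i.e. the rightmost two digits when written in binary.
uint64_t symbolAt(uint64_t word, int slot) {
    assert(slot >= 0 && slot < 32);
    return (word >> (2 * slot)) & 0x3;
}
```

With word = 0xC (1100 in binary, 12 in decimal), symbolAt(word, 0) is 0 (A) and
symbolAt(word, 1) is 3 (T), so the "first" symbol is the one written on the
right.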

> you need to change some code before claiming a victory.

Almost done, I think. The unit tests are (finally) passing, but I
still need to implement extracting a colour-space k-mer from the
middle of a read, and add in my own unit-tests to make sure my 
additional code is doing the right thing. My current thoughts are 
that the first-base would only be filled in for the first 20 or 
so bases of a read (customisable, of course).

It would also be nice if comparison functions (.isEqual, etc.) filled
in unknown bases if tested against something that matched everywhere
else. That would probably mean not using operators, because they
assume that the LHS/RHS values aren't modified in the comparison
process -- this is a good assumption to hold onto, so perhaps instead
of using those functions, there should be something like a
'.equalsAndCopyKnownFirstBase(Kmer a)' function.
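
Something along these lines is what I have in mind. It is only a sketch:
SketchKmer is a hypothetical stand-in for the real Kmer class, and I've
written it as a free function taking both sides by non-const reference (an
assumption, not the member signature above) to make it obvious that either
argument may be modified:

```cpp
#include <cassert>
#include <stdint.h>

// Hypothetical simplified k-mer: one optional first base plus packed symbols.
struct SketchKmer {
    bool firstBaseKnown;
    uint8_t firstBase;   // 2-bit code, meaningful only if firstBaseKnown
    uint64_t symbols;    // remaining 2-bit symbols, packed
};

// Returns true if the two k-mers are compatible. If exactly one side knows
// its first base and everything else matches, the base is copied across.
// Assumes both k-mers are stored in the same space (both colours or both bases).
bool equalsAndCopyKnownFirstBase(SketchKmer& a, SketchKmer& b) {
    if (a.symbols != b.symbols) {
        return false;
    }
    if (a.firstBaseKnown && b.firstBaseKnown) {
        return a.firstBase == b.firstBase;
    }
    if (a.firstBaseKnown) {
        b.firstBaseKnown = true;
        b.firstBase = a.firstBase;
    } else if (b.firstBaseKnown) {
        a.firstBaseKnown = true;
        a.firstBase = b.firstBase;
    }
    return true;
}
```

Nothing is copied when the symbols differ, so a failed comparison leaves both
k-mers untouched.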

 -- David Eccles (gringer)
