[Denovoassembler-users] RE : Kmer formats and colour-space / bit logic

Sébastien Boisvert Wed, 06 Jul 2011 15:08:29 -0700

Hi !

> ________________________________________
> From: Eccles, David [david.ecc...@mpi-muenster.mpg.de]
> Sent: 5 July 2011 06:55
> To: Sébastien Boisvert
> Cc: denovoassembler-users@lists.sourceforge.net
> Subject: RE: Kmer formats and colour-space / bit logic
> 
>> p.s.: To decode color-space manually, I use the automaton in panel B of
>> http://www.ploscompbiol.org/article/slideshow.action?uri=info:doi/10.1371/journal.pcbi.1000386&imageURI=info:doi/10.1371/journal.pcbi.1000386.g002
> 
> Yes, thanks for that. I saw this link in the code, and it's now also
> what I refer back to when thinking about colour-space transforms.
> 
>> Color-space is not necessary I think,
>> m_parameters->getColorSpaceMode does that already.
> 
> But then you can't do nifty tricks like matching colour-space to
> base-space, which *can* be done by using a different k-mer format.
>


What exactly do you mean here?

>> So we need 3 bits:
>> 1 bit: is the first base known ?
>> 1 bit for padding
>> 2 bits: if so, what is the first base ?
> 
> I realised that only 1 extra bit would be necessary, but given that
> it's much easier to have 2-bit alignments, it made sense to have every
> bit used (i.e. no padding bits). Hence the first base known +
> colour-space flag. I also realised that there's still unused bits (for
> an unknown first base in colour-space), so I've used those bits for a
> checksum -- it makes the code a bit more complex, but has the benefit
> of being able to identify data corruption.
> 

I don't see the point of checksums for k-mers: the only k-mer data that are
communicated travel through the message-passing interface, and I believe the
underlying transport layers (TCP, InfiniBand, or another one) already verify
data integrity.

>> I agree with you overall, but I think it would be easier to put this
>> information in the last 4 bits of a Kmer (a k-mer being an array of
>> uint64_t).
> 
> So in positions 60-63 when using 1 64-bit number, positions 124-127
> when using 2, etc.? That means the location of the flags is less easy
> to determine. I suppose you could put them always in positions 60-63
> (i.e. at the end of the first array entry), but that's pretty much the
> same as positions 0-3.
> 

The flags are easy to locate: their starting bit is simply 2*kmerLength,
assuming kmerLength+2<=MAXMERLENGTH.

I know that doing it this way would not break the code. I think you would
just need to change the hashing functions so that they reset (set to 0) all
the fields starting at bit 2*kmerLength in a Kmer.
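
A minimal sketch of that layout, assuming a k-mer held in a fixed array of
uint64_t words with 2 bits per nucleotide and nucleotide i stored at bits 2*i
and 2*i+1. The names KmerWords, WORDS, getField, setField and clearFromBit
are hypothetical and are not taken from Ray's code; they only illustrate
placing the extra fields at bit 2*kmerLength and zeroing them before hashing.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical storage: 2 words = 128 bits = 64 two-bit fields.
constexpr int WORDS = 2;
using KmerWords = std::array<uint64_t, WORDS>;

// Read the 2-bit field that starts at bit offset `bit` (must be even).
inline unsigned getField(const KmerWords& k, int bit) {
    assert(bit % 2 == 0 && bit >= 0 && bit < 64 * WORDS);
    return static_cast<unsigned>((k[bit / 64] >> (bit % 64)) & 3u);
}

// Overwrite the 2-bit field that starts at bit offset `bit`.
inline void setField(KmerWords& k, int bit, unsigned value) {
    assert(bit % 2 == 0 && bit >= 0 && bit < 64 * WORDS);
    const uint64_t mask = static_cast<uint64_t>(3) << (bit % 64);
    const uint64_t bits = static_cast<uint64_t>(value & 3u) << (bit % 64);
    k[bit / 64] = (k[bit / 64] & ~mask) | bits;
}

// Return a copy with every bit from `bit` upward set to 0, so that a
// hash computed on the copy depends on the nucleotides only.
inline KmerWords clearFromBit(KmerWords k, int bit) {
    for (int w = 0; w < WORDS; ++w) {
        const int lo = 64 * w;  // first bit held by word w
        if (bit <= lo) {
            k[w] = 0;           // the whole word lies above the cut
        } else if (bit < lo + 64) {
            k[w] &= (static_cast<uint64_t>(1) << (bit - lo)) - 1;
        }
    }
    return k;
}
```

With kmerLength = 31, the "first base known" field would sit at bits 62-63
and the first base itself at bits 64-65, in the second word. A hash function
would then work on clearFromBit(k, 2*kmerLength) instead of on k.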

>>> Adding this will use 2 extra bits, making the max kmer length for one
>>> 64-bit value 31 bases
>> 4 bits, not 2 bits
> 
> 2 additional bits. Consider that a colour-space sequence with starting
> base is the same length as the equivalent base-space sequence [with
> starting base]:
> 
> T 0122123
> T TGAGTCG
> 
> The first base from both colour-space and base-space is stored in the
> 'first base' location in the modified k-mer. The k-mer length is 1 +
> the number of bases stored in the remainder of the array.
> 
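
The correspondence in the example above can be checked with a short sketch.
It assumes the 2-bit codes A=0, C=1, G=2, T=3, under which each colour is the
XOR of the codes of two adjacent bases; decodeColourSpace is a hypothetical
helper written for illustration, not a function from Ray.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode a colour-space read given its known first base.
// Each colour digit is XORed with the previous base code to obtain
// the next base code (A=0, C=1, G=2, T=3).
inline std::string decodeColourSpace(char firstBase, const std::string& colours) {
    const std::string alphabet = "ACGT";
    std::size_t prev = alphabet.find(firstBase);
    assert(prev != std::string::npos);
    std::string bases;
    for (char c : colours) {
        assert(c >= '0' && c <= '3');
        prev ^= static_cast<std::size_t>(c - '0');
        bases += alphabet[prev];
    }
    return bases;
}
```

Calling decodeColourSpace('T', "0122123") returns "TGAGTCG", which matches
the pair of sequences quoted above.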
>> Routines for k-mers are in core/common_functions.h and
>> structures/Kmer.h -- it would not call that 'all over the place' given
>> the ~20k lines of Ray.
> 
> This is very much a personal style issue, but code is more readable
> and easier for me to work with if class-specific things go into the
> particular class they are working on -- this is why I've been changing
> code locations as I work through the code. Certainly, if these k-mer
> routines were only in common_functions/Kmer, it would be almost no
> problem to change the k-mer format. However, just doing a search in
> the code for getU64/setU64 demonstrates that assumptions about a
> particular k-mer format are made in a few other places in the code:
> 

I totally agree with you, but keep in mind that one month ago the Kmer class
did not exist.

Back in January 2010, I made the design decision to limit the k-mer length
to less than 32.

All the k-mers were passed around as plain uint64_t variables.

So having k-mer-related routines in two files is basically just a legacy of
that design.



> $ grep -rl '\(getU64\|setU64\)' code/* unit-tests/*
> code/assembler/KmerAcademyBuilder.cpp
> code/assembler/VerticesExtractor.cpp
> code/assembler/FusionData.cpp
> code/communication/MessageProcessor.cpp
> code/core/common_functions.cpp
> code/core/common_functions.h
> code/structures/Kmer.cpp
> code/structures/Kmer.h
> unit-tests/test_uniform.cpp
> 
>> #define RAY_NUCLEOTIDE_A 0 /* ~00 == 11 */
>> #define RAY_NUCLEOTIDE_C 1 /* ~01 == 10 */
>> #define RAY_NUCLEOTIDE_G 2 /* ~10 == 01 */
>> #define RAY_NUCLEOTIDE_T 3 /* ~11 == 00 */
> 
> Just as another warning, I tripped myself up when I was adjusting the
> k-mer format because when I think arrays, I think lower number at the
> left, but our number system has the lower number on the right. This
> also applies to declaring binary numbers in c++ (i.e. 0bXXXXX), so
> 0b1100 is 12d, rather than 3d.
>

"It is common to assign each bit a position number, ranging from zero to N-1, 
where N is the number of bits in the binary representation used. Normally, this 
is simply the exponent for the corresponding bit weight in base-2."

- http://en.wikipedia.org/wiki/Most_significant_bit
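
To make the bit numbering concrete, here is a small sketch. It assumes that
nucleotide i of a k-mer occupies bit positions 2*i and 2*i+1 of a uint64_t;
that placement is an assumption for illustration. The helper names
nucleotideAt and complement are hypothetical. The codes are the
RAY_NUCLEOTIDE_* values quoted above.

```cpp
#include <cassert>
#include <cstdint>

// Nucleotide i lives at bit positions 2*i (low) and 2*i+1 (high), so
// position 0 is the rightmost pair when the word is written in binary.
inline unsigned nucleotideAt(uint64_t word, int i) {
    assert(i >= 0 && i < 32);
    return static_cast<unsigned>((word >> (2 * i)) & 3u);
}

// With A=0, C=1, G=2, T=3, the complement is a NOT on the 2-bit code.
inline unsigned complement(unsigned code) {
    return ~code & 3u;
}
```

For the value 12 (binary 1100), nucleotideAt(12, 0) is 0 (A) and
nucleotideAt(12, 1) is 3 (T): the pair written on the right has the lower
position number.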


>> you need to change some code before claiming a victory.
> 
> Almost done, I think. The unit tests are (finally) passing, but I
> still need to implement extracting a colour-space k-mer from the
> middle of a read, and add in my own unit-tests to make sure my
> additional code is doing the right thing. My current thoughts are
> that the first-base would only be filled in for the first 20 or
> so bases of a read (customisable, of course).
> 
> It would also be nice if comparison functions (.isEqual, etc.) filled
> in unknown bases if tested against something that matched everywhere
> else. That would probably mean not using operators, because they
> assume that the LHS/RHS values aren't modified in the comparison
> process -- this is a good assumption to hold onto, so perhaps instead
> of using those functions, there should be something like a
> '.equalsAndCopyKnownFirstBase(Kmer a)' function.
> 
> -- David Eccles (gringer)
> 
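
For reference, the comparison proposed above could look roughly like the
sketch below. ColourKmer is a hypothetical, simplified type (one 64-bit word
of packed colours plus an explicit first-base field), and the function is
written as a free function taking two references rather than as the member
function suggested in the quoted text; none of this is Ray's actual code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified colour-space k-mer.
struct ColourKmer {
    bool firstBaseKnown;
    unsigned firstBase;  // 0..3, meaningful only if firstBaseKnown
    uint64_t colours;    // packed 2-bit colours
};

// Returns true if the two k-mers are compatible. When exactly one of
// them knows its first base, that base is copied into the other one.
// Nothing is modified when the function returns false.
inline bool equalsAndCopyKnownFirstBase(ColourKmer& a, ColourKmer& b) {
    assert(!a.firstBaseKnown || a.firstBase < 4);
    assert(!b.firstBaseKnown || b.firstBase < 4);
    if (a.colours != b.colours) return false;
    if (a.firstBaseKnown && b.firstBaseKnown) return a.firstBase == b.firstBase;
    if (a.firstBaseKnown) {
        b.firstBase = a.firstBase;
        b.firstBaseKnown = true;
    } else if (b.firstBaseKnown) {
        a.firstBase = b.firstBase;
        a.firstBaseKnown = true;
    }
    return true;
}
```

Keeping this as a named function, separate from operator==, preserves the
usual expectation that comparison operators do not modify their operands.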


Good luck with your fork !

                                                     Sébastien

