Hi!

> ________________________________________
> From: Eccles, David [david.ecc...@mpi-muenster.mpg.de]
> Sent: 5 July 2011 06:55
> To: Sébastien Boisvert
> Cc: denovoassembler-users@lists.sourceforge.net
> Subject: RE: Kmer formats and colour-space / bit logic
>
>> p.s.: To decode color-space manually, I use the automaton in panel B of
>> http://www.ploscompbiol.org/article/slideshow.action?uri=info:doi/10.1371/journal.pcbi.1000386&imageURI=info:doi/10.1371/journal.pcbi.1000386.g002
>
> Yes, thanks for that. I saw this link in the code, and it's now also
> what I refer back to when thinking about colour-space transforms.
>
>> Color-space is not necessary I think,
>> m_parameters->getColorSpaceMode does that already.
>
> But then you can't do nifty tricks like matching colour-space to
> base-space, which *can* be done by using a different k-mer format.
>
What do you mean exactly here?

>> So we need 3 bits:
>> 1 bit: is the first base known ?
>> 1 bit for padding
>> 2 bits: if so, what is the first base ?
>
> I realised that only 1 extra bit would be necessary, but given that
> it's much easier to have 2-bit alignments, it made sense to have every
> bit used (i.e. no padding bits). Hence the first base known +
> colour-space flag. I also realised that there's still unused bits (for
> an unknown first base in colour-space), so I've used those bits for a
> checksum -- it makes the code a bit more complex, but has the benefit
> of being able to identify data corruption.
>

I don't see the point of computing checksums for k-mers, because the only
data that are communicated travel through the message-passing interface,
and I believe the underlying transport layers (TCP, Infiniband, or
another one) already verify data integrity.

>> I agree with you overall, but I think it would be easier to put this
>> information in the last 4 bits of a Kmer (a k-mer being an array of
>> uint64_t).
>
> So in positions 60-63 when using 1 64-bit number, positions 125-128
> when using 2, etc.? That means the location of the flags is less easy
> to determine. I suppose you could put them always in positions 60-63
> (i.e. at the end of the first array entry), but that's pretty much the
> same as positions 0-3.
>

The location is easy to find: its starting bit is basically
2*kmerLength, assuming kmerLength+2<=MAXMERLENGTH. I know that doing it
this way would not break the code; I think you would just need to change
the hashing functions to reset (set to 0) all the fields starting at
2*kmerLength in a Kmer.

>>> Adding this will use 2 extra bits, making the max kmer length for one
>>> 64-bit value 31 bases
>> 4 bits, not 2 bits
>
> 2 additional bits.
> Consider that a colour-space sequence with starting
> base is the same length as the equivalent base-space sequence [with
> starting base]:
>
> T 0122123
> T TGAGTCG
>
> The first base from both colour-space and base-space is stored in the
> 'first base' location in the modified k-mer. The k-mer length is 1 +
> the number of bases stored in the remainder of the array.
>
>> Routines for k-mers are in core/common_functions.h and
>> structures/Kmer.h -- I would not call that 'all over the place' given
>> the ~20k lines of Ray.
>
> This is very much a personal style issue, but code is more readable
> and easier for me to work with if class-specific things go into the
> particular class they are working on -- this is why I've been changing
> code locations as I work through the code. Certainly, if these k-mer
> routines were only in common_functions/Kmer, it would be almost no
> problem to change the k-mer format. However, just doing a search in
> the code for getU64/setU64 demonstrates that assumptions about a
> particular k-mer format are made in a few other places in the code:
>

I totally agree with you, but keep in mind that 1 month ago the class
Kmer did not exist. And in January 2010 I had made the design decision
of limiting k-mer length below 32. All the k-mers were running around
as uint64_t variables. So having k-mer-related routines in two files is
basically just legacy.
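
For reference, the colour-space example quoted above (T 0122123 giving
T TGAGTCG) can be reproduced with a few lines of C++. This is only a
minimal sketch, not code from Ray: the function name decodeColourSpace is
hypothetical, and it assumes the usual 2-bit codes A=0, C=1, G=2, T=3
(the same values as the RAY_NUCLEOTIDE_* defines), under which the next
base is the previous base XOR the colour.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of Ray): decode a colour-space read given
// its known first base. Assumes the 2-bit codes A=0, C=1, G=2, T=3, so that
// next base code = previous base code XOR colour.
std::string decodeColourSpace(char firstBase, const std::string& colours) {
    const std::string alphabet = "ACGT";
    std::string::size_type previous = alphabet.find(firstBase);
    assert(previous != std::string::npos);  // first base must be A, C, G or T

    std::string bases;
    for (char colour : colours) {
        assert(colour >= '0' && colour <= '3');  // colours are digits 0..3
        previous ^= static_cast<std::string::size_type>(colour - '0');
        bases += alphabet[previous];
    }
    return bases;  // one decoded base per colour; first base not included
}
```

With these assumptions, decodeColourSpace('T', "0122123") returns
"TGAGTCG", which matches the pair of sequences in David's example.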
> $ grep -rl '\(getU64\|setU64\)' code/* unit-tests/*
> code/assembler/KmerAcademyBuilder.cpp
> code/assembler/VerticesExtractor.cpp
> code/assembler/FusionData.cpp
> code/communication/MessageProcessor.cpp
> code/core/common_functions.cpp
> code/core/common_functions.h
> code/structures/Kmer.cpp
> code/structures/Kmer.h
> unit-tests/test_uniform.cpp
>
>> #define RAY_NUCLEOTIDE_A 0 /* ~00 == 11 */
>> #define RAY_NUCLEOTIDE_C 1 /* ~01 == 10 */
>> #define RAY_NUCLEOTIDE_G 2 /* ~10 == 01 */
>> #define RAY_NUCLEOTIDE_T 3 /* ~11 == 00 */
>
> Just as another warning, I tripped myself up when I was adjusting the
> k-mer format because when I think arrays, I think lower number at the
> left, but our number system has the lower number on the right. This
> also applies to declaring binary numbers in C++ (i.e. 0bXXXXX), so
> 0b1100 is 12d, rather than 3d.
>

"It is common to assign each bit a position number, ranging from zero to
N-1, where N is the number of bits in the binary representation used.
Normally, this is simply the exponent for the corresponding bit weight
in base-2."
- http://en.wikipedia.org/wiki/Most_significant_bit

>> you need to change some code before claiming a victory.
>
> Almost done, I think. The unit tests are (finally) passing, but I
> still need to implement extracting a colour-space k-mer from the
> middle of a read, and add in my own unit-tests to make sure my
> additional code is doing the right thing. My current thoughts are
> that the first-base would only be filled in for the first 20 or
> so bases of a read (customisable, of course).
>
> It would also be nice if comparison functions (.isEqual, etc.) filled
> in unknown bases if tested against something that matched everywhere
> else.
> That would probably mean not using operators, because they
> assume that the LHS/RHS values aren't modified in the comparison
> process -- this is a good assumption to hold onto, so perhaps instead
> of using those functions, there should be something like a
> '.equalsAndCopyKnownFirstBase(Kmer a)' function.
>
> -- David Eccles (gringer)
>

Good luck with your fork!

Sébastien

_______________________________________________
Denovoassembler-users mailing list
Denovoassembler-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users