> p.s.: To decode color-space manually, I use the automate in panel B of
> http://www.ploscompbiol.org/article/slideshow.action?uri=info:doi/10.1371/journal.pcbi.1000386&imageURI=info:doi/10.1371/journal.pcbi.1000386.g002
Yes, thanks for that. I saw this link in the code, and it's now also
what I refer back to when thinking about colour-space transforms.

> Color-space is not necessary I think,
> m_parameters->getColorSpaceMode does that already.

But then you can't do nifty tricks like matching colour-space to
base-space, which *can* be done by using a different k-mer format.

> So we need 3 bits:
> 1 bit: is the first base known ?
> 1 bit for padding
> 2 bits: if so, what is the first base ?

I realised that only 1 extra bit would be necessary, but given that
it's much easier to have 2-bit alignments, it made sense to have every
bit used (i.e. no padding bits). Hence the first base known +
colour-space flag. I also realised that there are still unused bits
(for an unknown first base in colour-space), so I've used those bits
for a checksum -- it makes the code a bit more complex, but has the
benefit of being able to identify data corruption.

> I agree with you overall, but I think it would be easier to put these
> informations in the last 4 bits of a Kmer (a k-mer being an array of
> uint64_t).

So in positions 60-63 when using one 64-bit number, positions 124-127
when using two, etc.? That means the location of the flags is less easy
to determine. I suppose you could always put them in positions 60-63
(i.e. at the end of the first array entry), but that's pretty much the
same as positions 0-3.

>> Adding this will use 2 extra bits, making the max kmer length for one
>> 64-bit value 31 bases
> 4 bits, not 2 bits

2 additional bits. Consider that a colour-space sequence with starting
base is the same length as the equivalent base-space sequence [with
starting base]:

T 0122123
T TGAGTCG

The first base from both colour-space and base-space is stored in the
'first base' location in the modified k-mer. The k-mer length is 1 +
the number of bases stored in the remainder of the array.
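To make the example above concrete, here is a minimal sketch (not code
from Ray) of decoding a colour-space read from its known first base,
plus the length arithmetic for a 4-bit header. It relies on the
A=0, C=1, G=2, T=3 codes, under which each colour is the XOR of two
adjacent base codes. The names decodeColourSpace, maxKmerLength and
kHeaderBits are hypothetical, and the 4-bit header layout (2 flag bits
+ 2 first-base bits) is an assumption taken from the discussion above.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// 2-bit nucleotide codes: A=0, C=1, G=2, T=3.
inline int baseToCode(char base) {
    switch (base) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default: throw std::invalid_argument("unknown base");
    }
}

inline char codeToBase(int code) {
    return "ACGT"[code & 3];
}

// Hypothetical helper: decode colours ('0'..'3') given the known first base.
// With the codes above, colour = previous XOR next, so next = previous XOR colour.
inline std::string decodeColourSpace(char firstBase, const std::string& colours) {
    std::string bases;
    bases.reserve(colours.size());
    int previous = baseToCode(firstBase);
    for (char c : colours) {
        if (c < '0' || c > '3') {
            throw std::invalid_argument("unknown colour");
        }
        previous ^= (c - '0');
        bases.push_back(codeToBase(previous));
    }
    return bases;
}

// Assumed layout: 4 header bits per k-mer (2 flag bits + 2 bits of first base).
// The first base counts towards the k-mer length, so only 2 bits are "extra".
constexpr int kHeaderBits = 4;

constexpr int maxKmerLength(int words) {
    return 1 + (64 * words - kHeaderBits) / 2;
}

static_assert(maxKmerLength(1) == 31, "one 64-bit word holds 31 bases");
```

With this sketch, decodeColourSpace('T', "0122123") gives "TGAGTCG",
matching the pair of sequences above.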
> Routines for k-mers are in core/common_functions.h and
> structures/Kmer.h -- it would not call that 'all over the place' given
> the ~20k lines of Ray.

This is very much a personal style issue, but code is more readable and
easier for me to work with if class-specific things go into the
particular class they are working on -- this is why I've been changing
code locations as I work through the code. Certainly, if these k-mer
routines were only in common_functions/Kmer, it would be almost no
problem to change the k-mer format. However, just doing a search in the
code for getU64/setU64 demonstrates that assumptions about a particular
k-mer format are made in a few other places in the code:

$ grep -rl '\(getU64\|setU64\)' code/* unit-tests/*
code/assembler/KmerAcademyBuilder.cpp
code/assembler/VerticesExtractor.cpp
code/assembler/FusionData.cpp
code/communication/MessageProcessor.cpp
code/core/common_functions.cpp
code/core/common_functions.h
code/structures/Kmer.cpp
code/structures/Kmer.h
unit-tests/test_uniform.cpp

> #define RAY_NUCLEOTIDE_A 0 /* ~00 == 11 */
> #define RAY_NUCLEOTIDE_C 1 /* ~01 == 10 */
> #define RAY_NUCLEOTIDE_G 2 /* ~10 == 01 */
> #define RAY_NUCLEOTIDE_T 3 /* ~11 == 00 */

Just as another warning, I tripped myself up when I was adjusting the
k-mer format because when I think arrays, I think lower number at the
left, but our number system has the lower number on the right. This
also applies to declaring binary numbers in C++ (i.e. 0bXXXXX), so
0b1100 is 12d, rather than 3d.

> you need to change some code before claiming a victory.

Almost done, I think. The unit tests are (finally) passing, but I still
need to implement extracting a colour-space k-mer from the middle of a
read, and add in my own unit tests to make sure my additional code is
doing the right thing. My current thoughts are that the first base
would only be filled in for the first 20 or so bases of a read
(customisable, of course).
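Going back to the bit-order warning and the quoted defines, here is a
small sketch of the two points: complementing a base by flipping its
two bits, and base 0 living in the rightmost (least significant) bits.
The constants mirror the RAY_NUCLEOTIDE_* values; complementCode,
getBase and setBase are hypothetical helpers, and the "base i occupies
bits 2i and 2i+1" layout is an assumption for illustration, not
necessarily how Ray stores a k-mer.

```cpp
#include <cstdint>

// Same values as the RAY_NUCLEOTIDE_* defines quoted above.
constexpr std::uint64_t NUCLEOTIDE_A = 0;
constexpr std::uint64_t NUCLEOTIDE_C = 1;
constexpr std::uint64_t NUCLEOTIDE_G = 2;
constexpr std::uint64_t NUCLEOTIDE_T = 3;

// Complement: flip both bits and keep only the low two (~00 == 11, etc.).
constexpr std::uint64_t complementCode(std::uint64_t code) {
    return ~code & 3;
}

// Assumed layout: base i occupies bits 2i and 2i+1, so base 0 is in the
// least significant (rightmost) bits of the word.
constexpr std::uint64_t getBase(std::uint64_t word, int i) {
    return (word >> (2 * i)) & 3;
}

constexpr std::uint64_t setBase(std::uint64_t word, int i, std::uint64_t code) {
    return (word & ~(std::uint64_t(3) << (2 * i))) | ((code & 3) << (2 * i));
}

// 0b1100 (a C++14 binary literal) is 12: the "11" pair is base 1, not base 0.
static_assert(0b1100 == 12, "binary literals read most significant bit first");
static_assert(getBase(0b1100, 0) == NUCLEOTIDE_A, "rightmost pair is base 0");
static_assert(getBase(0b1100, 1) == NUCLEOTIDE_T, "next pair to the left is base 1");
static_assert(complementCode(NUCLEOTIDE_C) == NUCLEOTIDE_G, "C pairs with G");
```

Under that assumed layout, reading a literal left to right gives the
bases in reverse order, which is the trap described above.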
It would also be nice if comparison functions (.isEqual, etc.) filled
in unknown bases if tested against something that matched everywhere
else. That would probably mean not using operators, because they assume
that the LHS/RHS values aren't modified in the comparison process --
this is a good assumption to hold onto, so perhaps instead of using
those functions, there should be something like a
'.equalsAndCopyKnownFirstBase(Kmer a)' function.

--
David Eccles (gringer)

_______________________________________________
Denovoassembler-users mailing list
Denovoassembler-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/denovoassembler-users