> On 09.05.2017 at 09:38, Ketil Malde <ke...@malde.org> wrote:
>
>> What is the ratio of used/unused keys in the final associative
>> structure? That is, how many k-mers do you expect not to exist in your
>> DNA sequence? Maybe one could pre-compress the key set by knowing
>> which k-1-mers exist, then use a simple Vector, accepting some space
>> wasted?
>
> Perhaps it is possible to construct perfect hashes for k+1-mers if you
> know the k-mer counts?
>
>> Couldn't a suffix tree or suffix array be used for k-mer counting?
>> You'd just need to prune the tree at level k.
>
> Yes, that's clearly possible. Another common (but approximate) approach
> is to use a Bloom filter for the count-1 k-mers (and only store explicit
> counts -- two or more -- for k-mers already identified as seen by the
> Bloom filter)

I'm not familiar with hash tables, but I am a fan of trie structures. If
you construct a trie of all the 32-mers, you get a lot of sharing between
the keys; the smaller the alphabet, the more sharing. In order not to
waste space, maybe cram two nucleotides into a byte. You cannot do
anything about the memory footprint of the counts, as far as I know.
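A minimal sketch of the "two nucleotides per byte" idea, assuming a
4-bit code per base (my own hypothetical encoding, one bit per base so
there is room for ambiguity codes; real IUPAC encodings differ):

```haskell
import Data.Word (Word8)
import Data.Bits ((.|.), (.&.), shiftL, shiftR)

-- Hypothetical 4-bit code per nucleotide (one bit per base,
-- so ambiguity codes like N = 15 could be represented too).
nucCode :: Char -> Word8
nucCode 'A' = 1
nucCode 'C' = 2
nucCode 'G' = 4
nucCode 'T' = 8
nucCode _   = 15  -- N / any

-- Cram two nucleotides into one byte: high nibble, low nibble.
packPair :: Char -> Char -> Word8
packPair x y = (nucCode x `shiftL` 4) .|. nucCode y

-- Recover the two 4-bit codes from a packed byte.
unpackPair :: Word8 -> (Word8, Word8)
unpackPair b = (b `shiftR` 4, b .&. 0x0F)
```

With a 2-bit code for an unambiguous A/C/G/T alphabet one could even fit
four bases per byte, at the cost of the ambiguity codes.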
> But if I'm going to implement something here, I think I'll go with a
> custom hash table of 64-bit values, with maybe 48 bits of key and 16
> bits of value (counts), and use an IntMap to store whatever prefix is
> necessary (e.g. 24 bits for k=32), as well as another Map to store
> overflowing counts. Or something like that.

Writing something yourself would keep the code pure: no IO, no freeze,
etc. In your situation that might be an advantage. I myself was once in
a situation where I did not believe a pure pointer structure would fit
in memory, but it did: I wrote a Lempel-Ziv index structure that keeps
the trie from its construction in memory, so that one can use it for
full-text search. Despite many simplistic design choices, it can hold at
least human chr1 in a few GB of RAM, and the compressed file is indeed
smaller than the FASTA. (I have never understood why all aligners simply
decompress a huge gzipped FASTQ file and align read by read, instead of
leveraging the redundancy information already present in the
compression.)

Olaf

> -k
> --
> If I haven't seen further, it is by standing in the footprints of giants

_______________________________________________
Biohaskell mailing list
Biohaskell@biohaskell.org
http://biohaskell.org/cgi-bin/mailman/listinfo/biohaskell
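P.S. The 48-bit-key / 16-bit-count packing Ketil describes could be
sketched like this (function names and the saturating increment are my
own assumptions, not his design):

```haskell
import Data.Word (Word64, Word16)
import Data.Bits ((.|.), (.&.), shiftL, shiftR)

-- Pack a 48-bit k-mer suffix and a 16-bit count into one Word64:
-- key in the high 48 bits, count in the low 16 bits.
mkEntry :: Word64 -> Word16 -> Word64
mkEntry key cnt = (key `shiftL` 16) .|. fromIntegral cnt

entryKey :: Word64 -> Word64
entryKey e = e `shiftR` 16

entryCount :: Word64 -> Word16
entryCount e = fromIntegral (e .&. 0xFFFF)

-- Increment the count, saturating at 0xFFFF; a count stuck at the
-- maximum would signal that the real count lives in the overflow Map.
bumpCount :: Word64 -> Word64
bumpCount e
  | entryCount e == maxBound = e
  | otherwise                = e + 1
```

The remaining high bits of the k-mer (e.g. 24 bits for k=32 at 2 bits
per base, minus the 48 bits stored here... well, whatever prefix is
left) would then index the IntMap of such tables, as Ketil suggests.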