> What is the ratio of used/unused keys in the final associative > structure? That is, how many k-mers do you expect not to exist in your > DNA sequence? Maybe one could pre-compress the key set by knowing > which k-1-mers exist, then use a simple Vector, accepting some space > wasted?
Perhaps it is possible to construct perfect hashes for k+1-mers if you know the k-mer counts? > Couldn't a suffix tree of suffix array be used for k-mer counting? > You'd just need to prune the tree at level k. Yes, that's clearly possible. Another common (but approximate) approach is to use a Bloom filter for the count 1 k-mers (and only store explicit counts -- two or more -- for k-mers already identified as seen by the Bloom filter) But if I'm going to implement something here, I think I'll go with a custom hash table of 64bit values, with maybe 48 bits of key and 16 bits of value (counts), and use an IntMap to store whatever prefix necessary (e.g. 24 bits for k=32), as well as another Map to store overflowing counts. Or something like that. -k -- If I haven't seen further, it is by standing in the footprints of giants _______________________________________________ Biohaskell mailing list Biohaskell@biohaskell.org http://biohaskell.org/cgi-bin/mailman/listinfo/biohaskell