> What is the ratio of used/unused keys in the final associative
> structure? That is, how many k-mers do you expect not to exist in your
> DNA sequence? Maybe one could pre-compress the key set by knowing
> which k-1-mers exist, then use a simple Vector, accepting some space
> wasted? 

Perhaps it is possible to construct perfect hashes for k+1-mers if you
know the k-mer counts?

> Couldn't a suffix tree of suffix array be used for k-mer counting?
> You'd just need to prune the tree at level k.  

Yes, that's clearly possible.  Another common (but approximate) approach
is to use a Bloom filter for the count 1 k-mers (and only store explicit
counts -- two or more -- for k-mers already identified as seen by the
Bloom filter)

But if I'm going to implement something here, I think I'll go with a
custom hash table of 64bit values, with maybe 48 bits of key and 16 bits of
value (counts), and use an IntMap to store whatever prefix necessary
(e.g. 24 bits for k=32), as well as another Map to store overflowing
counts. Or something like that.

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants
_______________________________________________
Biohaskell mailing list
Biohaskell@biohaskell.org
http://biohaskell.org/cgi-bin/mailman/listinfo/biohaskell

Reply via email to