> On 09.05.2017 at 09:38, Ketil Malde <ke...@malde.org> wrote:
>
>> What is the ratio of used/unused keys in the final associative
>> structure? That is, how many k-mers do you expect not to exist in your
>> DNA sequence? Maybe one could pre-compress the key set by knowing
>> which k-1-mers exist, then use a simple Vector, accepting some space
>> wasted?
>
> Perhaps it is possible to construct perfect hashes for k+1-mers if you
> know the k-mer counts?
>
>> Couldn't a suffix tree or suffix array be used for k-mer counting?
>> You'd just need to prune the tree at level k.
>
> Yes, that's clearly possible. Another common (but approximate) approach
> is to use a Bloom filter for the count-1 k-mers (and only store explicit
> counts -- two or more -- for k-mers already identified as seen by the
> Bloom filter)

I'm not familiar with hash tables, but I am a fan of trie structures. If
you construct a trie of all the 32-mers, you get a lot of sharing between
the keys; the smaller the alphabet, the more sharing. In order not to
waste space, maybe cram two nucleotides into a byte. You cannot do
anything about the memory footprint of the counts, as far as I know.
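A minimal sketch of the "two nucleotides per byte" idea, assuming a
4-bit code per base (my own hypothetical encoding, one bit per base so
there is room for ambiguity codes; real IUPAC encodings differ):

```haskell
import Data.Word (Word8)
import Data.Bits ((.|.), (.&.), shiftL, shiftR)

-- Hypothetical 4-bit code per nucleotide (one bit per base,
-- so ambiguity codes like N = 15 could be represented too).
nucCode :: Char -> Word8
nucCode 'A' = 1
nucCode 'C' = 2
nucCode 'G' = 4
nucCode 'T' = 8
nucCode _   = 15  -- N / any

-- Cram two nucleotides into one byte: high nibble, low nibble.
packPair :: Char -> Char -> Word8
packPair x y = (nucCode x `shiftL` 4) .|. nucCode y

-- Recover the two 4-bit codes from a packed byte.
unpackPair :: Word8 -> (Word8, Word8)
unpackPair b = (b `shiftR` 4, b .&. 0x0F)
```

With a 2-bit code for an unambiguous A/C/G/T alphabet one could even fit
four bases per byte, at the cost of the ambiguity codes.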
> But if I'm going to implement something here, I think I'll go with a
> custom hash table of 64-bit values, with maybe 48 bits of key and 16
> bits of value (counts), and use an IntMap to store whatever prefix is
> necessary (e.g. 24 bits for k=32), as well as another Map to store
> overflowing counts. Or something like that.

Writing something yourself would keep the code pure: no IO, no freeze,
etc. In your situation that might be an advantage. I myself was once in
a situation where I did not believe a pure pointer structure would fit
in memory, but it did: I wrote a Lempel-Ziv index structure that keeps
the trie from its construction in memory, so that one can use it for
full-text search. Despite many simplistic design choices, it can hold at
least human chr1 in a few GB of RAM, and the compressed file is indeed
smaller than the FASTA. (I have never understood why all aligners simply
decompress a huge gzipped FASTQ file and align read by read, instead of
leveraging the redundancy information already present in the
compression.)

Olaf

> -k
> --
> If I haven't seen further, it is by standing in the footprints of giants

_______________________________________________
Biohaskell mailing list
Biohaskell@biohaskell.org
http://biohaskell.org/cgi-bin/mailman/listinfo/biohaskell
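P.S. The 48-bit-key / 16-bit-count packing Ketil describes could be
sketched like this (function names and the saturating increment are my
own assumptions, not his design):

```haskell
import Data.Word (Word64, Word16)
import Data.Bits ((.|.), (.&.), shiftL, shiftR)

-- Pack a 48-bit k-mer suffix and a 16-bit count into one Word64:
-- key in the high 48 bits, count in the low 16 bits.
mkEntry :: Word64 -> Word16 -> Word64
mkEntry key cnt = (key `shiftL` 16) .|. fromIntegral cnt

entryKey :: Word64 -> Word64
entryKey e = e `shiftR` 16

entryCount :: Word64 -> Word16
entryCount e = fromIntegral (e .&. 0xFFFF)

-- Increment the count, saturating at 0xFFFF; a count stuck at the
-- maximum would signal that the real count lives in the overflow Map.
bumpCount :: Word64 -> Word64
bumpCount e
  | entryCount e == maxBound = e
  | otherwise                = e + 1
```

The remaining high bits of the k-mer (e.g. 24 bits for k=32 at 2 bits
per base, minus the 48 bits stored here... well, whatever prefix is
left) would then index the IntMap of such tables, as Ketil suggests.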