What is the ratio of used to unused keys in the final associative structure?
That is, how many k-mers do you expect not to occur in your DNA sequence? Maybe
one could pre-compress the key set by knowing which (k-1)-mers exist, then use
a simple Vector, accepting some wasted space?
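
To put rough numbers on that question, here is a small sketch. The names
keySpace, occupancy and candidateKeys are made up for illustration, and the
choice of k = 15 and k = 32 is my assumption; only the figure of about 10^9
distinct values comes from the quoted message.

```haskell
-- Number of possible k-mers over the alphabet {A, C, G, T}.
keySpace :: Int -> Integer
keySpace k = 4 ^ k

-- Fraction of the key space that is actually used, given the number
-- of distinct k-mers observed in the data.
occupancy :: Int -> Integer -> Double
occupancy k distinct = fromIntegral distinct / fromIntegral (keySpace k)

-- Upper bound on the number of distinct k-mers when the number of
-- distinct (k-1)-mers is known: every k-mer that occurs extends an
-- occurring (k-1)-mer by one of four nucleotides.
candidateKeys :: Integer -> Integer
candidateKeys distinctKm1 = 4 * distinctKm1
```

For k = 15 the key space is 4^15 = 2^30, roughly 10^9, so a dense Vector
indexed by the key would be mostly full. For k = 32 it is 4^32 = 2^64, and
10^9 distinct values fill only a tiny fraction of it, so a dense Vector is
out of the question without some compression of the key set.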

Couldn't a suffix tree or suffix array be used for k-mer counting? You'd just 
need to prune the tree at level k. 
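
A naive sketch of what I mean, not a real suffix array: sort all suffixes
truncated to length k, and equal k-mers end up next to each other, so the
counts are just run lengths. The function name kmerCountsSorted is
hypothetical, and a String is used only to keep the example short; a proper
implementation would sort suffix indices rather than materialise the words.

```haskell
import Data.List (group, sort, tails)

-- Count k-mers by sorting every suffix truncated to length k and
-- measuring the runs of equal words. Suffixes shorter than k (at the
-- end of the sequence) are dropped.
kmerCountsSorted :: Int -> String -> [(String, Int)]
kmerCountsSorted k s =
  [ (w, length g) | g@(w : _) <- group (sort truncated) ]
  where
    truncated = [ w | t <- tails s, let w = take k t, length w == k ]
```

In GHCi, kmerCountsSorted 2 "ACACG" gives [("AC",2),("CA",1),("CG",1)].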

Olaf
> On 04.05.2017 at 12:05, Ketil Malde <ke...@malde.org> wrote:
> 
> 
>> I know it may be a long shot, but did you consider using columnar data store 
>> like Apache Arrow?
> 
> Arrow might be an option, but is there a Haskell interface?  (Googling
> gives the obvious hits regarding arrows, and Google doesn't seem to care
> about me adding +apache to the search, it gives me results where
> "+apache" is overstruck.)
> 
>> Without knowing more about your application it is a bit difficult to produce 
>> more hints.
>> What is your application?
> 
> The short story is that I extract a number of 64-bit values from my
> data, and want to maintain frequency counts for each unique value.  So
> there'll be on the order of 10^9 (plus/minus an order of magnitude)
> unique values, with counts ranging from one to a few million (and large
> values being rare).
> 
> The long explanation is that I'm doing k-mer counts for molecular sequences,
> breaking DNA sequence data into overlapping words of fixed size (the
> parameter k), and counting their occurrences.  I encode them as Word64,
> using two bits per nucleotide (the alphabet is A, C, G, and T).  This is
> of course a fairly staple thing to do, and there is no lack of
> alternative programs that do it - but I'd like mine to work anyway, and
> it annoys me to have run into this particular bug, whether the cause is my
> own fault, the Judy FFI, the GHC runtime or libraries, the libjudy code,
> GHC compilation issues, or a hardware error.
> 
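
For readers who have not seen the encoding, the scheme described above could
look roughly like this. It is only a sketch under my own assumptions, not
Ketil's code: the names code, encode, windows and kmerCounts are invented,
the mapping A=0, C=1, G=2, T=3 is one arbitrary choice, and Data.Map.Strict
from containers stands in for the Judy array. k is assumed to be between 1
and 32.

```haskell
import Data.Bits (shiftL, (.|.))
import Data.List (foldl', tails)
import qualified Data.Map.Strict as M
import Data.Word (Word64)

-- Two-bit code for a nucleotide; Nothing for anything outside ACGT.
code :: Char -> Maybe Word64
code 'A' = Just 0
code 'C' = Just 1
code 'G' = Just 2
code 'T' = Just 3
code _   = Nothing

-- Pack a word of at most 32 nucleotides into a Word64, two bits per
-- nucleotide, with the first nucleotide in the most significant position.
encode :: String -> Maybe Word64
encode w
  | length w > 32 = Nothing
  | otherwise     = foldl' step (Just 0) w
  where
    step acc c = do
      a <- acc
      x <- code c
      pure ((a `shiftL` 2) .|. x)

-- All overlapping windows of length k.
windows :: Int -> String -> [String]
windows k s = [ w | t <- tails s, let w = take k t, length w == k ]

-- Frequency count per encoded k-mer; windows containing characters
-- outside ACGT are skipped.
kmerCounts :: Int -> String -> M.Map Word64 Int
kmerCounts k s =
  M.fromListWith (+) [ (key, 1) | Just key <- map encode (windows k s) ]
```

With this mapping, encode "ACGT" is Just 27, and kmerCounts 2 "ACACG" is
fromList [(1,2),(4,1),(6,1)], that is AC twice, CA once and CG once.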
> -k
> -- 
> If I haven't seen further, it is by standing in the footprints of giants
> _______________________________________________
> Biohaskell mailing list
> Biohaskell@biohaskell.org
> http://biohaskell.org/cgi-bin/mailman/listinfo/biohaskell
