What is the ratio of used to unused keys in the final associative structure? That is, how many of the 4^k possible k-mers do you expect not to occur in your DNA sequence? Maybe one could pre-compress the key set by knowing which (k-1)-mers exist, then use a simple Vector, accepting some wasted space?
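
To make the question concrete, here is a rough sketch of how one might measure that ratio for a small k with a dense table. It is not taken from your program: the names encodeBase, kmers, denseCounts and occupancy are made up for illustration, the bit assignment A=0, C=1, G=2, T=3 is an assumption, and characters other than A, C, G, T are simply dropped, which is a simplification. I use an unboxed array from the array package in place of a Vector to keep the example dependency-free.

```haskell
import Data.Array.Unboxed (UArray, accumArray, elems)
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Maybe (mapMaybe)
import Data.Word (Word64)

-- Two bits per nucleotide; the concrete assignment is an assumption.
encodeBase :: Char -> Maybe Word64
encodeBase 'A' = Just 0
encodeBase 'C' = Just 1
encodeBase 'G' = Just 2
encodeBase 'T' = Just 3
encodeBase _   = Nothing

-- All overlapping k-mers of a sequence, packed into Word64 (1 <= k <= 32).
-- Simplification: characters outside ACGT are dropped before windowing.
kmers :: Int -> String -> [Word64]
kmers k s = drop k (scanl step 0 codes)
  where
    codes = mapMaybe encodeBase s
    mask  = if k >= 32 then maxBound else (1 `shiftL` (2 * k)) - 1
    -- Shift the previous window left by one base and append the new one.
    step acc c = ((acc `shiftL` 2) .|. c) .&. mask

-- Dense table with one counter per possible k-mer (4^k entries),
-- so this is only sensible for small k.
denseCounts :: Int -> String -> UArray Int Int
denseCounts k s =
  accumArray (+) 0 (0, 4 ^ k - 1) [ (fromIntegral w, 1) | w <- kmers k s ]

-- (used, unused) keys among the 4^k possible k-mers.
occupancy :: Int -> String -> (Int, Int)
occupancy k s = (used, 4 ^ k - used)
  where
    used = length (filter (> 0) (elems (denseCounts k s)))
```

For example, occupancy 2 "ACGTACGT" gives (4, 12): only AC, CG, GT and TA occur out of 16 possible 2-mers. The table has 4^k entries, so this only works for small k, but running it on a sample of the data would show how sparse the key space is.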
Couldn't a suffix tree or suffix array be used for k-mer counting? You'd just need to prune the tree at level k.

Olaf

> On 04.05.2017 at 12:05, Ketil Malde <ke...@malde.org> wrote:
>
>
>> I know it may be a long shot, but did you consider using columnar data store
>> like Apache Arrow?
>
> Arrow might be an option, but is there a Haskell interface? (Googling
> gives the obvious hits regarding arrows, and Google doesn't seem to care
> about me adding +apache to the search, it gives me result where
> "+apache" is overstruck.)
>
>> Without knowing more about your application it is a bit difficult to produce
>> more hints.
>> What is your application?
>
> The short story is that I extract a number of 64-bit values from my
> data, and want to maintain frequency counts for each unique value. So
> there'll be on the order of 10^9 (plus/minus an order of magnitude)
> unique values, with counts ranging from one to a few million (and large
> values being rare).
>
> The long explanation is that I'm doing k-mer counts for molecular sequences,
> breaking DNA sequence data into overlapping words of fixed size (the
> parameter k), and counting their occurrences. I encode them as Word64,
> using two bits per nucleotide (the alphabet is A, C, G, and T). This is
> of course a fairly staple thing to do, and there is no lack of
> alternative programs that do it - but I'd like mine to work anyway, and
> it annoys me to have run into this particular bug. Whether it is my own
> fault, in the Judy FFI, the GHC runtime or libraries, the libjudy code,
> GHC compilation issues, or a hardware error.
>
> -k
> --
> If I haven't seen further, it is by standing in the footprints of giants
> _______________________________________________
> Biohaskell mailing list
> Biohaskell@biohaskell.org
> http://biohaskell.org/cgi-bin/mailman/listinfo/biohaskell