Thanks for the suggestions: On Friday, December 5, 2014 5:40:57 PM UTC-8, Jason Merrill wrote: > > This is the best you can do if > > 1. Every input in your space of possibilities is equally likely, and > 2. You need to remember every input that you've seen > 3. You need to know the order you saw them in > > If you just need to know "have I seen this input before," and you can > accept some false positives, then bloom filters are a good choice, as you > suggest, and John has an implementation of these in Julia: > https://github.com/johnmyleswhite/BloomFilters.jl > > I've tried using this, but it isn't quite "prime time ready" (https://github.com/johnmyleswhite/BloomFilters.jl/issues/7)
> If you just need to know "how many different inputs have I seen," then the > HyperLogLog is another semi-miraculous data structure for doing that, and > John also has an implementation of these in Julia: > https://github.com/johnmyleswhite/HyperLogLog.jl > > I hadn't hear of this before! But I will certainly look into it! > As you can see, John's been busy :-) > > But probably there are other things you want to query about your data. > There might be good answers to some of those questions too if you ask them. > In general, you might want to read up on Streaming Algorithms: > http://en.wikipedia.org/wiki/Streaming_algorithm > > See also the BioJulia organization. Looks like they already have an > implementation of the 2-bits per base pair idea: > https://github.com/BioJulia/BioSeq.jl > This is currently the approach I'm taking (with my own implementation). > > On Friday, December 5, 2014 5:25:04 PM UTC-8, David Koslicki wrote: >> >> Good suggestion, but I've tried that already, and besides the fact that >> the HDF5 package (https://github.com/timholy/HDF5.jl) doesn't yet >> support Int128, this would result in file sizes upwards of 750Gb (too large >> for my purposes). >> >> On Friday, December 5, 2014 5:19:00 PM UTC-8, Jason Merrill wrote: >>> >>> Here's one possibility: >>> >>> Interpret A, C, T, G as two bit integers, i.e. A=00, C=01, T=10, G=11. A >>> string of up to 50 of these has 2*50=100 bits, so you could store any such >>> string as a unique Int128. >>> >>
