Thanks for the suggestions:

On Friday, December 5, 2014 5:40:57 PM UTC-8, Jason Merrill wrote:
>
> This is the best you can do if
>
> 1. Every input in your space of possibilities is equally likely, and
> 2. You need to remember every input that you've seen
> 3. You need to know the order you saw them in
>
> If you just need to know "have I seen this input before," and you can 
> accept some false positives, then bloom filters are a good choice, as you 
> suggest, and John has an implementation of these in Julia: 
> https://github.com/johnmyleswhite/BloomFilters.jl
>
> I've tried using this, but it isn't quite "prime time ready" 
(https://github.com/johnmyleswhite/BloomFilters.jl/issues/7)
 

> If you just need to know "how many different inputs have I seen," then the 
> HyperLogLog is another semi-miraculous data structure for doing that, and 
> John also has an implementation of these in Julia: 
> https://github.com/johnmyleswhite/HyperLogLog.jl
>
> I hadn't hear of this before! But I will certainly look into it!
 

> As you can see, John's been busy :-)
>
> But probably there are other things you want to query about your data. 
> There might be good answers to some of those questions too if you ask them. 
> In general, you might want to read up on Streaming Algorithms: 
> http://en.wikipedia.org/wiki/Streaming_algorithm
>
> See also the BioJulia organization. Looks like they already have an 
> implementation of the 2-bits per base pair idea: 
> https://github.com/BioJulia/BioSeq.jl
>
This is currently the approach I'm taking (with my own implementation). 

>
> On Friday, December 5, 2014 5:25:04 PM UTC-8, David Koslicki wrote:
>>
>> Good suggestion, but I've tried that already, and besides the fact that 
>> the HDF5 package (https://github.com/timholy/HDF5.jl) doesn't yet 
>> support Int128, this would result in file sizes upwards of 750Gb (too large 
>> for my purposes).
>>
>> On Friday, December 5, 2014 5:19:00 PM UTC-8, Jason Merrill wrote:
>>>
>>> Here's one possibility:
>>>
>>> Interpret A, C, T, G as two bit integers, i.e. A=00, C=01, T=10, G=11. A 
>>> string of up to 50 of these has 2*50=100 bits, so you could store any such 
>>> string as a unique Int128.
>>>
>>

Reply via email to