On Saturday, December 6, 2014 7:56:49 AM UTC-8, Stefan Karpinski wrote: > > On Fri, Dec 5, 2014 at 8:48 PM, David Koslicki <[email protected] > <javascript:>> wrote: > >> Thanks for the suggestions: >> >> On Friday, December 5, 2014 5:40:57 PM UTC-8, Jason Merrill wrote: >>> >>> This is the best you can do if >>> >>> 1. Every input in your space of possibilities is equally likely, and >>> 2. You need to remember every input that you've seen >>> 3. You need to know the order you saw them in >>> >>> If you just need to know "have I seen this input before," and you can >>> accept some false positives, then bloom filters are a good choice, as you >>> suggest, and John has an implementation of these in Julia: >>> https://github.com/johnmyleswhite/BloomFilters.jl >>> >>> I've tried using this, but it isn't quite "prime time ready" ( >> https://github.com/johnmyleswhite/BloomFilters.jl/issues/7) >> > > Implementing your own Bloom filter really shouldn't be too hard. > Alternatively, it might not be too hard to file some issues against John's > package and get it into better working state. If you mention me in an > issue, I can also take a look at things. Having the BloomFilters package > working would probably be a good thing. >
If you look at the link I provided, you'll see that I've previously done exactly as you're suggesting. > > >> See also the BioJulia organization. Looks like they already have an >>> implementation of the 2-bits per base pair idea: https://github.com/ >>> BioJulia/BioSeq.jl >>> >> This is currently the approach I'm taking (with my own implementation). >> > > One issue with this may be the overhead per sequence. I think this package > is really designed for longer sequences since it uses array objects to > store sequences – a perfectly reasonable thing to do. There are a few ways > to avoid this kind of overhead. One would be to use a custom type something > like this: > > immutable ShortNucleotideSeq > bits::Uint128 > end > > > You can store an array of these with no overhead per item – it's just > data. You may want to use the first byte to store the length or something > like that. Another approach would be to use a single long BioSeq "string" > and just implicitly use it to store lots of little sequences. You could > just pick some maximum number of nucleotides per sequence and assume that > each one is of that fixed length. Alternatively, you could store an array > of offsets. > Exactly, and that's why I'm using my own implementation. I have thought using using a single string to store all the smaller strings, but I think it's even more computationally difficult to come up with a single (shortest) string that contains all my specified substrings (and no more). Correct me if I am wrong on this point though, as that would be great!
