On Saturday, December 6, 2014 7:56:49 AM UTC-8, Stefan Karpinski wrote:
>
> On Fri, Dec 5, 2014 at 8:48 PM, David Koslicki <[email protected] 
> <javascript:>> wrote:
>
>> Thanks for the suggestions:
>>
>> On Friday, December 5, 2014 5:40:57 PM UTC-8, Jason Merrill wrote:
>>>
>>> This is the best you can do if
>>>
>>> 1. Every input in your space of possibilities is equally likely, and
>>> 2. You need to remember every input that you've seen
>>> 3. You need to know the order you saw them in
>>>
>>> If you just need to know "have I seen this input before," and you can 
>>> accept some false positives, then bloom filters are a good choice, as you 
>>> suggest, and John has an implementation of these in Julia: 
>>> https://github.com/johnmyleswhite/BloomFilters.jl
>>>
>>> I've tried using this, but it isn't quite "prime time ready" (
>> https://github.com/johnmyleswhite/BloomFilters.jl/issues/7)
>>
>
> Implementing your own Bloom filter really shouldn't be too hard. 
> Alternatively, it might not be too hard to file some issues against John's 
> package and get it into better working state. If you mention me in an 
> issue, I can also take a look at things. Having the BloomFilters package 
> working would probably be a good thing.
>

If you look at the link I provided, you'll see that I've previously done 
exactly as you're suggesting. 

>  
>
>> See also the BioJulia organization. Looks like they already have an 
>>> implementation of the 2-bits per base pair idea: https://github.com/
>>> BioJulia/BioSeq.jl
>>>
>> This is currently the approach I'm taking (with my own implementation). 
>>
>
> One issue with this may be the overhead per sequence. I think this package 
> is really designed for longer sequences since it uses array objects to 
> store sequences – a perfectly reasonable thing to do. There are a few ways 
> to avoid this kind of overhead. One would be to use a custom type something 
> like this:
>
> immutable ShortNucleotideSeq
>     bits::Uint128
> end
>
>
> You can store an array of these with no overhead per item – it's just 
> data. You may want to use the first byte to store the length or something 
> like that. Another approach would be to use a single long BioSeq "string" 
> and just implicitly use it to store lots of little sequences. You could 
> just pick some maximum number of nucleotides per sequence and assume that 
> each one is of that fixed length. Alternatively, you could store an array 
> of offsets.
>
Exactly, and that's why I'm using my own implementation.

I have thought using using a single string to store all the smaller 
strings, but I think it's even more computationally difficult to come up 
with a single (shortest) string that contains all my specified substrings 
(and no more).  Correct me if I am wrong on this point though, as that 
would be great!

Reply via email to