Suggesting an appropriate compression algorithm would require knowing both
how you intend to use the data (so that you don't kill performance by
packing it the wrong way) and what the data looks like (where the "entropy"
http://en.wikipedia.org/wiki/Entropy_%28information_theory%29 lives in the
data).
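To make the entropy point concrete, here is a small sketch (in current Julia syntax; this thread predates 1.0) that estimates the zeroth-order entropy of a read in bits per symbol. For strings over {A,C,T,G} this is at most 2 bits per base, which is why 2-bit packing is the obvious baseline. The function name is illustrative.

```julia
# Zeroth-order (per-symbol) Shannon entropy of a string, in bits.
# A crude way to see how much a simple code could save: for reads over
# {A,C,T,G} the result is at most 2.0 bits per base.
function entropy_bits(s::AbstractString)
    counts = Dict{Char,Int}()
    for c in s
        counts[c] = get(counts, c, 0) + 1   # symbol frequency table
    end
    n = length(s)
    return -sum((v / n) * log2(v / n) for v in values(counts))
end

entropy_bits("ACGTACGT")   # uniform over four bases -> 2.0
entropy_bits("AAAAAAAT")   # skewed -> much less than 2.0
```

If the per-symbol entropy is close to 2 bits, the savings have to come from structure between reads (shared prefixes, repeats) rather than from the symbol distribution itself.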

For example, if you just want a smaller storage format, gzip would probably
do quite well. However, gzip is a stream compressor, and cannot do random
seeks. If there is a lot of redundancy, you could store the strings in a
dictionary of string => count mappings (a Dict{ASCIIString, Int}). If there
are a lot of common prefixes, a prefix tree (a trie) would work well:
similar to the dictionary, but each branch stores a count and pointers to
the next nodes, for example:

type Node          # pre-1.0 syntax; one node per base of the string
  count::Int       # how many strings pass through or end at this node
  A::Node          # one child slot per base; a slot is left #undef
  C::Node          # until that branch is actually taken
  T::Node
  G::Node
  Node() = new(1)  # initialize only the count; children stay #undef
end
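For reference, a runnable sketch of the same idea in current Julia syntax (the block above uses the pre-1.0 `type` keyword). This variant stores children in a Dict rather than four fixed fields, purely for brevity; the names are illustrative.

```julia
# Prefix-tree sketch: shared prefixes are stored once; each terminal
# node counts how many strings ended there.
mutable struct TrieNode
    count::Int
    children::Dict{Char,TrieNode}   # 'A'/'C'/'T'/'G' => child node
    TrieNode() = new(0, Dict{Char,TrieNode}())
end

# Walk (and extend) the tree along `s`, then bump the terminal count.
function addstring!(root::TrieNode, s::AbstractString)
    node = root
    for c in s
        node = get!(() -> TrieNode(), node.children, c)
    end
    node.count += 1
    return node
end

root = TrieNode()
addstring!(root, "ACGT")
addstring!(root, "ACGA")
addstring!(root, "ACGT")   # the "ACG" prefix is stored only once
```

With millions of reads sharing prefixes, each common prefix is stored once, and the per-node count doubles as the string => count mapping from the dictionary approach.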

On Fri Dec 05 2014 at 8:28:05 PM David Koslicki <[email protected]>
wrote:

> Duly noted, though I did get my answer (no) pretty quickly! ;)
> Of course, the main problem is still an issue. But then again, it's kind
> of an "open problem" in bioinformatics (so I don't think this would be the
> correct forum to ask it in).
>
> I appreciate your help!
>
>
> On Friday, December 5, 2014 5:23:56 PM UTC-8, Jason Merrill wrote:
>>
>> As a meta point, beware the XY problem:
>> http://meta.stackexchange.com/a/66378
>>
>> In other words, you'll typically get better answers faster if you start
>> with the broad context, like
>>
>> On Friday, December 5, 2014 5:13:49 PM UTC-8, David Koslicki wrote:
>>>>
>>>> I have strings (on the alphabet {A,C,T,G}) of length 30 to 50. I am
>>>> trying to hash them to save on space (as I have a few million to billion of
>>>> them).
>>>>
>>>
>> And then suggest ideas for how to solve it that you've thought of that
>> don't quite work yet, like
>>
>>
>>>> I know I should be using a bloom filter
>>>> (http://en.wikipedia.org/wiki/Bloom_filter) or some other such
>>>> space-saving data structure, but I'm too lazy/busy/inexperienced to do
>>>> it by hand.
>>>>
>>
>> or
>>
>>>> Is there a built in function that will undo hash()?
>>
