Here's one possibility:
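Interpret A, C, T, G as two-bit integers, i.e. A=00, C=01, T=10, G=11. A
string of up to 50 of these has at most 2*50=100 bits, so you could store any
such string as a unique Int128, as long as you also keep track of its length
(leading A's encode as zero bits and would otherwise be lost).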
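
A minimal sketch of that packing, written in Python only because its integers
are arbitrary-precision; in Julia the same value would fit in an Int128. The
names encode, decode, CODE and BASES are made up for illustration. The leading
marker bit is an assumption added here so that strings of different lengths
(e.g. "AC" and "C") do not collide; it brings the worst case to 101 bits.

```python
# Two-bit code from the message above: A=00, C=01, T=10, G=11.
CODE = {"A": 0b00, "C": 0b01, "T": 0b10, "G": 0b11}
BASES = "ACTG"  # BASES[code] recovers the letter for a two-bit code


def encode(seq: str) -> int:
    """Pack a string over {A,C,T,G} into one integer, two bits per base."""
    value = 1  # marker bit (an added assumption) so leading A's are not lost
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def decode(value: int) -> str:
    """Invert encode(): peel off two bits at a time until only the marker is left."""
    bases = []
    while value > 1:
        bases.append(BASES[value & 0b11])
        value >>= 2
    return "".join(reversed(bases))


seq = "ACTG"
packed = encode(seq)
assert decode(packed) == seq
# Longest case from the thread: 50 bases -> 100 data bits + 1 marker bit.
assert encode("G" * 50).bit_length() == 101
```

Unlike hash(), this mapping is one-to-one, so the "dehash" step is just
decode(). It saves space relative to one byte per character, but it is an
exact encoding rather than a hash.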
On Friday, December 5, 2014 5:13:49 PM UTC-8, David Koslicki wrote:
>
> I have strings (on the alphabet {A,C,T,G}) of length 30 to 50. I am trying
> to hash them to save on space (as I have a few million to a few billion of them).
> I know I should be using a bloom filter (
> http://en.wikipedia.org/wiki/Bloom_filter) or some other such
> space-saving data structure, but I'm too lazy/busy/inexperienced to do it
> by hand.
>
> On Friday, December 5, 2014 5:10:54 PM UTC-8, Jason Merrill wrote:
>>
>> There might be a good solution to the particular problem you're trying to
>> solve, though. What are you trying to do?
>>
>> On Friday, December 5, 2014 5:08:08 PM UTC-8, John Myles White wrote:
>>>
>>> For specialized cases it is possible to achieve 1-1-ness:
>>> http://en.wikipedia.org/wiki/Perfect_hash_function
>>>
>>> But this is not something that most people aspire to do for most types
>>> since 1-1-ness isn't essential in most applications and is costly to
>>> achieve.
>>>
>>> -- John
>>>
>>> On Dec 5, 2014, at 5:03 PM, David Koslicki wrote:
>>>
>>> Ah, of course! I was hoping that on certain data types it was 1-1, but I
>>> guess that was a long shot. Thanks for clarifying.
>>>
>>> On Friday, December 5, 2014 4:57:41 PM UTC-8, Jason Merrill wrote:
>>>>
>>>> If the space of possible hashes is smaller than the space of possible
>>>> inputs (e.g. the hash is represented with fewer bits than the input data
>>>> is), which is typically the case, then you can use the pigeonhole
>>>> principle to prove what John wrote:
>>>>
>>>> https://en.wikipedia.org/wiki/Pigeonhole_principle
>>>>
>>>> On Friday, December 5, 2014 4:35:18 PM UTC-8, John Myles White wrote:
>>>>>
>>>>> This function is impossible to write in generality since hash
>>>>> functions aren't one-to-one.
>>>>>
>>>>> -- John
>>>>>
>>>>> On Dec 5, 2014, at 4:32 PM, David Koslicki wrote:
>>>>>
>>>>> > Hello,
>>>>> >
>>>>> > Is there a built in function that will undo hash()?
>>>>> >
>>>>> > i.e. I am looking for a function "dehash()" such that
>>>>> > dehash(hash("ACTG")) == "ACTG"
>>>>> >
>>>>> > I can't seem to find this anywhere (documentation, Google, this user
>>>>> > group, etc.).
>>>>> >
>>>>> > Thanks,
>>>>> >
>>>>> > ~David
>>>>>
>>>>>
>>>