Re: [nodejs] Announcement/Plea for help: Golems.io a Fixtures-As-A-Service solution

Ken Fri, 12 Jul 2013 14:34:10 -0700

Thanks for the link. I think that is pretty similar in intent, and I will 
contact the maintainer.  The main difference is that the golem generation 
algorithms aren't random: they're deterministic, reproducible, and 
reversible, which in turn gives golems persistence without needing to store 
them in a database or a file.  This means you have access to billions of 
different golems indefinitely, at essentially zero carrying cost.  I also 
put a lot of work into making the algorithms fast: I can generate the 
full attribute set for about 10,000 golems per second on my laptop, and 
the algorithm parallelizes almost without limit, so with enough CPU you 
could generate millions a second.  There's no database or filesystem to 
get in the way.
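
To make "deterministic and reproducible" concrete, here is a minimal sketch 
of growing a record from nothing but a seed.  It is not the golems.io 
algorithm: the name lists, seededRandom and makeGolem are made up for 
illustration, and a small seeded PRNG stands in for the real generator.

```javascript
// Illustrative only: a stand-in for seed-driven generation, not the golems.io code.
const GIVEN_NAMES = ["Kimberly", "Oswald", "Martin", "Izaiah"];
const FAMILY_NAMES = ["West", "Goldner", "Cooper", "Woodruff"];

// Small seeded PRNG (mulberry32-style): the same seed always yields the same sequence.
function seededRandom(seed) {
  let state = seed >>> 0;
  return function next() {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Build every attribute from the seed alone; nothing is stored anywhere.
function makeGolem(seed) {
  const next = seededRandom(seed);
  const pick = (list) => list[Math.floor(next() * list.length)];
  return {
    seed: seed >>> 0,
    given_name: pick(GIVEN_NAMES),
    family_name: pick(FAMILY_NAMES),
    username: "g" + (seed >>> 0).toString(36),
  };
}

// Asking for the same seed twice prints the same golem twice.
console.log(makeGolem(12345));
console.log(makeGolem(12345));
```

Because makeGolem touches no shared state, a range of seeds can be handed to 
separate processes and each will produce its golems independently.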


Why does reproducibility matter?  Most modern applications are aware that 
they exist within a complex environment populated with other applications 
and systems.  They're often exposed through multiple outlets, like web and 
mobile.  They integrate with external systems like email and Facebook.   In 
some sense the golems are just as "real" as real people, at least from the 
perspective of applications.   They have a name, they have an email 
address.  They have a phone number.   Within the right sandbox they can 
even have a Facebook account.   Real people can ensure consistency between 
multiple systems by entering the same data (username, email address, 
password) in multiple places because they "know" those things.  Golems 
offer something similar: as long as you know or can determine the seed 
number for a golem you can "remember" all of its other attributes (even 
ones you've never asked about before).  Because the algorithms are 
reversible you can actually "look up" a golem, e.g. I could find you all 
the golems that live in Seattle and are named Ken, or you can find a unique 
golem by searching on a few attributes that aren't inherently unique (like 
name and zipcode).
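
One way to picture the reversible part, for a globally unique field only: 
scramble the 32-bit seed with an invertible function and print the result 
as the username, so the username can be turned back into the seed.  This is 
a sketch under assumptions, not the actual golems.io scheme.  The Feistel 
construction, the round keys and all function names are hypothetical, and 
it does not cover searching on non-unique attributes.

```javascript
// Illustrative only: an invertible seed <-> username mapping (hypothetical scheme).
const ROUND_KEYS = [0x3a91, 0xc04d, 0x7f12, 0x15e8]; // made-up round keys

// Round function: any deterministic 16-bit function works; it need not be invertible.
function roundFn(half, key) {
  return (Math.imul(half ^ key, 0x9e3b) >>> 3) & 0xffff;
}

// Feistel network over two 16-bit halves: a bijection on 32-bit numbers.
function scramble(seed) {
  let left = seed >>> 16;
  let right = seed & 0xffff;
  for (const key of ROUND_KEYS) {
    [left, right] = [right, left ^ roundFn(right, key)];
  }
  return ((left << 16) | right) >>> 0;
}

// Undo the rounds in reverse order to get the original seed back.
function unscramble(code) {
  let left = code >>> 16;
  let right = code & 0xffff;
  for (const key of [...ROUND_KEYS].reverse()) {
    [left, right] = [right ^ roundFn(left, key), left];
  }
  return ((left << 16) | right) >>> 0;
}

function usernameForSeed(seed) {
  return "g" + scramble(seed).toString(36);
}

function seedFromUsername(username) {
  return unscramble(parseInt(username.slice(1), 36));
}

const username = usernameForSeed(12345);
console.log(username, seedFromUsername(username) === 12345);
```

A Feistel network is used here because it is invertible by construction no 
matter what the round function does, which keeps the example easy to check.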

Here's a more complex usage that perhaps shows off the potential better: 
anonymizing real customer data for use in test and development.  Imagine 
you have a big, long-lived website like Amazon.com.  You store data in a 
whole bunch of databases which are explicitly loosely connected via shared 
keys, e.g. an orders database and the customers database both hold an 
account_id.  But there's also a whole bunch of implicit connections, e.g. 
the person's name may appear on an invoice in the orders database, even 
though it's mastered in the customers database.   Let's say you have some 
new code you'd like to test out on realistic data, but you're prohibited 
for privacy reasons from actually copying production data to your test 
environment.  A typical approach would be to use a one-way hash to 
transform the production data into gibberish (e.g. Ken Woodruff becomes Obd 
Deacebee).  That has a few drawbacks, the obvious one being that the data 
looks like gibberish.  Another is that it's very likely to break those 
implicit connections, e.g. if the name in the orders database is stored as 
a single full name but in the customers database it's separate first and 
last names, the hash is unlikely to produce consistent output in both.  
There are also known issues with the security of this approach--if the 
hashing algorithm is known or guessable you can often determine the 
original value.  If instead of hashing you used realistic random data it 
wouldn't look like gibberish and wouldn't be reversible, but you'd still be 
hard pressed to make it consistent.  You could try converting the 
customer database first using random data, and then use lookups to populate 
data in the orders database, but this would be insanely slow.  With golems 
it's a lot easier: just use a shared unique number like the account_id as a 
golem seed and retrieve whatever attributes are needed for the database at 
hand.  You wind up with data that is nearly as realistic and just 
as coherent as the original, but has absolutely no relationship with it.
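
As a rough sketch of that workflow, assume two tables that share only an 
account_id.  Each one is rewritten independently, yet they stay consistent 
because both derive the replacement name from the same seed.  The column 
names, the name lists and the mix/golemFor helpers are all hypothetical, 
and the simple mixer is only a stand-in for the real golem algorithm.

```javascript
// Illustrative only: seed-keyed anonymization across two tables (hypothetical schema).
const GIVEN_NAMES = ["Kimberly", "Oswald", "Martin", "Izaiah"];
const FAMILY_NAMES = ["West", "Goldner", "Cooper", "Biter"];

// Deterministic 32-bit mixer, standing in for the real generator.
function mix(n) {
  let x = n >>> 0;
  x = Math.imul(x ^ (x >>> 16), 0x45d9f3b);
  x = Math.imul(x ^ (x >>> 16), 0x45d9f3b);
  return (x ^ (x >>> 16)) >>> 0;
}

// The account_id is the seed: the same id always yields the same fake person.
function golemFor(accountId) {
  const h = mix(accountId);
  return {
    given_name: GIVEN_NAMES[h % GIVEN_NAMES.length],
    family_name: FAMILY_NAMES[(h >>> 8) % FAMILY_NAMES.length],
  };
}

// Customers table stores the name as two columns.
function anonymizeCustomer(row) {
  const g = golemFor(row.account_id);
  return { ...row, first_name: g.given_name, last_name: g.family_name };
}

// Orders table stores the name as one string on the invoice.
function anonymizeOrder(row) {
  const g = golemFor(row.account_id);
  return { ...row, invoice_name: `${g.given_name} ${g.family_name}` };
}

const customer = anonymizeCustomer({ account_id: 1001, first_name: "Ken", last_name: "Woodruff" });
const order = anonymizeOrder({ account_id: 1001, order_id: 7, invoice_name: "Ken Woodruff" });
// The implicit connection survives without any lookup between the two tables.
console.log(order.invoice_name === `${customer.first_name} ${customer.last_name}`);
```

No lookup table is needed between the two conversions, which is what avoids 
the slow cross-database step described above.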

On Friday, July 12, 2013 1:25:39 PM UTC-7, Martin Cooper wrote:
>
> What you're describing sounds a lot like this:
>
> http://www.generatedata.com/
>
> What would be the key differentiators?
>
> --
> Martin Cooper
>
>
> On Fri, Jul 12, 2013 at 12:45 PM, Ken <[email protected]> wrote:
>
>>
>> TL;DR: I started something I can't finish, want to help me?
>>
>> Many times in my career I've found a need for large volumes of realistic 
>> test data (aka "fixtures"),
>> and have long had a thought at the back of my mind that it could be well 
>> provided by a service.
>> Last year I had time to work on the idea (using node.js), and made some 
>> good progress building
>> the core technology in a project I'm calling golems.io.  Later in the 
>> year I got
>> sucked into a new venture (http://www.snupi.com/) and no longer have 
>> time to dedicate to this.
>> However I don't want the effort so far to go to waste, and am wondering 
>> if 1) the 
>> community thinks this is a potentially valuable service, and 2) if there 
>> are any
>> people or organizations out there interested in taking it on.  
>>
>> So what's the point of this?  There are many, but the most obvious use 
>> case I can think of is automated testing of web sites/services.
>> Consider this form-filling example from zombie.js 
>> <http://zombie.labnotes.org/#Feeding>:
>>
>>   browser.
>>     fill("Your Name", "Arm Biter").
>>     fill("Profession", "Living dead").
>>     select("Born", "1968").
>>     uncheck("Send me the newsletter").
>>     pressButton("Sign me up", function() {
>>       // Make sure we got redirected to thank you page.
>>       assert.equal(browser.location.pathname, "/thankyou");
>>     });
>>
>> Consider the second arguments to fill and select--where do these come 
>> from?  Hard-coded values are perhaps fine 
>> for a simple unit test, but what if you wanted to create a few hundred or 
>> a few thousand subscribers for 
>> your newsletter?  That's where a fixture generator comes in handy, but 
>> there are serious limitations to the
>> existing ones.  First consider the output of a classic random fixture 
>> generator like Faker <https://github.com/marak/Faker.js/>:
>>
>>   {
>>     "name":"Oswald Goldner",
>>     "username":"Izaiah",
>>     "email":"[email protected]",
>>     "address":{"zipcode":"35411"},
>>     "phone":"1-658-413-1550"
>>   }
>>
>> While the text for each individual field is reasonable, there's no 
>> overall coherency: the name ("Oswald...")
>> and the email ("Michael...") suggest two different people, the zip code 
>> suggests Alabama, the area code Oregon,
>> the email address isn't actually usable, etc.  There's another problem 
>> with using random data: it's nearly
>> impossible to reproduce.  Run the exact same test again and you could get 
>> totally different results.
>>
>> Now imagine a signup similar to the above that includes a password and a 
>> typical email confirmation step,
>> where you send an email to the user and they click on a link in the email 
>> that includes a temporary unique key
>> and requires them to reenter their password, then shows them a 
>> personalized "congratulations Oswald Goldner" page.
>> To test that scenario you need an email address that works, some way 
>> to access that address's mailbox,
>> and some way of knowing or remembering the password and name across 
>> steps.  This becomes quite difficult using
>> transient random fixtures.
>>
>> Golems are a different approach to generating fixtures, using a 
>> deterministic but chaotic encryption algorithm
>> instead of a random number generator.  Realistic statistical data sets 
>> are used for demographic data and care is taken to ensure
>> consistency when possible, e.g. zip code and area code; year of birth, 
>> gender and first name (they correlate
>> surprisingly strongly, especially for females).  Here's an example:
>>
>>   {
>>     "gender":"female",
>>     "given_name":"Kimberly",
>>     "family_name":"West",
>>     "birth_date":"1979-05-23",
>>     "username":"g22yjght",
>>     "password":"2r%B0m%B",
>>     "email":"[email protected]",
>>     "address":{"postal_code":"94947"},
>>     "phone_number":"(707) 229-7163"
>>   }
>>
>> Every golem is grown from a single unique 32-bit number.  Given that 
>> number (plus some keys and the right version of the code) you can fully 
>> recreate every attribute of the golem.  That number is directly derivable 
>> from a few of the fields which are globally unique (username and email in 
>> this case), but can typically also be recovered from a few pieces
>> of non-unique information (name and phone number would be enough).  To 
>> fulfill the second part of our email-confirmation-round-trip test we 
>> actually need nothing more than the email address to which the 
>> confirmation was sent: from this we can recreate the full golem, retrieve 
>> the password, and even verify that the user's
>> name is correctly displayed on the congratulations page.
>>
>> So where is this project at?  I've developed all of the core enabling 
>> technology, and much of the low-level stuff
>> is already available on npm <https://npmjs.org/%7Efemto113> and 
>> github <https://github.com/femto113>.  There's a prototype version of a 
>> service running on Heroku <http://api.golems.io/person/random.json>.
>> Some of the more advanced features (like an API to let you retrieve mail 
>> sent to golems) are in prototype state.
>> I've even created a zombie/golem hybrid (a glombie) that lets you use 
>> zombie.js to test websites without having 
>> to make up test data, and have outlined similar approaches that should 
>> work with stuff like phantomjs & casperjs 
>> <http://casperjs.org/api.html#casper.fill>.
>>
>> There's a lot of detail I haven't been able to go into here, but if 
>> anybody is genuinely intrigued please
>> contact me, I'm happy to discuss.  I've always imagined that this would 
>> make a good freemium
>> service with a mixed open/closed source approach.  It might also make a 
>> great add-on or alternative to 
>> scale testing services like https://www.blitz.io/.  My first choice 
>> would be to find anyone interested in helping
>> make this into a viable service business, but failing that I'm open to 
>> open sourcing the whole thing as long
>> as there's someone (or some organization) that will actually carry it 
>> forward.
>>
>> --Ken
>>
>>  -- 
>> -- 
>> Job Board: http://jobs.nodejs.org/
>> Posting guidelines: 
>> https://github.com/joyent/node/wiki/Mailing-List-Posting-Guidelines
>> You received this message because you are subscribed to the Google
>> Groups "nodejs" group.
>> To post to this group, send email to [email protected]
>> To unsubscribe from this group, send email to
>> [email protected]
>> For more options, visit this group at
>> http://groups.google.com/group/nodejs?hl=en?hl=en
>>  
>> --- 
>> You received this message because you are subscribed to the Google Groups 
>> "nodejs" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> For more options, visit https://groups.google.com/groups/opt_out.
>>  
>>  
>>
>
>

