Request for test data based off of obfuscated live data

John Palmieri Tue, 18 Nov 2008 11:01:19 -0800

Hey guys,

On IRC the other day I asked for the ability to use live data, stripped of all 
personally identifying information, to create a test bed for MyFedora 
development.  I was asked to write up my reasons for needing this data, as it 
was hard to explain in detail on IRC.


Current Development

First, let's go over my current development process.  Right now I work on live 
data.  Since most of my code only reads data, this isn't an issue except for 
perhaps putting load on the servers when testing.  It becomes less ideal, 
however, when I need to test code that modifies data, such as pushing a build.  
Even more daunting is adding functionality to one of the other apps: creating 
data that somewhat reflects the real world is time consuming and is often a 
blocker that makes me move on to something else.

Why I need the data

So why is it important to have real world data, or at least a semblance of it?  
Because I am working on something that will consolidate a lot of the data into 
one interface, I hit the vast majority of our infrastructure while treating it 
as one entity.  If each piece of infrastructure lived in isolation this 
wouldn't be as big an issue, but as it stands the data has keys which link each 
record in one piece of infrastructure to a record in another.  For instance, 
FAS usernames link to builds in Koji, whose build numbers link to releases in 
Bodhi.  I need data with those links intact so I can follow the workflow from 
one tool to another, test access rights and simulate the progression of various 
data through the pieces of infrastructure, without worrying about stomping on 
the data because I can quickly restore it to its initial state.  Also, I can't 
hit every edge case, so I need to concentrate on how the data most commonly 
flows, and having something that resembles what we see on the production 
servers is key there.

What I am asking for   

As stated above, I would like a data set representing the data one would see in 
our infrastructure.  Ideally this would mean a secure process that dumps data 
from Koji, Bodhi, FAS and pkgdb while obfuscating all personally identifying 
data.  This could include switching package owners and uids at random so that 
the data cannot be traced back (though in reality one could gather this data 
slowly by querying each of the infrastructure pieces).  I only need a 
relatively small sampling of, say, a month's worth of data and a semi-random 
drawing of the most active contributors and their packages.  I can update dates 
to keep the data "current" for testing purposes.  Every once in a while I would 
need a fresh sampling to make sure the code doesn't just work with my sample 
set.
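
To make the obfuscation idea above more concrete, here is a minimal sketch of 
one way it could work.  It is not an existing tool.  Instead of swapping owners 
at random, it replaces each username with a keyed hash, so the same person gets 
the same fake name in every dump and the links between systems survive.  The 
field names, the sample rows and the secret are all hypothetical and do not 
reflect the real Koji, Bodhi or FAS schemas.

```python
import hashlib
import hmac


def pseudonym(secret, kind, value):
    """Map a real identifier to a stable fake one using a keyed hash."""
    message = "{}:{}".format(kind, value).encode("utf-8")
    digest = hmac.new(secret, message, hashlib.sha256)
    return "{}_{}".format(kind, digest.hexdigest()[:12])


def obfuscate(records, secret, user_fields=("owner", "username", "submitter")):
    """Return copies of the records with every user field pseudonymized."""
    cleaned = []
    for record in records:
        copy = dict(record)
        for field in user_fields:
            if field in copy:
                copy[field] = pseudonym(secret, "user", copy[field])
        cleaned.append(copy)
    return cleaned


# Hypothetical sample rows standing in for dumps from each system.
fas_users = [{"username": "jdoe", "groups": ["packager"]}]
koji_builds = [{"nvr": "foo-1.0-1.fc9", "owner": "jdoe",
                "tag": "dist-f9-updates-candidate"}]
bodhi_updates = [{"build": "foo-1.0-1.fc9", "submitter": "jdoe"}]

# Assumption: a fresh secret is generated for each dump and then discarded.
secret = b"replace-me-for-every-dump"

clean_users = obfuscate(fas_users, secret)
clean_builds = obfuscate(koji_builds, secret)
clean_updates = obfuscate(bodhi_updates, secret)
```

Because the same secret is used for all three dumps, the obfuscated FAS user 
still owns the obfuscated Koji build and the matching Bodhi update.  Discarding 
the secret afterwards means the mapping cannot simply be recomputed later.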

Why pure random data isn't sufficient

Random data does not produce the relationships needed to work with the entire 
Fedora infrastructure, and even if it did, the data would not cover real world 
scenarios and most likely the relationships would be largely invalid (like a 
build tagged for F-8 released in F-9).  Also, things like Koji tags and group 
information absolutely need to conform to the structure we have set up.  For 
instance, I key off of the string "updates-candidate" to determine if I should 
show a button to push the build to Bodhi.  The button also relies on FAS 
telling Bodhi that the currently logged-in user is in the correct group to 
push.  If the build is not an updates candidate or the user is not in the 
correct group, the button does not show.
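
The button rule above boils down to a check like the following.  This is only 
an illustrative sketch, not the actual MyFedora code: the function name, the 
argument shapes and the "packager" group name are assumptions.

```python
def can_show_push_button(build_tags, user_groups, required_group="packager"):
    """Show the push button only for updates candidates and allowed users."""
    # A build qualifies if any of its tags contains "updates-candidate".
    is_candidate = any("updates-candidate" in tag for tag in build_tags)
    # The group list is what FAS would report for the logged-in user.
    return is_candidate and required_group in user_groups
```

With purely random tag strings or group names this check would almost never 
pass, which is why the test data has to keep the real naming structure.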

What I would do with this data

I would be able to accelerate development of the more interesting bits of 
MyFedora while also being able to experiment and quickly produce patches to 
various bits of infrastructure.  For instance, FAS already has all the API I 
need to edit my profile, but it is not exposed outside of FAS because it lacks 
a simple @allow_json decorator, so I had to drop that feature until after the 
development freeze, when a new FAS with the patch is put into production.  Even 
then, modifying data on a production server, even if it is my own profile, is 
not an ideal way to test.  If I had a data set I could set up my own test 
environment, apply the patch and test before we deploy.  I could then go and 
patch other parts of the infrastructure to, say, speed up a query, add queries 
I need and generally improve the base infrastructure as I develop MyFedora.  
The patches would then be sent to trac and accepted or rejected in the usual 
manner.

Others could also more easily get into hacking on infrastructure bits, as they 
would have a place to start instead of a daunting blank slate.  If I can get 
the data, I am more than happy to write scripts and kickstart files to easily 
set up and tear down a Fedora Infrastructure test and development instance.

Whatever solution the infrastructure team thinks is good for what I need will 
be workable.  The above is what I think I need and an explanation of why it is 
needed.  Hopefully there is some solution we can agree on so we can move 
forward fairly quickly.  Thanks for your time.

--
John (J5) Palmieri
Software Engineer
Red Hat, Inc.

_______________________________________________
Fedora-infrastructure-list mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list
