Glad to hear there is some interest. Here is a JIRA to take it further. https://issues.apache.org/jira/browse/BIGTOP-1212
@Cos, we need something flexible enough to generate different types of
data sets, and possibly to embed patterns in the data. Do you know of any
place to start? Is GridMix, for example, or SLive, pluggable in that way?
If not, we might have to hack our own together. Maybe respond in
BIGTOP-1212 above.

On Sat, Feb 15, 2014 at 9:47 PM, Konstantin Boudnik wrote:

> Neat idea! I think the answer depends on what kinda data we want to
> generate.
> - I had a good run with gridmix for a variety of longevity loads (too
>   bad Cloudera never released the code to open source).
> - for HDFS testing we can use SLive and DFSIO (BIGTOP-1208 and
>   BIGTOP-1209), which are pretty much ready, it seems
>
> At any rate, I'd rather incorporate something readily available that
> has a good community behind it, so we won't end up supporting a big
> chunk of specialized software.
>
> So, what do you have in mind? Any details?
> Cos
>
> On Sat, Feb 15, 2014 at 09:19AM, Jay Vyas wrote:
> > Hi bigtop. Are we interested in maintaining our own infra for
> > generating fake data, rather than relying on and downloading external
> > data sources for smokes? Fake data is great for testing, I think...
> >
> > In bigpetstore I'm generating fake data, and have written a lot of
> > code to do this in the custom input formats... but I just found:
> >
> > http://codearte.github.io/jfairy/
> >
> > Which is a groovy tool for doing the same...
> >
> > I wonder whether generating fake data for testing big data should be
> > a first-class part of bigtop? Would others use a utility, or just me?
> >
> > It might be another useful artifact for the community, especially for
> > bigpetstore but also for testing a variety of other machine learning
> > related projects...
> >
> > I think it's bad to rely on external websites for our tests. Maybe in
> > time we could move over to our own internally curated/generated data
> > sets, and a data generation tool like the above moves us in that
> > direction.

-- 
Jay Vyas
http://jayunit100.blogspot.com
