On Sat, Feb 15, 2014 at 10:24PM, Jay Vyas wrote:
> Glad to hear there is some interest. Here is a JIRA to take it further.
>
> https://issues.apache.org/jira/browse/BIGTOP-1212
>
> @Cos, we need something flexible enough to handle different types of data
> sets, and possibly embed patterns in the data. Do you know of any place to
> start? Is GridMix, for example, or SLive, pluggable in that way?

I don't think either of these would really work. Let's investigate.

> If not, we might have to hack our own together.
>
> Maybe respond in BIGTOP-1212 above.
>
>
> On Sat, Feb 15, 2014 at 9:47 PM, Konstantin Boudnik <[email protected]> wrote:
>
> > Neat idea! I think the answer depends on what kind of data we want to
> > generate.
> > - I had a good run with gridmix for a variety of longevity loads (too bad
> > Cloudera never released the code to open source).
> > - for HDFS testing we can use SLive and DFSIO (BIGTOP-1208 and
> > BIGTOP-1209), which are pretty much ready, it seems
> >
> > At any rate, I'd prefer to incorporate something readily available that
> > has a good community behind it, so we won't end up supporting a big chunk
> > of specialized software.
> >
> > So, what do you have in mind? Any details?
> > Cos
> >
> > On Sat, Feb 15, 2014 at 09:19AM, Jay Vyas wrote:
> > > Hi bigtop. Are we interested in maintaining our own infra for generating
> > > fake data, rather than relying on and downloading external data sources
> > > for smokes? Fake data is great for testing, I think...
> > >
> > > In bigpetstore I'm generating fake data, and have written a lot of code
> > > to do this in the custom input formats... but I just found:
> > >
> > > http://codearte.github.io/jfairy/
> > >
> > > Which is a groovy tool for doing the same...
> > >
> > > I wonder whether generating fake data for testing big data should be a
> > > first-class part of bigtop? Would others use a utility, or just me?
> > >
> > > It might be another useful artifact for the community, especially for
> > > bigpetstore but also for testing a variety of other machine learning
> > > related projects...
> > >
> > > I think it's bad to rely on external websites for our tests. Maybe in
> > > time we could move over to our own internally curated/generated data
> > > sets, and a data generation tool like the above moves us in that
> > > direction.
> > >
> >
>
> --
> Jay Vyas
> http://jayunit100.blogspot.com
