Glad to hear there is some interest.  Here is a JIRA to take it further.

https://issues.apache.org/jira/browse/BIGTOP-1212

@Cos, we need something flexible enough to generate different types of data
sets, and possibly embed patterns in the data. Do you know of any place to
start? Is GridMix, for example, or SLive, pluggable in that way?

If not, we might have to hack our own together.

Maybe respond in BIGTOP-1212 above.


On Sat, Feb 15, 2014 at 9:47 PM, Konstantin Boudnik <[email protected]> wrote:

> Neat idea! I think the answer depends on what kinda data we want to
> generate.
>  - I had a good run with gridmix for a variety of longevity loads (too bad
>    Cloudera never released the code to open source).
>  - for HDFS testing we can use SLive and DFSIO (BIGTOP-1208 and
>    BIGTOP-1209), which are pretty much ready, it seems
>
> At any rate, I'd prefer to incorporate something readily available that
> has a good community behind it, so we won't end up supporting a big chunk
> of specialized software.
>
> So, what do you have in mind? Any details?
>   Cos
>
> On Sat, Feb 15, 2014 at 09:19AM, Jay Vyas wrote:
> > Hi bigtop.  Are we interested in maintaining our own infra for generating
> > fake data, rather than relying on and downloading external data sources
> > for smokes?  Fake data is great for testing I think...
> >
> > In bigpetstore I'm generating fake data, and have written a lot of code
> > to do this in the custom input formats... but I just found:
> >
> > http://codearte.github.io/jfairy/
> >
> > Which is a groovy tool for doing the same....
> >
> >   I wonder whether generating fake data for testing big data should be a
> >   first-class part of bigtop?  Would others use a utility, or just me?
> >
> > It might be another useful artifact for the community, especially for
> > bigpetstore but also for testing a variety of other machine learning
> > related projects...
> >
> > I think it's bad to rely on external websites for our tests. Maybe in
> > time we could move over to our own internally curated/generated data
> > sets, and a data generation tool like the above moves us in that
> > direction.
>
>


-- 
Jay Vyas
http://jayunit100.blogspot.com
