Neat idea! I think the answer depends on what kind of data we want to generate.
 - I had a good run with gridmix for a variety of longevity loads (too bad
   Cloudera never released the code to open source).
 - for HDFS testing we can use SLive and DFSIO (BIGTOP-1208 and BIGTOP-1209
   are pretty much ready, it seems)

At any rate, I'd prefer to incorporate something readily available that
has a good community behind it, so we won't end up supporting a big chunk of
specialized software.

So, what do you have in mind? Any details?
  Cos

On Sat, Feb 15, 2014 at 09:19AM, Jay Vyas wrote:
> Hi bigtop.  Are we interested in maintaining our own infra for generating
> fake data, rather than relying on and downloading external data sources for
> smokes?  Fake data is great for testing, I think...
> 
> In bigpetstore I'm generating fake data, and have written a lot of code to
> do this in the custom input formats... but I just found:
> 
> http://codearte.github.io/jfairy/
> 
> Which is a groovy tool for doing the same....
> 
>   I wonder whether generating fake data for testing big data should be a
>   first-class part of bigtop?  Would others use a utility, or just me?
> 
> It might be another useful artifact for the community especially for
> bigpetstore but also for testing a variety of other machine learning related
> projects....
> 
> I think it's bad to rely on external websites for our tests. Maybe in time
> we could move over to our internally curated/generated data sets, and a
> data generation tool like the above moves us in that direction.
