Let's kick off the design discussion here. I'll provide some background on the current state of affairs.

The BigPetStore data generator is a library with 5-6 top-level classes, each of which generates a different type of data (stores, customers, products, purchasing profiles, and transactions). I designed it this way so that each part could be parallelized / distributed using a framework of the user's choice (Java threads, Spark, Hadoop, etc.). The data generator takes a handful of parameters: the number of stores, the number of customers, the number of purchasing profiles, and the number of days to simulate. (Everything else is defined as constants in a Constants class.)

The BPS Spark driver uses this library to generate data. Jay has also created a "transaction queue" which uses Java threads and pushes the data to a rotating file queue or sends it via HTTP.

The BPS data generator also has a built-in, non-threaded, serial CLI driver that I've used for testing and which generates CSV files:

    java -jar bigpetstore-data-generator-1.1.0-SNAPSHOT.jar output/ ...

My plan is to do the same for the other generators: implement them as libraries with a small set of parameters, if necessary. The libraries can be composed in that they can depend on one another. I'll provide CLI interfaces for simple use cases. All data are packaged with the libraries so that nobody has to track a bunch of separate data files.

These libraries should be easily callable from any Groovy or Java tests. Likewise, we can write Hadoop / Spark / etc. wrappers for them to be copied onto the cluster for data generation there.

Please let me know what you guys think.

On Thu, Aug 27, 2015 at 6:02 AM, Olaf Flebbe <[email protected]> wrote:

> Hi,
>
> I am not confident that moving important design discussions with impact to
> the whole project to jira is a good idea.
>
> In the current JIRA Traffic storm it is not easy to identify and follow
> important tickets.
>
> Please keep discussions on the list or at least, please state on this list
> which Ticket to follow ...
>
> Olaf
>
> > On 26.08.2015 at 22:56, Konstantin Boudnik <[email protected]> wrote:
> >
> > On Wed, Aug 26, 2015 at 10:38PM, Olaf Flebbe wrote:
> >> Hi,
> >>
> >> Nice to have data generators in Bigtop.
> >>
> >> But please do not include it in bigtop_utils, since this package is
> >> mandatory. Not everyone needs a data generator.
> >
> > Yup. And let's move further design discussion to the JIRA!
> >
> >> Olaf
> >>
> >>> On 26.08.2015 at 11:25, Jay Vyas <[email protected]> wrote:
> >>>
> >>> Publishing the jar to Bigtop's Maven is probably a good first step,
> >>> then apps can just include it as needed...?
> >>>
> >>> I'm not against packaging if someone wants packages for this. Maybe
> >>> even include it in bigtop util?
> >>>
> >>> Let's move to jira,
> >>>
> >>>> On Aug 25, 2015, at 9:41 PM, Konstantin Boudnik <[email protected]> wrote:
> >>>>
> >>>> It is pretty cool indeed!
> >>>>
> >>>> I wonder how it needs to be structured to be:
> >>>> - easy to access/use from other components wherever it is needed
> >>>> - doesn't interfere with the rest of the stack
> >>>>
> >>>> I guess one possible way would be to implement the generator as a set
> >>>> of Maven artifacts that could be installed/consumed transparently by
> >>>> just declaring a dependency, e.g. as proposed via a top-level component.
> >>>>
> >>>> Another way is to have a new package like we do for bigtop-utils and
> >>>> such.
> >>>>
> >>>> Perhaps this discussion should be moved to JIRA or shall we continue
> >>>> on the dev@?
> >>>>
> >>>> Cos
> >>>>
> >>>>> On Sun, Aug 23, 2015 at 11:53AM, RJ Nowling wrote:
> >>>>> Hi BigTop,
> >>>>>
> >>>>> I had a discussion with Jay yesterday, and we'd like to propose a new
> >>>>> component for BigTop: BigTop Data Generators.
> >>>>>
> >>>>> BigTop Data Generators would consist of a common set of libraries for
> >>>>> building data generators and three example data generators:
> >>>>>
> >>>>> * BigPetStore transaction generator (moved from BigPetStore)
> >>>>> * BigTop Bazaar -- attendee movement and interactions with booths on a
> >>>>>   showroom floor, at a conference, or at a mall
> >>>>> * BigTop Weatherman -- stochastic weather simulation (temperature, wind
> >>>>>   speed, wind chill, rainfall, etc.) per zip code. (From a model trained
> >>>>>   on NOAA historical weather data)
> >>>>>
> >>>>> We believe that creating a common set of libraries will have several
> >>>>> benefits including:
> >>>>>
> >>>>> * Easier for others to build their own data generators
> >>>>> * Make data generators smaller and easier to maintain
> >>>>> * Share improvements across the data generators
> >>>>>
> >>>>> More details on the libraries are below.
> >>>>>
> >>>>> BigPetStore will continue to focus on building and maintaining
> >>>>> blueprints, powered by the BigTop Data Generators.
> >>>>>
> >>>>> Our vision is that we get all of Apache coming to BigTop for tools for
> >>>>> building better, more comprehensive blueprints. We want to support these
> >>>>> efforts through data generators and the initial set of blueprints we've
> >>>>> been building.
> >>>>>
> >>>>> If the community is generally in support of this, I can create a
> >>>>> top-level "bigtop-data-generators" directory and put the data generators
> >>>>> and libraries in there.
> >>>>>
> >>>>> Thanks!
> >>>>>
> >>>>> RJ
> >>>>>
> >>>>> -------
> >>>>> Library details:
> >>>>>
> >>>>> So far, I've extracted the following common libraries:
> >>>>>
> >>>>> * Samplers -- provides classes for PDFs and various samplers
> >>>>> * Name generator -- data set and samplers for generating names
> >>>>> * Location data set -- data set and classes for US zip codes, their
> >>>>>   GPS coordinates, median household incomes, and population sizes
> >>>>> * Product generator -- library for enumerating products from a
> >>>>>   specification file. Comes with default specifications for BigPetStore
> >>>>>
> >>>>> I also expect that I'll add libraries for:
> >>>>>
> >>>>> * Particle simulation -- customer movement in a room
> >>>>> * Latent factor model generation -- generate latent factors and
> >>>>>   customer weights to create something like MovieLens data. Used in
> >>>>>   Bazaar for booth preferences and potentially in BigPetStore for
> >>>>>   customer item preferences
> >>>>>
> >>>>> Most of these libraries came out of the BigPetStore data generator but the
> >>>>> other generators have been refactored to be based off the standard set of
> >>>>> libraries.
