We could create a single Docker image with a wrapper that delegates to each data generator, or create a separate Docker image for each generator. Each data generator has its own set of parameters that need to be specified, though (e.g., numbers of customers and stores).
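To make the single-image option a bit more concrete, here is a rough sketch of what such a wrapper might look like. It assumes a hypothetical layout in which each generator ships as a jar named <scheme>-generator.jar under /opt/bigtop-data-gen and is selected with the --scheme flag from the docker run examples quoted below. The function name, the jar names, the install path, and the RUNNER variable are all placeholders, not anything that exists in Bigtop today.

```shell
#!/bin/sh
# Hypothetical wrapper: picks a generator by --scheme and forwards every other
# argument untouched. Jar names and the install path are made up for illustration.

GEN_HOME="${GEN_HOME:-/opt/bigtop-data-gen}"   # assumed install location
RUNNER="${RUNNER:-java -jar}"                  # set RUNNER=echo for a dry run

bigtop_data_gen() {
    scheme=""
    n=$#
    # Walk the arguments once: pull out --scheme and its value, and re-append
    # everything else so the remaining options keep their original order.
    while [ "$n" -gt 0 ]; do
        arg=$1; shift; n=$((n - 1))
        if [ "$arg" = "--scheme" ]; then
            if [ "$n" -eq 0 ]; then
                echo "--scheme needs a value" >&2
                return 2
            fi
            scheme=$1; shift; n=$((n - 1))
        else
            set -- "$@" "$arg"
        fi
    done

    case "$scheme" in
        bigpetstore|bazaar|weather)
            jar="$GEN_HOME/$scheme-generator.jar" ;;
        "")
            echo "usage: bigtop-data-gen --scheme NAME [generator options]" >&2
            return 2 ;;
        *)
            echo "unknown scheme: $scheme" >&2
            return 2 ;;
    esac

    # Generator-specific options (customers, stores, ...) are not interpreted
    # here; each generator parses its own.
    $RUNNER "$jar" "$@"
}
```

A script built around this would end with bigtop_data_gen "$@". With RUNNER=echo, calling bigtop_data_gen --scheme weather --size 5GB prints /opt/bigtop-data-gen/weather-generator.jar --size 5GB instead of launching anything. The wrapper stays small because it never looks at the per-generator parameters.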
However, we could use Jay's transaction queue idea for all of the data generators.

Evans, can you elaborate on the use case? If you want to write to HDFS, we would need a driver that pulls in some Hadoop libraries. Generally, we've been solving this by writing MapReduce or Spark jobs that call the libraries to parallelize and generate the data. That is, the data generators are just libraries that do computation, while the driver handles parallelization and output.

On Mon, Aug 31, 2015 at 9:44 AM, Jay Vyas <[email protected]> wrote:

> RJ, can we abstract the command line so that we have "one cli to rule
> them all" into an interface?
>
> > On Aug 31, 2015, at 10:40 AM, Evans Ye <[email protected]> wrote:
> >
> > I very much like the shell script wrapper and docker image idea, since
> > that way we can integrate it directly with the bigtop provisioner,
> > which yields a perfect UX for the whole thing. I think it's not too
> > hard to do both; we just need to add a parameter to turn the script
> > into daemon mode. I see lots of images doing it this way.
> >
> > docker run bigtop/bigtop-data-gen --scheme weather --size 5GB --output data-dir --etc foo --etc bar --daemon
> >
> > On Aug 31, 2015, at 9:06 PM, "RJ Nowling" <[email protected]> wrote:
> >
> >> The BigPetStore, Bazaar, and weather data generators have
> >> single-threaded command-line interfaces. We could do the same with
> >> the smaller generators (names, locations, etc.) if there is interest.
> >>
> >> On Mon, Aug 31, 2015 at 5:24 AM, Jay Vyas <[email protected]> wrote:
> >>
> >>> Nate: Good idea to abstract the interface one level higher....
> >>>
> >>> How about a docker run command? That is probably the easiest way for
> >>> Linux folks to run one-off Java apps nowadays.
> >>>
> >>> docker run bigtop/bigtop-data-gen --scheme weather --size 5GB --output data-dir --etc foo --etc bar
> >>>
> >>> I'm happy to curate such a docker image, I already am doing
> >>> something like this in kube for bigtop-transaction-queue, which
> >>> continuously pumps data generator outputs into a REST endpoint or
> >>> file queue... So it could be extended to support other generators.
> >>>
> >>>> om> <[email protected]> wrote:
> >>>>
> >>>> Could picture at some point supporting something like this for
> >>>> non-jvm folk just looking for test/demo data:
> >>>>
> >>>> apt-get install bigtop-data-gen
> >>>> ~/ $ bigtop-data-gen --scheme weather --size 5GB --output data-dir --etc foo --etc bar
> >>>>
> >>>> -----Original Message-----
> >>>> From: jay vyas [mailto:[email protected]]
> >>>> Sent: Sunday, August 30, 2015 5:11 PM
> >>>> To: [email protected]
> >>>> Subject: Re: Proposal for "BigTop Data Generators"
> >>>>
> >>>> Hola nate. Well, here are the use cases I know of that I have used
> >>>> the data generators for.
> >>>>
> >>>> Dockerfile:
> >>>>
> >>>> (1) for testing kubernetes. For this, I just use transaction-queue
> >>>> docker file.
> >>>> (2) for testing GlusterFS small file workloads, maybe with other
> >>>> analytics tools...
> >>>>
> >>>> Maven repo:
> >>>>
> >>>> (3) Java mapreduce/ignite/spark applications, which can just add a
> >>>> mvn repo when compiling. Java developers never add jars through
> >>>> RPM repos.
> >>>>
> >>>> RPM/DEB packages:
> >>>>
> >>>> I could see people using an RPM/DEB data generator, and I'm not
> >>>> against it. But I simply don't know of any real world projects
> >>>> which *currently* need RPM/Deb packages, which is why I haven't
> >>>> bothered to propose it as a requirement. Nevertheless linux
> >>>> packages are always a welcome addition if someone wants to create
> >>>> em !
> >>>>
> >>>>> On Sun, Aug 30, 2015 at 4:34 PM, <[email protected]> wrote:
> >>>>>
> >>>>> Would container be in addition to deb/rpm, or instead of? If latter
> >>>>> can we do deb/rpm as base then have container either created from
> >>>>> them or directly from artifacts?
> >>>>>
> >>>>> On test usage side, seems could probably break up tests into
> >>>>> base/required and then optional/add-on tests/test-suites. Think
> >>>>> remember seeing mention of certain tests that are failing at times
> >>>>> on certain component(s) anyways in the core builds but don’t mean
> >>>>> that the build is broken, so would make sense to have some clean up
> >>>>> around those anyways.
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: RJ Nowling [mailto:[email protected]]
> >>>>> Sent: Sunday, August 30, 2015 1:11 PM
> >>>>> To: [email protected]
> >>>>> Subject: Re: Proposal for "BigTop Data Generators"
> >>>>>
> >>>>> I agree with the above. :)
> >>>>>
> >>>>> On Sun, Aug 30, 2015 at 11:19 AM, Jay Vyas <[email protected]> wrote:
> >>>>>
> >>>>>> Hi RJ.
> >>>>>>
> >>>>>> Maven repositories and docker containers for the transaction queue
> >>>>>> are good enough IMO. That will give people a way to compose them
> >>>>>> in different idioms (one for Java folks, another for broader Linux
> >>>>>> audience).
> >>>>>>
> >>>>>> I think the lib designs are fairly intuitive. I would say that we
> >>>>>> should constrain them all to being written in Java or Groovy to
> >>>>>> keep the bigtop theme of "JVM for everything" :).
> >>>>>>
> >>>>>> Any particular questions you have around technical design can be
> >>>>>> followed in a JIRA or else maybe a Readme spec that goes in a top
> >>>>>> level of the data-generators dir...
> >>>>>>
> >>>>>>> On Aug 30, 2015, at 1:51 AM, RJ Nowling <[email protected]> wrote:
> >>>>>>>
> >>>>>>> I'd like to keep this conversation going.
> >>>>>>>
> >>>>>>> So here are a few discussion points:
> >>>>>>>
> >>>>>>> 1. How do we want to make the data generators available? Maven?
> >>>>>>> RPMs and Debs?
> >>>>>>>
> >>>>>>> For now, I'm using a gradle multi-project build to easily build
> >>>>>>> and install the BPS data generators and its libraries into a
> >>>>>>> local maven repo. This makes development easy. Eventually, I
> >>>>>>> would like to post binaries through Maven for easy integration by
> >>>>>>> users. RPMs / Debs could be interesting since I use a pattern
> >>>>>>> where the data generators are libraries (to support application
> >>>>>>> integration / parallelization by the host framework) but also
> >>>>>>> provide CLI drivers for local testing.
> >>>>>>>
> >>>>>>> 2. The idea of using the data generators as part of the smoke
> >>>>>>> tests came up. Since there is concern about making the data
> >>>>>>> generators required, we could offer the blueprints (BigPetStore)
> >>>>>>> as optional smoke tests. Would that be a good compromise?
> >>>>>>>
> >>>>>>> 3. How will they be maintained?
> >>>>>>>
> >>>>>>> I'll certainly add myself to the maintainers list and will be
> >>>>>>> taking responsibility. I'm happy to have others help as well if
> >>>>>>> anyone wants to -- if not, that's cool, too.
> >>>>>>>
> >>>>>>> 4. Is anyone interested at all in discussing library APIs and
> >>>>>>> designs? What about internal interfaces and such?
> >>>>>>>
> >>>>>>> My plan was to add at least one more data generator (weather
> >>>>>>> simulator) to bigtop-data-generators in the short term. However,
> >>>>>>> given the concerns raised by Cos (more discussion needed) and
> >>>>>>> Olaf (don't want to force data generators on unsuspecting
> >>>>>>> users ;) ), I would like to reach some consensus on what people
> >>>>>>> are concerned about and solutions.
> >>>>>>>
> >>>>>>> On Thu, Aug 27, 2015 at 12:38 PM, Konstantin Boudnik <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> Fine by me. I have linked this thread to the JIRA ticket that RJ
> >>>>>>>> created, so we have a way to connect one to another ;)
> >>>>>>>>
> >>>>>>>>> On Thu, Aug 27, 2015 at 01:02PM, Olaf Flebbe wrote:
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> I am not confident that moving important design discussions
> >>>>>>>>> with impact to the whole project to jira is a good idea.
> >>>>>>>>>
> >>>>>>>>> In the current JIRA traffic storm it is not easy to identify
> >>>>>>>>> and follow important tickets.
> >>>>>>>>>
> >>>>>>>>> Please keep discussions on the list or at least, please state
> >>>>>>>>> on this list which ticket to follow ...
> >>>>>>>>>
> >>>>>>>>> Olaf
> >>>>>>>>>
> >>>>>>>>>> On 26.08.2015 at 22:56, Konstantin Boudnik <[email protected]> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> On Wed, Aug 26, 2015 at 10:38PM, Olaf Flebbe wrote:
> >>>>>>>>>>> Hi,
> >>>>>>>>>>>
> >>>>>>>>>>> Nice to have data generators in Bigtop.
> >>>>>>>>>>>
> >>>>>>>>>>> But please do not include it in bigtop_utils, since this
> >>>>>>>>>>> package is mandatory. Not everyone needs a data generator.
> >>>>>>>>>>
> >>>>>>>>>> Yup. And let's move further design discussion to the JIRA!
> >>>>>>>>>>
> >>>>>>>>>>> Olaf
> >>>>>>>>>>>
> >>>>>>>>>>>> On 26.08.2015 at 11:25, Jay Vyas <[email protected]> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Publishing the jar to bigtop's maven is probably a good
> >>>>>>>>>>>> first step, then apps can just include it as needed...?
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'm not against packaging if someone wants packages for
> >>>>>>>>>>>> this. Maybe even include it in bigtop util ?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Let's move to jira,
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On Aug 25, 2015, at 9:41 PM, Konstantin Boudnik <[email protected]> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> It is pretty cool indeed!
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I wonder how it needs to be structured to be:
> >>>>>>>>>>>>> - easy to access/use from other components wherever it is
> >>>>>>>>>>>>>   needed
> >>>>>>>>>>>>> - doesn't interfere with the rest of the stack
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I guess one possible way would be to implement the
> >>>>>>>>>>>>> generator as a set of maven artifacts, that could be
> >>>>>>>>>>>>> installed/consumed transparently by just declaring a
> >>>>>>>>>>>>> dependency e.g as proposed via top-level component.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Another way is to have a new package like we do for
> >>>>>>>>>>>>> bigtop-utils and such.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Perhaps this discussion should be moved to JIRA or shall we
> >>>>>>>>>>>>> continue on the dev@ ??
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Cos
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Sun, Aug 23, 2015 at 11:53AM, RJ Nowling wrote:
> >>>>>>>>>>>>>> Hi BigTop,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I had a discussion with Jay yesterday, we'd like to
> >>>>>>>>>>>>>> propose a new component for BigTop: BigTop Data
> >>>>>>>>>>>>>> Generators.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> BigTop Data Generators would consist of a common set of
> >>>>>>>>>>>>>> libraries for building data generators and three example
> >>>>>>>>>>>>>> data generators:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> * BigPetStore transaction generator (moved from
> >>>>>>>>>>>>>>   BigPetStore)
> >>>>>>>>>>>>>> * BigTop Bazaar -- attendee movement and interactions with
> >>>>>>>>>>>>>>   booths on a showroom floor, at a conference, or at a
> >>>>>>>>>>>>>>   mall
> >>>>>>>>>>>>>> * BigTop Weatherman -- stochastic weather simulation
> >>>>>>>>>>>>>>   (temperature, wind speed, wind chill, rainfall, etc.)
> >>>>>>>>>>>>>>   per zip code. (From a model trained on NOAA historical
> >>>>>>>>>>>>>>   weather data)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> We believe that creating a common set of libraries will
> >>>>>>>>>>>>>> have several benefits including:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> * Easier for others to build their own data generators
> >>>>>>>>>>>>>> * Make data generators smaller and easier to maintain
> >>>>>>>>>>>>>> * Share improvements across the data generators
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> More details on the libraries are below.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> BigPetStore will continue to focus on building and
> >>>>>>>>>>>>>> maintaining blueprints, powered by the BigTop Data
> >>>>>>>>>>>>>> Generators.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Our vision is that we get all of Apache coming to BigTop
> >>>>>>>>>>>>>> for tools for building better, more comprehensive
> >>>>>>>>>>>>>> blueprints. We want to support these efforts through data
> >>>>>>>>>>>>>> generators and the initial set of blueprints we've been
> >>>>>>>>>>>>>> building.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> If the community is generally in support of this, I can
> >>>>>>>>>>>>>> create a top-level "bigtop-data-generators" directory and
> >>>>>>>>>>>>>> put the data generators and libraries in there.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks!
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> RJ
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> -------
> >>>>>>>>>>>>>> Library details:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> So far, I've extracted the following common libraries:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> * Samplers -- provides classes for PDFs and various
> >>>>>>>>>>>>>>   samplers
> >>>>>>>>>>>>>> * Name generator -- data set and samplers for generating
> >>>>>>>>>>>>>>   names
> >>>>>>>>>>>>>> * Location data set -- data set and classes for US zip
> >>>>>>>>>>>>>>   codes, their GPS coordinates, median household incomes,
> >>>>>>>>>>>>>>   and population sizes
> >>>>>>>>>>>>>> * Product generator -- library for enumerating products
> >>>>>>>>>>>>>>   from a specification file. Comes with default
> >>>>>>>>>>>>>>   specifications for BigPetStore
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I also expect that I'll add libraries for:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> * Particle simulation -- customer movement in a room
> >>>>>>>>>>>>>> * Latent factor model generation -- generate latent
> >>>>>>>>>>>>>>   factors and customer weights to create something like
> >>>>>>>>>>>>>>   MovieLens data. Used in Bazaar for booth preferences and
> >>>>>>>>>>>>>>   potentially in BigPetStore for customer item preferences
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Most of these libraries came out of the BigPetStore data
> >>>>>>>>>>>>>> generator but the other generators have been refactored to
> >>>>>>>>>>>>>> be based off the standard set of libraries.
> >>>>
> >>>> --
> >>>> jay vyas
> >>
>
