The BigPetStore, Bazaar, and weather data generators have single-threaded command-line interfaces. We could do the same with the smaller generators (names, locations, etc.) if there is interest.
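
As a rough illustration only, a CLI driver for the name generator might look something like the sketch below. Everything in it is hypothetical: the class name `NameGenCli`, the `--count` and `--seed` flags, and the tiny hard-coded name lists are placeholders standing in for the real library's data set and samplers, not its actual API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch of a single-threaded CLI driver; not the actual
// bigtop-data-generators API.
final class NameGenCli {
    // Placeholder data. A real driver would load the name generator
    // library's data set instead of these hard-coded lists.
    private static final String[] FIRST = {"Ada", "Grace", "Alan", "Edsger"};
    private static final String[] LAST = {"Lovelace", "Hopper", "Turing", "Dijkstra"};

    // Parses "--key value" pairs into a map and rejects anything else.
    static Map<String, String> parseArgs(String[] args) {
        if (args.length % 2 != 0) {
            throw new IllegalArgumentException("expected --key value pairs");
        }
        Map<String, String> opts = new HashMap<>();
        for (int i = 0; i < args.length; i += 2) {
            if (!args[i].startsWith("--")) {
                throw new IllegalArgumentException("unexpected argument: " + args[i]);
            }
            opts.put(args[i].substring(2), args[i + 1]);
        }
        return opts;
    }

    // Generates `count` names; the same seed always yields the same output.
    static List<String> generate(int count, long seed) {
        Random rng = new Random(seed);
        List<String> names = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            String first = FIRST[rng.nextInt(FIRST.length)];
            String last = LAST[rng.nextInt(LAST.length)];
            names.add(first + " " + last);
        }
        return names;
    }

    public static void main(String[] args) {
        Map<String, String> opts = parseArgs(args);
        int count = Integer.parseInt(opts.getOrDefault("count", "10"));
        long seed = Long.parseLong(opts.getOrDefault("seed", "42"));
        for (String name : generate(count, seed)) {
            System.out.println(name);
        }
    }
}
```

Keeping the generation logic in a plain static method, with `main` as a thin wrapper, mirrors the library-plus-CLI-driver pattern discussed further down the thread: the host framework can call the library directly, and the CLI is only for local testing.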

On Mon, Aug 31, 2015 at 5:24 AM, Jay Vyas <[email protected]> wrote:

> Nate: Good idea to abstract the interface one level higher....
>
> How about a docker run command ? That is probably the easiest way for
> Linux folks to run one off Java apps nowadays.
>
> docker run bigtop/bigtop-data-gen --scheme weather --size 5GB --output data-dir --etc foo --etc bar
>
> I'm happy to curate such a docker image, I already am doing something like
> this in kube for bigtop-transaction-queue, which continuously pumps data
> generator outputs into a REST endpoint or file Queue... So it could be
> extended to support other generators.
>
> om> <[email protected]> wrote:
> >
> > Could picture at some point supporting something like this for non-jvm
> > folk just looking for test/demo data:
> >
> > apt-get install bigtop-data-gen
> > ~/ $ bigtop-data-gen --scheme weather --size 5GB --output data-dir --etc foo --etc bar
> >
> > -----Original Message-----
> > From: jay vyas [mailto:[email protected]]
> > Sent: Sunday, August 30, 2015 5:11 PM
> > To: [email protected]
> > Subject: Re: Proposal for "BigTop Data Generators"
> >
> > Hola nate. Well, here are the Use cases I know of that I have used the
> > data generators for.
> >
> > Dockerfile:
> >
> > (1) for testing kubernetes. For this, I just use transaction-queue
> > docker file.
> > (2) for testing GlusterFS small file workloads, maybe with other
> > analytics tools...
> >
> > Maven repo
> >
> > (3) Java mapreduce/ignite/spark applications, which can just add a mvn
> > repo when compiling. Java developers never add jars through RPM repos.
> >
> > RPM/DEB packages:
> >
> > I could see people using an RPM/DEB data generator, and I'm not against
> > it. But I simply don't know of any real world projects which *currently*
> > need RPM/Deb packages, which is why I haven't bothered to propose it as a
> > requirement. Nevertheless linux packages are always a welcome addition if
> > someone wants to create em !
> >
> >> On Sun, Aug 30, 2015 at 4:34 PM, <[email protected]> wrote:
> >>
> >> Would container be in addition to deb/rpm, or instead of? If latter
> >> can we do deb/rpm as base then have container either created from them
> >> or directly from artifacts?
> >>
> >> On test usage side, seems could probably break up tests into
> >> base/required and then optional/add-on tests/test-suites. Think
> >> remember seeing mention of certain tests that are failing at times on
> >> certain component(s) anyways in the core builds but don’t mean that
> >> the build is broken, so would make sense to have some clean up around
> >> those anyways.
> >>
> >> -----Original Message-----
> >> From: RJ Nowling [mailto:[email protected]]
> >> Sent: Sunday, August 30, 2015 1:11 PM
> >> To: [email protected]
> >> Subject: Re: Proposal for "BigTop Data Generators"
> >>
> >> I agree with the above. :)
> >>
> >> On Sun, Aug 30, 2015 at 11:19 AM, Jay Vyas <[email protected]> wrote:
> >>
> >>> Hi RJ.
> >>>
> >>> Maven repositories and docker containers for the transaction queue
> >>> are good enough IMO. That will give people a way to compose them in
> >>> different idioms (one for Java folks, another for broader Linux
> >>> audience).
> >>>
> >>> I think the lib designs are fairly intuitive. I would say that we
> >>> should constrain them all to being written in Java or Groovy to keep
> >>> the bigtop theme of "JVM for everything" :).
> >>>
> >>> Any particular questions you have around technical design can be
> >>> followed in a JIRA or else maybe a Readme spec that goes in a top
> >>> level of the data-generators dir...
> >>>
> >>>> On Aug 30, 2015, at 1:51 AM, RJ Nowling <[email protected]> wrote:
> >>>>
> >>>> I'd like to keep this conversation going.
> >>>>
> >>>> So here are a few discussion points:
> >>>>
> >>>> 1. How do we want to make the data generators available? Maven?
> >>>> RPMs and Debs?
> >>>>
> >>>> For now, I'm using a gradle multi-project build to easily build
> >>>> and install the BPS data generators and its libraries into a local
> >>>> maven repo. This makes development easy. Eventually, I would like
> >>>> to post binaries through Maven for easy integration by users.
> >>>> RPMs / Debs could be interesting since I use a pattern where the
> >>>> data generators are libraries (to support application integration /
> >>>> parallelization by the host framework) but also provide CLI drivers
> >>>> for local testing.
> >>>>
> >>>> 2. The idea of using the data generators as part of the smoke
> >>>> tests came up. Since there is concern about making the data
> >>>> generators required, we could offer the blueprints (BigPetStore)
> >>>> as optional smoke tests. Would that be a good compromise?
> >>>>
> >>>> 3. How will they be maintained?
> >>>>
> >>>> I'll certainly add myself to the maintainers list and will be
> >>>> taking responsibility. I'm happy to have others help as well if
> >>>> anyone wants to -- if not, that's cool, too.
> >>>>
> >>>> 4. Is anyone interested at all in discussing library APIs and designs?
> >>>> What about internal interfaces and such?
> >>>>
> >>>> My plan was to add at least one more data generator (weather
> >>>> simulator) to bigtop-data-generators in the short term. However,
> >>>> given the concerns raised by Cos (more discussion needed) and Olaf
> >>>> (don't want to force data generators on unsuspecting users ;) ), I
> >>>> would like to reach some consensus on what people are concerned
> >>>> about and solutions.
> >>>>
> >>>> On Thu, Aug 27, 2015 at 12:38 PM, Konstantin Boudnik
> >>>> <[email protected]> wrote:
> >>>>
> >>>>> Fine by me. I have linked this thread to the JIRA ticket that RJ
> >>>>> created, so we have a way to connect one to another ;)
> >>>>>
> >>>>>> On Thu, Aug 27, 2015 at 01:02PM, Olaf Flebbe wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> I am not confident that moving important design discussions with
> >>>>>> impact to the whole project to jira is a good idea.
> >>>>>>
> >>>>>> In the current JIRA Traffic storm it is not easy to identify and
> >>>>>> follow important tickets.
> >>>>>>
> >>>>>> Please keep discussions on the list or at least, please state on
> >>>>>> this list which Ticket to follow ...
> >>>>>>
> >>>>>> Olaf
> >>>>>>
> >>>>>>> On 26.08.2015 at 22:56, Konstantin Boudnik <[email protected]> wrote:
> >>>>>>>
> >>>>>>> On Wed, Aug 26, 2015 at 10:38PM, Olaf Flebbe wrote:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> Nice to have data generators in Bigtop.
> >>>>>>>>
> >>>>>>>> But please do not include it in bigtop_utils, since this
> >>>>>>>> package is mandatory. Not everyone needs a data generator.
> >>>>>>>
> >>>>>>> Yup. And let's move further design discussion to the JIRA!
> >>>>>>>
> >>>>>>>> Olaf
> >>>>>>>>
> >>>>>>>>> On 26.08.2015 at 11:25, Jay Vyas <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>> Publishing the jar to bigtops maven is probably a good first
> >>>>>>>>> step, Then apps can just include it as needed...?.
> >>>>>>>>>
> >>>>>>>>> I'm not against packaging if someone wants packages for this.
> >>>>>>>>> Maybe even include it in bigtop util ?
> >>>>>>>>>
> >>>>>>>>> Let's move to jira,
> >>>>>>>>>
> >>>>>>>>>> On Aug 25, 2015, at 9:41 PM, Konstantin Boudnik
> >>>>>>>>>> <[email protected]> wrote:
> >>>>>>>>>>
> >>>>>>>>>> It is pretty cool indeed!
> >>>>>>>>>>
> >>>>>>>>>> I wonder how it needs to be structured to be:
> >>>>>>>>>> - easy to access/use from other components wherever it is
> >>>>>>>>>>   needed
> >>>>>>>>>> - doesn't interfere with the rest of the stack
> >>>>>>>>>>
> >>>>>>>>>> I guess one possible way would be to implement the generator
> >>>>>>>>>> as a set of maven artifacts, that could be installed/consumed
> >>>>>>>>>> transparently by just declaring a dependency e.g as proposed
> >>>>>>>>>> via top-level component.
> >>>>>>>>>>
> >>>>>>>>>> Another way is to have a new package like we do for
> >>>>>>>>>> bigtop-utils and such.
> >>>>>>>>>>
> >>>>>>>>>> Perhaps this discussion should be moved to JIRA or shall we
> >>>>>>>>>> continue on the dev@ ??
> >>>>>>>>>>
> >>>>>>>>>> Cos
> >>>>>>>>>>
> >>>>>>>>>>> On Sun, Aug 23, 2015 at 11:53AM, RJ Nowling wrote:
> >>>>>>>>>>> Hi BigTop,
> >>>>>>>>>>>
> >>>>>>>>>>> I had a discussion with Jay yesterday, we'd like to propose
> >>>>>>>>>>> a new component for BigTop: BigTop Data Generators.
> >>>>>>>>>>>
> >>>>>>>>>>> BigTop Data Generators would consist of a common set of
> >>>>>>>>>>> libraries for building data generators and three example
> >>>>>>>>>>> data generators:
> >>>>>>>>>>>
> >>>>>>>>>>> * BigPetStore transaction generator (moved from
> >>>>>>>>>>>   BigPetStore)
> >>>>>>>>>>> * BigTop Bazaar -- attendee movement and interactions with
> >>>>>>>>>>>   booths on a showroom floor, at a conference, or at a mall
> >>>>>>>>>>> * BigTop Weatherman -- stochastic weather simulation
> >>>>>>>>>>>   (temperature, wind speed, wind chill, rainfall, etc.) per
> >>>>>>>>>>>   zip code. (From a model trained on NOAA historical
> >>>>>>>>>>>   weather data)
> >>>>>>>>>>>
> >>>>>>>>>>> We believe that creating a common set of libraries will
> >>>>>>>>>>> have several benefits including:
> >>>>>>>>>>>
> >>>>>>>>>>> * Easier for others to build their own data generators
> >>>>>>>>>>> * Make data generators smaller and easier to maintain
> >>>>>>>>>>> * Share improvements across the data generators
> >>>>>>>>>>>
> >>>>>>>>>>> More details on the libraries are below.
> >>>>>>>>>>>
> >>>>>>>>>>> BigPetStore will continue to focus on building and
> >>>>>>>>>>> maintaining blueprints, powered by the BigTop Data Generators.
> >>>>>>>>>>>
> >>>>>>>>>>> Our vision is that we get all of Apache coming to BigTop
> >>>>>>>>>>> for tools for building better, more comprehensive
> >>>>>>>>>>> blueprints. We want to support these efforts through data
> >>>>>>>>>>> generators and the initial set of blueprints we've been
> >>>>>>>>>>> building.
> >>>>>>>>>>>
> >>>>>>>>>>> If the community is generally in support of this, I can
> >>>>>>>>>>> create a top-level "bigtop-data-generators" directory and
> >>>>>>>>>>> put the data generators and libraries in there.
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks!
> >>>>>>>>>>>
> >>>>>>>>>>> RJ
> >>>>>>>>>>>
> >>>>>>>>>>> -------
> >>>>>>>>>>> Library details:
> >>>>>>>>>>>
> >>>>>>>>>>> So far, I've extracted the following common libraries:
> >>>>>>>>>>>
> >>>>>>>>>>> * Samplers -- provides classes for PDFs and various
> >>>>>>>>>>>   samplers
> >>>>>>>>>>> * Name generator -- data set and samplers for generating
> >>>>>>>>>>>   names
> >>>>>>>>>>> * Location data set -- data set and classes for US zip
> >>>>>>>>>>>   codes, their GPS coordinates, median household incomes,
> >>>>>>>>>>>   and population sizes
> >>>>>>>>>>> * Product generator -- library for enumerating products
> >>>>>>>>>>>   from a specification file. Comes with default
> >>>>>>>>>>>   specifications for BigPetStore
> >>>>>>>>>>>
> >>>>>>>>>>> I also expect that I'll add libraries for:
> >>>>>>>>>>>
> >>>>>>>>>>> * Particle simulation -- customer movement in a room
> >>>>>>>>>>> * Latent factor model generation -- generate latent
> >>>>>>>>>>>   factors and customer weights to create something like
> >>>>>>>>>>>   MovieLens data. Used in Bazaar for booth preferences and
> >>>>>>>>>>>   potentially in BigPetStore for customer item preferences
> >>>>>>>>>>>
> >>>>>>>>>>> Most of these libraries came out of the BigPetStore data
> >>>>>>>>>>> generator but the other generators have been refactored to
> >>>>>>>>>>> be based off the standard set of libraries.
> >
> > --
> > jay vyas
> >
>
