Thanks for everyone's input! I haven't added new code to the data generators (just refactorings and bug fixes) while this discussion has been ongoing. I'd like to pick up the work again.

I think we can agree that we won't:

* Add the data generators to required packages, so users aren't forced to install them
* Make Docker the only delivery mechanism

and that the data generators will have CLIs. The BigPetStore data generator already has CLI support built into the jar:

$ java -jar bigpetstore-data-generator-1.1.0-SNAPSHOT.jar outputDir nStores nCustomers nPurchasingModels simulationLength seed

I think packaging, Maven artifacts, Docker images, and integration with smoke tests can all be handled in subsequent JIRAs. If any of these interest you, please file the appropriate JIRA. :)

Does this sound good to everyone?

On Mon, Aug 31, 2015 at 7:08 PM, Konstantin Boudnik <[email protected]> wrote:

> On Mon, Aug 31, 2015 at 08:01PM, jay vyas wrote:
> > - I agree Gradle is better than "yet another script".
> >
> > - And docker container, even if suboptimal, as thin wrapper to gradle
> > container is all you need to deliver the data-generator to the masses so
> > they can try it w/ zero startup cost.
>
> That's true. But it didn't sound this way originally, hence I've asked.
>
> > On Mon, Aug 31, 2015 at 7:36 PM, Konstantin Boudnik <[email protected]> wrote:
> > >
> > > Why would we need yet another script (and potentially an extra readme to
> > > explain its command options) when we have the gradle?
> > >
> > > Cos
> > >
> > > On Mon, Aug 31, 2015 at 10:51PM, Olaf Flebbe wrote:
> > > > +1 to the CLI /shell script interface.
> > > >
> > > > If I can choose I like to have an apt-get install bigtop-datagenerator ,
> > > > running for instance
> > > >
> > > > bigtop-data-generator outputDir nStores nCustomers nPurchasingModels simulationLength seed
> > > >
> > > > I can help out with packaging if needed.
> > > >
> > > > Why should we use the docker indirection for a plain CLI file ? Of
> > > > course, we can provide a trivial Dockerfile to create a container
> > > > supplying a JVM and running the CLI ...
> > > > But I do not like to depend our services on docker registry more
> > > > than we do now.
> > > >
> > > > Olaf
> > > >
> > > > > On 31.08.2015 at 16:40, Evans Ye <[email protected]> wrote:
> > > > >
> > > > > I very much like the shell script wrapper and docker image idea, since
> > > > > that way we can integrate it directly with the bigtop provisioner, which
> > > > > yields a perfect UX for the whole thing. I think it's not too hard to do
> > > > > both; we just need to add a parameter to turn the script into daemon mode.
> > > > > I see lots of images doing it this way.
> > > > >
> > > > > docker run bigtop/bigtop-data-gen --scheme weather --size 5GB --output data-dir --etc foo --etc bar --daemon
> > > > >
> > > > > On Aug 31, 2015 at 9:06 PM, "RJ Nowling" <[email protected]> wrote:
> > > > >
> > > > >> The BigPetStore, Bazaar, and weather data generators have single-threaded
> > > > >> command-line interfaces. We could do the same with the smaller generators
> > > > >> (names, locations, etc.) if there is interest.
> > > > >>
> > > > >> On Mon, Aug 31, 2015 at 5:24 AM, Jay Vyas <[email protected]> wrote:
> > > > >>
> > > > >>> Nate: Good idea to abstract the interface one level higher....
> > > > >>>
> > > > >>> How about a docker run command ? That is probably the easiest way for
> > > > >>> Linux folks to run one off Java apps nowadays.
> > > > >>>
> > > > >>> docker run bigtop/bigtop-data-gen --scheme weather --size 5GB --output data-dir --etc foo --etc bar
> > > > >>>
> > > > >>> I'm happy to curate such a docker image, I already am doing something like
> > > > >>> this in kube for bigtop-transaction-queue, which continuously pumps data
> > > > >>> generator outputs into a REST endpoint or file queue... So it could be
> > > > >>> extended to support other generators.
> > > > >>>
> > > > >>>> <[email protected]> wrote:
> > > > >>>>
> > > > >>>> Could picture at some point supporting something like this for non-jvm
> > > > >>>> folk just looking for test/demo data:
> > > > >>>>
> > > > >>>> apt-get install bigtop-data-gen
> > > > >>>> ~/ $ bigtop-data-gen --scheme weather --size 5GB --output data-dir --etc foo --etc bar
> > > > >>>>
> > > > >>>> -----Original Message-----
> > > > >>>> From: jay vyas [mailto:[email protected]]
> > > > >>>> Sent: Sunday, August 30, 2015 5:11 PM
> > > > >>>> To: [email protected]
> > > > >>>> Subject: Re: Proposal for "BigTop Data Generators"
> > > > >>>>
> > > > >>>> Hola nate. Well, here are the use cases I know of that I have used the
> > > > >>>> data generators for.
> > > > >>>>
> > > > >>>> Dockerfile:
> > > > >>>>
> > > > >>>> (1) for testing kubernetes. For this, I just use the transaction-queue
> > > > >>>> docker file.
> > > > >>>> (2) for testing GlusterFS small file workloads, maybe with other
> > > > >>>> analytics tools...
> > > > >>>>
> > > > >>>> Maven repo:
> > > > >>>>
> > > > >>>> (3) Java MapReduce/ignite/spark applications, which can just add a mvn
> > > > >>>> repo when compiling. Java developers never add jars through RPM repos.
> > > > >>>>
> > > > >>>> RPM/DEB packages:
> > > > >>>>
> > > > >>>> I could see people using an RPM/DEB data generator, and I'm not against
> > > > >>>> it. But I simply don't know of any real world projects which *currently*
> > > > >>>> need RPM/Deb packages, which is why I haven't bothered to propose it as a
> > > > >>>> requirement. Nevertheless linux packages are always a welcome addition if
> > > > >>>> someone wants to create em !
> > > > >>>>
> > > > >>>>> On Sun, Aug 30, 2015 at 4:34 PM, <[email protected]> wrote:
> > > > >>>>>
> > > > >>>>> Would container be in addition to deb/rpm, or instead of? If latter
> > > > >>>>> can we do deb/rpm as base then have container either created from them
> > > > >>>>> or directly from artifacts?
> > > > >>>>>
> > > > >>>>> On test usage side, seems could probably break up tests into
> > > > >>>>> base/required and then optional/add-on tests/test-suites. Think
> > > > >>>>> remember seeing mention of certain tests that are failing at times on
> > > > >>>>> certain component(s) anyways in the core builds but don’t mean that
> > > > >>>>> the build is broken, so would make sense to have some clean up around
> > > > >>>>> those anyways.
> > > > >>>>>
> > > > >>>>> -----Original Message-----
> > > > >>>>> From: RJ Nowling [mailto:[email protected]]
> > > > >>>>> Sent: Sunday, August 30, 2015 1:11 PM
> > > > >>>>> To: [email protected]
> > > > >>>>> Subject: Re: Proposal for "BigTop Data Generators"
> > > > >>>>>
> > > > >>>>> I agree with the above. :)
> > > > >>>>>
> > > > >>>>> On Sun, Aug 30, 2015 at 11:19 AM, Jay Vyas <[email protected]> wrote:
> > > > >>>>>
> > > > >>>>>> Hi RJ.
> > > > >>>>>>
> > > > >>>>>> Maven repositories and docker containers for the transaction queue
> > > > >>>>>> are good enough IMO. That will give people a way to compose them in
> > > > >>>>>> different idioms (one for Java folks, another for broader Linux
> > > > >>>>>> audience).
> > > > >>>>>>
> > > > >>>>>> I think the lib designs are fairly intuitive. I would say that we
> > > > >>>>>> should constrain them all to being written in Java or Groovy to keep
> > > > >>>>>> the bigtop theme of "JVM for everything" :).
> > > > >>>>>>
> > > > >>>>>> Any particular questions you have around technical design can be
> > > > >>>>>> followed in a JIRA or else maybe a Readme spec that goes in a top
> > > > >>>>>> level of the data-generators dir...
> > > > >>>>>>
> > > > >>>>>>> On Aug 30, 2015, at 1:51 AM, RJ Nowling <[email protected]> wrote:
> > > > >>>>>>>
> > > > >>>>>>> I'd like to keep this conversation going.
> > > > >>>>>>>
> > > > >>>>>>> So here are a few discussion points:
> > > > >>>>>>>
> > > > >>>>>>> 1. How do we want to make the data generators available? Maven?
> > > > >>>>>>> RPMs and Debs?
> > > > >>>>>>>
> > > > >>>>>>> For now, I'm using a gradle multi-project build to easily build
> > > > >>>>>>> and install the BPS data generators and their libraries into a local
> > > > >>>>>>> maven repo. This makes development easy. Eventually, I would like to
> > > > >>>>>>> post binaries through Maven for easy integration by users. RPMs / Debs
> > > > >>>>>>> could be interesting since I use a pattern where the data generators are
> > > > >>>>>>> libraries (to support application integration / parallelization by
> > > > >>>>>>> the host framework) but also provide CLI drivers for local testing.
> > > > >>>>>>>
> > > > >>>>>>> 2. The idea of using the data generators as part of the smoke
> > > > >>>>>>> tests came up. Since there is concern about making the data
> > > > >>>>>>> generators required, we could offer the blueprints (BigPetStore)
> > > > >>>>>>> as optional smoke tests. Would that be a good compromise?
> > > > >>>>>>>
> > > > >>>>>>> 3. How will they be maintained?
> > > > >>>>>>>
> > > > >>>>>>> I'll certainly add myself to the maintainers list and will be
> > > > >>>>>>> taking responsibility. I'm happy to have others help as well if
> > > > >>>>>>> anyone wants to -- if not, that's cool, too.
> > > > >>>>>>>
> > > > >>>>>>> 4. Is anyone interested at all in discussing library APIs and designs?
> > > > >>>>>>> What about internal interfaces and such?
> > > > >>>>>>>
> > > > >>>>>>> My plan was to add at least one more data generator (weather
> > > > >>>>>>> simulator) to bigtop-data-generators in the short term. However, given
> > > > >>>>>>> the concerns raised by Cos (more discussion needed) and Olaf (don't
> > > > >>>>>>> want to force data generators on unsuspecting users ;) ), I would
> > > > >>>>>>> like to reach some consensus on what people are concerned about and
> > > > >>>>>>> solutions.
> > > > >>>>>>>
> > > > >>>>>>> On Thu, Aug 27, 2015 at 12:38 PM, Konstantin Boudnik <[email protected]> wrote:
> > > > >>>>>>>
> > > > >>>>>>>> Fine by me. I have linked this thread to the JIRA ticket that RJ
> > > > >>>>>>>> created, so we have a way to connect one to another ;)
> > > > >>>>>>>>
> > > > >>>>>>>>> On Thu, Aug 27, 2015 at 01:02PM, Olaf Flebbe wrote:
> > > > >>>>>>>>> Hi,
> > > > >>>>>>>>>
> > > > >>>>>>>>> I am not confident that moving important design discussions with
> > > > >>>>>>>>> impact to the whole project to jira is a good idea.
> > > > >>>>>>>>>
> > > > >>>>>>>>> In the current JIRA Traffic storm it is not easy to identify and
> > > > >>>>>>>>> follow important tickets.
> > > > >>>>>>>>>
> > > > >>>>>>>>> Please keep discussions on the list or at least, please state on
> > > > >>>>>>>>> this list which Ticket to follow ...
> > > > >>>>>>>>>
> > > > >>>>>>>>> Olaf
> > > > >>>>>>>>>
> > > > >>>>>>>>>> On 26.08.2015 at 22:56, Konstantin Boudnik <[email protected]> wrote:
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> On Wed, Aug 26, 2015 at 10:38PM, Olaf Flebbe wrote:
> > > > >>>>>>>>>>> Hi,
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Nice to have data generators in Bigtop.
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> But please do not include it in bigtop_utils, since this
> > > > >>>>>>>>>>> package is mandatory. Not everyone needs a data generator.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Yup. And let's move further design discussion to the JIRA!
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>> Olaf
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>> On 26.08.2015 at 11:25, Jay Vyas <[email protected]> wrote:
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Publishing the jar to bigtop's maven is probably a good first
> > > > >>>>>>>>>>>> step. Then apps can just include it as needed...?
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> I'm not against packaging if someone wants packages for this.
> > > > >>>>>>>>>>>> Maybe even include it in bigtop util ?
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Let's move to jira,
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>> On Aug 25, 2015, at 9:41 PM, Konstantin Boudnik <[email protected]> wrote:
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> It is pretty cool indeed!
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> I wonder how it needs to be structured to be:
> > > > >>>>>>>>>>>>> - easy to access/use from other components wherever it is needed
> > > > >>>>>>>>>>>>> - doesn't interfere with the rest of the stack
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> I guess one possible way would be to implement the generator
> > > > >>>>>>>>>>>>> as a set of maven artifacts, that could be installed/consumed
> > > > >>>>>>>>>>>>> transparently by just declaring a dependency e.g. as proposed via
> > > > >>>>>>>>>>>>> top-level component.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Another way is to have a new package like we do for
> > > > >>>>>>>>>>>>> bigtop-utils and such.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Perhaps this discussion should be moved to JIRA or shall we
> > > > >>>>>>>>>>>>> continue on the dev@ ??
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Cos
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> On Sun, Aug 23, 2015 at 11:53AM, RJ Nowling wrote:
> > > > >>>>>>>>>>>>>> Hi BigTop,
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> I had a discussion with Jay yesterday, we'd like to propose
> > > > >>>>>>>>>>>>>> a new component for BigTop: BigTop Data Generators.
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> BigTop Data Generators would consist of a common set of
> > > > >>>>>>>>>>>>>> libraries for building data generators and three example data
> > > > >>>>>>>>>>>>>> generators:
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> * BigPetStore transaction generator (moved from BigPetStore)
> > > > >>>>>>>>>>>>>> * BigTop Bazaar -- attendee movement and interactions with
> > > > >>>>>>>>>>>>>>   booths on a showroom floor, at a conference, or at a mall
> > > > >>>>>>>>>>>>>> * BigTop Weatherman -- stochastic weather simulation
> > > > >>>>>>>>>>>>>>   (temperature, wind speed, wind chill, rainfall, etc.) per zip
> > > > >>>>>>>>>>>>>>   code. (From a model trained on NOAA historical weather data)
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> We believe that creating a common set of libraries will
> > > > >>>>>>>>>>>>>> have several benefits including:
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> * Easier for others to build their own data generators
> > > > >>>>>>>>>>>>>> * Make data generators smaller and easier to maintain
> > > > >>>>>>>>>>>>>> * Share improvements across the data generators
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> More details on the libraries are below.
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> BigPetStore will continue to focus on building and
> > > > >>>>>>>>>>>>>> maintaining blueprints, powered by the BigTop Data Generators.
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Our vision is that we get all of Apache coming to BigTop
> > > > >>>>>>>>>>>>>> for tools for building better, more comprehensive blueprints.
> > > > >>>>>>>>>>>>>> We want to support these efforts through data generators and
> > > > >>>>>>>>>>>>>> the initial set of blueprints we've been building.
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> If the community is generally in support of this, I can
> > > > >>>>>>>>>>>>>> create a top-level "bigtop-data-generators" directory and put
> > > > >>>>>>>>>>>>>> the data generators and libraries in there.
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Thanks!
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> RJ
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> -------
> > > > >>>>>>>>>>>>>> Library details:
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> So far, I've extracted the following common libraries:
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> * Samplers -- provides classes for PDFs and various samplers
> > > > >>>>>>>>>>>>>> * Name generator -- data set and samplers for generating names
> > > > >>>>>>>>>>>>>> * Location data set -- data set and classes for US zip
> > > > >>>>>>>>>>>>>>   codes, their GPS coordinates, median household incomes, and
> > > > >>>>>>>>>>>>>>   population sizes
> > > > >>>>>>>>>>>>>> * Product generator -- library for enumerating products
> > > > >>>>>>>>>>>>>>   from a specification file. Comes with default
> > > > >>>>>>>>>>>>>>   specifications for BigPetStore
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> I also expect that I'll add libraries for:
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> * Particle simulation -- customer movement in a room
> > > > >>>>>>>>>>>>>> * Latent factor model generation -- generate latent
> > > > >>>>>>>>>>>>>>   factors and customer weights to create something like
> > > > >>>>>>>>>>>>>>   MovieLens data. Used in Bazaar for booth preferences and
> > > > >>>>>>>>>>>>>>   potentially in BigPetStore for customer item preferences
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Most of these libraries came out of the BigPetStore data
> > > > >>>>>>>>>>>>>> generator but the other generators have been refactored to be
> > > > >>>>>>>>>>>>>> based off the standard set of libraries.
> > > > >>>>
> > > > >>>> --
> > > > >>>> jay vyas
> >
> > --
> > jay vyas
