Yes. I'll be out Monday, but around the rest of the week to discuss.

S
On Fri, Jan 13, 2017 at 12:14 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

Hi guys,

Sorry I'm a bit late on this thread.

I think we might move forward step by step. I started to prepare the ITests for the Redis and RabbitMQ IOs. I will submit the PRs.

Agree with Stephen — we've got an agreement, so I think it makes sense to move forward.

Stephen, would you have some time next week to discuss and move forward on that?

Thanks!
Regards
JB

On 01/13/2017 03:53 AM, Stephen Sisk wrote:

Thanks, that helps. I've been thinking of this mostly for connection options, which I'm hoping should be pretty generic.

Do we have known examples that we will want to use this with today, or is this hypothetical? If it's hypothetical for now, would it make sense to defer making changes to TestPipeline?

I suspect I'm over-fixating on one part — basically it sounds like we've got an agreement. I'm excited about this, and I'll plan on having ITs that assume they get connection info from pipeline options.

S

On Thu, Jan 12, 2017 at 6:04 PM Lukasz Cwik <lc...@google.com.invalid> wrote:

Imagine you had a KafkaReadIT and a KafkaWriteIT. Both ITs share the same Kafka configuration options (host, topic, ...) found in KafkaITPipelineOptions, yet KafkaReadIT also needs an additional parameter (like the position to start reading the Kafka topic from) found in KafkaReadITPipelineOptions, which extends PipelineOptions.

On Thu, Jan 12, 2017 at 5:56 PM, Stephen Sisk <s...@google.com.invalid> wrote:

I see the need for/like KinesisIOTestPipelineOptions — it would allow all ITs/perf tests that need a Kinesis connection to get one.

What problem are we trying to solve with KinesisIOTest1PipelineOptions? This presumes that there would also be a KinesisIOTest2PipelineOptions? What would be different between the two? Maybe it would help if we had a scenario and used "real" names (i.e., what we'd actually call the variables).

S

On Thu, Jan 12, 2017 at 5:47 PM Jason Kuster <jasonkus...@google.com.invalid> wrote:

Stephen, I agree with you about option 2 — I think it strikes the right balance. To your question at the end, Luke essentially answered it, but to elaborate: we could make e.g. KinesisIOTestPipelineOptions, which extends TestPipelineOptions, and then KinesisIOTest1PipelineOptions, which extends KinesisIOTestPipelineOptions (whew, that's a workout for the fingers). It requires a change to how we parse PipelineOptions inside of TestPipeline.java, which means we lose a bit of user-friendliness, but it enables this scenario, which I think is worth the tradeoff.

On Tue, Jan 10, 2017 at 4:23 PM, Lukasz Cwik <lc...@google.com.invalid> wrote:

PipelineOptions allows you to aggregate common options within a parent interface, with child interfaces extending and inheriting those options. For example, you can have MyIOOptionsTest1 which extends MyIOOptions, and also MyIOOptionsTest2 which extends MyIOOptions. MyIOOptions can have all the shared fields, such as port or host, and then MyIOOptionsTest1 can have specific parameters for Test1 and MyIOOptionsTest2 can have specific parameters for Test2.

On Tue, Jan 10, 2017 at 4:11 PM, Stephen Sisk <s...@google.com.invalid> wrote:

Thanks, Jason, for doing this research! There are a lot of options.
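The inheritance pattern Lukasz describes could be sketched roughly as below. All names here are hypothetical; in Beam itself these interfaces would extend PipelineOptions and be instantiated by PipelineOptionsFactory from command-line args, rather than the hand-written stub used in this sketch.

```java
/** Options shared by all Kafka ITs (hypothetical names, after Lukasz's example). */
interface KafkaITPipelineOptions {
    String getKafkaHost();
    String getKafkaTopic();
}

/** Read-specific options layered on top of the shared ones. */
interface KafkaReadITPipelineOptions extends KafkaITPipelineOptions {
    long getStartPosition();
}

public class OptionsHierarchySketch {
    /** Hand-written stub; Beam would generate an implementation from args instead. */
    static KafkaReadITPipelineOptions readOptions() {
        return new KafkaReadITPipelineOptions() {
            @Override public String getKafkaHost() { return "1.2.3.4:9092"; }
            @Override public String getKafkaTopic() { return "beam-it-topic"; }
            @Override public long getStartPosition() { return 0L; }
        };
    }

    public static void main(String[] args) {
        KafkaReadITPipelineOptions read = readOptions();
        // A KafkaWriteIT would accept the supertype and see only the shared fields,
        // so both ITs can share one Kafka configuration.
        KafkaITPipelineOptions shared = read;
        System.out.println(shared.getKafkaHost() + " " + read.getStartPosition());
    }
}
```

The point of the hierarchy is that tests needing only the shared config accept the parent interface, while a read-specific test downcasts to the child.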
You mentioned "Steven" — I assume you're talking about me :)

As you mention in your goals, we want this to work both for individual devs working on a particular store and for the Beam CI system. I'm inclined to favor making this pretty easy for individual devs, as long as it doesn't come at the cost of making CI config painful. Given that, I suspect we'd want something that's straightforward to discover how to configure, gives an obvious error message if misconfigured, and is easy to learn.

In that world, I would lean towards options 2 & 3. I think a file that has to be maintained/configured separately outside of the repository isn't the most developer-friendly option. That may be somewhat of a workflow thing. And if it were super-important to keep the passwords secret, then a file outside of the repository might be the only option that would work — however, I think we can simplify the problem and secure our test servers by keeping them on a private network. (I don't think we are an important target, so I'm not too worried about defense in depth.)

Between options 2 & 3 I would favor option 2 — I think having defaults pre-set to work in the Beam CI environment will lead to a lot of potential confusion on the part of developers, since it would let them run the test and then only give them errors about connecting to a server. I think it'd be better to get errors about not having pipeline options set, or about obviously wrong/default pipeline option values, rather than "correct, but not for your environment" settings. For example, if we have valid defaults and a user only sets some of the values (i.e., sets the IP but not the port), it'd be trickier to detect whether the value being used is valid. I'd want the test to be able to show users an error that the option wasn't set, rather than have it try to connect and time out because the port isn't correct.

I also don't think that having per-IT options is a good idea. If 2 different ITs can share a data store, we should be able to pass them the same config — that also makes me strongly favor an option where we have a shared set of options. Do we *have* to make it part of TestPipelineOptions? Can we make something like a BeamIOTestPipelineOption?

This generally leads me to favor option 2.

S

On Tue, Jan 10, 2017 at 3:39 PM Jason Kuster <jasonkus...@google.com.invalid> wrote:

Hi all,

Following up on some of the discussions already on-list, I wanted to solicit some more feedback about implementation details regarding the IO integration tests.

As it currently stands, we mostly have IO ITs for GCP-based IO, which our GCP-based Jenkins executors handle natively, but as our integration test coverage begins to expand, we're going to run into several of the problems relevant to what Steven is doing with hosting data stores for use by ITs. I wanted to get people's feedback about how to handle passing credentials to the ITs. We have a couple of options, motivated by some goals.

Goals:

* Has to work in the Apache Beam CI environment.
* Has to run on dev machines (without access to the Beam CI environment).
* Only one way of passing datastore config.
* An individual IT fails fast if it is run without valid config.
* IO performance tests will have a validation component (this implies we need to run the IO ITs, not just the IO IT pipelines).
* Devs working on an individual IO transform can run integration & perf tests without recreating the data stores every time.
* Devs working on a runner's IO can run all the IO integration & perf tests. They may have to recreate the data stores every time (or possibly have a manual config that helps with this). It's okay if this world is a bit awkward; it just needs to be possible.

Option 1: IO Configuration File

The first option is to read all credentials from some file stored on disk. We can define a location for an (XML, JSON, YAML, etc.) file which we read in each IT to find the credentials that IT needs. This method has a couple of upsides and a couple of downsides.

* Upsides
  * Passing credentials to ITs, and adding new credentials, is relatively easy.
  * Individual users can spin up their own data stores, put the credentials in the file, run their ITs, and have things just work.
* Downsides
  * Relying on a file, especially a file not checked in to the repository (to prevent people from accidentally sharing credentials to their data store, etc.), is fragile and can lead to some confusing failure cases.
  * ITs are supposed to be self-contained; using a file on disk makes things like running them in CI harder.
  * It seems like datastore location, username, and password are a better fit for the IT PipelineOptions anyway.

Option 2: TestPipelineOptions

Another option is to specify them as general PipelineOptions on TestPipelineOptions and then to build the specific IT's options from there. For example, say we have MyStoreIT1, MyStoreIT2, and MyStoreIT3. We could specify inside of TestPipelineOptions some options like "MyStoreIP", "MyStoreUsername", and "MyStorePassword", and then the command for invoking them would look like (omitting some irrelevant things):

mvn clean verify -DskipITs=false -DbeamTestPipelineOptions='[..., "--MyStoreIP=1.2.3.4", "--MyStoreUsername=beamuser", "--MyStorePassword=hunter2"]'

* Upsides
  * The test is self-contained — no dependency on an external file, and all relevant things can be specified on the command line; easier for users and CI.
  * Passing credentials to ITs via PipelineOptions feels better.
* Downsides
  * Harder to pass different credentials to one specific IT; e.g., I want MyStoreIT1 and 2 to run against 1.2.3.4, but MyStoreIT3 to run against 9.8.7.6.
  * Investing in this pattern means a proliferation of TestPipelineOptions. Potentially bad, especially for a CI suite running a large number of ITs — the size of the command-line args may become unmanageable with 5+ data stores.
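Stephen's preference for an obvious "option wasn't set" error over a connection timeout — and the fail-fast goal above — could look roughly like this under option 2. This is a hypothetical sketch with hand-rolled parsing, not Beam's actual TestPipeline/PipelineOptionsFactory machinery:

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: fail fast when a required IT option is missing (hypothetical names). */
public class FailFastOptions {
    /** Parse "--key=value" args into a map, roughly as test options would arrive. */
    static Map<String, String> parse(String[] args) {
        Map<String, String> opts = new HashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (arg.startsWith("--") && eq > 2) {
                opts.put(arg.substring(2, eq), arg.substring(eq + 1));
            }
        }
        return opts;
    }

    /** Throw immediately, with a clear message, if a required option was not set. */
    static String require(Map<String, String> opts, String key) {
        String value = opts.get(key);
        if (value == null || value.isEmpty()) {
            throw new IllegalArgumentException(
                "Missing required IT option --" + key
                + "; pass it via -DbeamTestPipelineOptions");
        }
        return value;
    }

    public static void main(String[] args) {
        Map<String, String> opts =
            parse(new String[] {"--MyStoreIP=1.2.3.4", "--MyStoreUsername=beamuser"});
        // Options that were set come back as values...
        assert require(opts, "MyStoreIP").equals("1.2.3.4");
        // ...and an unset one fails before any connection attempt can time out.
        boolean failedFast = false;
        try {
            require(opts, "MyStorePassword");
        } catch (IllegalArgumentException expected) {
            failedFast = true;
        }
        assert failedFast;
    }
}
```

The key property is that validation happens at option-parsing time, so a misconfigured dev machine produces "missing --MyStorePassword" rather than a slow, opaque network failure.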
Option 3: Individual IT Options

The last option I can think of is to specify the options directly on the IT's own options, e.g. MyStoreIT1Options, and set defaults which work well for CI. This means CI could run an entire suite of ITs without specifying any arguments, trusting that the ITs' defaults will work, but it means an individual developer is potentially able to run only one IT at a time, since it will be impossible to override all options from the command line.

* Upsides
  * The test is still self-contained, and even more so — it's possible to specify args targeted at one IT in particular.
  * Args are specified right where they're used; much smaller chance of confusion or mistakes.
  * Easiest for CI — as long as the defaults for data store auth and location are correct from the perspective of the Jenkins executor, it can essentially just turn all ITs on and run them as is.
* Downsides
  * Hardest for individual developers to run an entire suite of ITs — since the defaults are configured for the CI environment, they will likely fail when run on the user's machine, resulting in annoyance for the user.

If anyone has thoughts on these, please let me know.
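For contrast, a minimal sketch of option 3's defaults behavior. The property name is hypothetical, and system properties stand in here for what Beam options would express with default-value annotations:

```java
/** Hypothetical sketch of option 3: per-IT options with CI-friendly defaults. */
public class MyStoreIT1OptionsSketch {
    // The default points at the (hypothetical) CI-hosted data store; a dev
    // must override it per IT, e.g. -DMyStoreIT1.ip=10.0.0.5.
    static String storeIp() {
        return System.getProperty("MyStoreIT1.ip", "1.2.3.4");
    }

    public static void main(String[] args) {
        // With nothing set, CI's default wins — exactly the failure mode Stephen
        // flags: a dev's run silently uses a value that is "correct, but not for
        // your environment", and only fails later at connection time.
        assert storeIp().equals("1.2.3.4");
        // An explicit override redirects just this one IT.
        System.setProperty("MyStoreIT1.ip", "10.0.0.5");
        assert storeIp().equals("10.0.0.5");
    }
}
```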
Best,

Jason

--
-------
Jason Kuster
Apache Beam (Incubating) / Google Cloud Dataflow

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com