Hi guys,
sorry I'm a bit late on this thread.
I think we can move forward step by step. I started to prepare the
ITests for the Redis and RabbitMQ IOs; I will submit the PRs.
I agree with Stephen: we have an agreement, so it makes sense to move
forward.
Stephen, would you have some time next week to discuss and move things
forward?
Thanks!
Regards
JB
On 01/13/2017 03:53 AM, Stephen Sisk wrote:
Thanks, that helps. I've been thinking of this mostly for connection
options, which I'm hoping should be pretty generic.
Do we have known examples that we will want to use this with today, or is
this a hypothetical? If it's hypothetical for now, would it make sense to
defer making changes to TestPipeline?
I suspect I'm fixating too much on one part - basically, it sounds like
we've got an agreement. I'm excited about this, and I'll plan on having ITs
that assume they get connection info from pipeline options.
S
On Thu, Jan 12, 2017 at 6:04 PM Lukasz Cwik <[email protected]>
wrote:
Imagine you had a KafkaReadIT and a KafkaWriteIT. Both ITs share the same
Kafka configuration options (host, topic, ...) found in
KafkaITPipelineOptions, yet KafkaReadIT also needs an additional parameter
(like the position to start reading the Kafka topic from) found in
KafkaReadITPipelineOptions, which extends KafkaITPipelineOptions.
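A minimal sketch of that hierarchy (the interfaces below are illustrative,
not existing Beam code):

    import org.apache.beam.sdk.options.Description;
    import org.apache.beam.sdk.options.PipelineOptions;

    /** Shared Kafka options used by all Kafka ITs (illustrative names). */
    public interface KafkaITPipelineOptions extends PipelineOptions {
      @Description("Kafka bootstrap server, host:port")
      String getKafkaBootstrapServer();
      void setKafkaBootstrapServer(String value);

      @Description("Kafka topic the ITs read/write")
      String getKafkaTopic();
      void setKafkaTopic(String value);
    }

    /** Read-specific options inherit all of the shared Kafka options. */
    public interface KafkaReadITPipelineOptions extends KafkaITPipelineOptions {
      @Description("Position in the topic to start reading from")
      Long getStartOffset();
      void setStartOffset(Long value);
    }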
On Thu, Jan 12, 2017 at 5:56 PM, Stephen Sisk <[email protected]>
wrote:
I see the need for / like KinesisIOTestPipelineOptions - it would allow all
ITs/perf tests that need a Kinesis connection to get one.
What problem are we trying to solve with KinesisIOTest1PipelineOptions?
This presumes that there would also be a KinesisIOTest2PipelineOptions?
What would be different between the two? Maybe it would help if we had a
scenario and used "real" names (i.e., what we'd actually call the variables).
S
On Thu, Jan 12, 2017 at 5:47 PM Jason Kuster <[email protected]>
wrote:
Stephen, I agree with you about option 2 -- I think it strikes the right
balance. To your question at the end, Luke essentially answered it, but to
elaborate: we could make e.g. KinesisIOTestPipelineOptions which extends
TestPipelineOptions, and then KinesisIOTest1PipelineOptions which extends
KinesisIOTestPipelineOptions (whew, that's a workout for the fingers). It
requires a change to how we parse PipelineOptions inside of
TestPipeline.java, which means we lose a bit of user-friendliness, but
enables this scenario, which I think is worth the tradeoff.
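For reference, a rough sketch of how an IT could pick such options up --
TestPipeline.testingPipelineOptions() already parses the
-DbeamTestPipelineOptions system property; the KinesisIOTestPipelineOptions
interface itself is hypothetical:

    import org.apache.beam.sdk.testing.TestPipeline;
    import org.junit.Test;

    public class KinesisIOReadIT {
      @Test
      public void testReadFromKinesis() {
        // Parse -DbeamTestPipelineOptions, then view the result through
        // the hypothetical KinesisIOTestPipelineOptions interface.
        KinesisIOTestPipelineOptions options =
            TestPipeline.testingPipelineOptions()
                .as(KinesisIOTestPipelineOptions.class);
        // ... connect to Kinesis using the parsed options ...
      }
    }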
On Tue, Jan 10, 2017 at 4:23 PM, Lukasz Cwik <[email protected]>
wrote:
PipelineOptions allows you to aggregate common options within a parent
interface, with child interfaces extending and inheriting those options.
For example, you can have MyIOOptionsTest1 which extends MyIOOptions and
also MyIOOptionsTest2 which extends MyIOOptions. MyIOOptions can have all
the shared fields such as port or host or whatever, and then
MyIOOptionsTest1 can have specific parameters for Test1 and
MyIOOptionsTest2 can have specific parameters for Test2.
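To make that concrete, a small sketch using the placeholder names above
(PipelineOptionsFactory is the existing parsing entry point):

    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    interface MyIOOptions extends PipelineOptions {
      String getHost();
      void setHost(String value);
      Integer getPort();
      void setPort(Integer value);
    }

    interface MyIOOptionsTest1 extends MyIOOptions {
      String getTest1Param();
      void setTest1Param(String value);
    }

    public class OptionsInheritanceDemo {
      public static void main(String[] args) {
        // Register the child interface so the parser accepts its flags.
        PipelineOptionsFactory.register(MyIOOptionsTest1.class);
        MyIOOptionsTest1 options = PipelineOptionsFactory
            .fromArgs("--host=localhost", "--port=1234", "--test1Param=foo")
            .as(MyIOOptionsTest1.class);
        // host and port come from the shared parent interface.
        System.out.println(options.getHost() + ":" + options.getPort());
      }
    }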
On Tue, Jan 10, 2017 at 4:11 PM, Stephen Sisk <[email protected]>
wrote:
Thanks Jason for doing this research! There are a lot of options.
You mentioned "Steven" - I assume you're talking about me :)
As you mention in your goals, we want this to work for both individual devs
working on a particular store, as well as the Beam CI system. I'm inclined
to favor making this pretty easy for individual devs, as long as it doesn't
come at the cost of making CI config painful. Given that, I suspect we'd
want something that's straightforward to discover how to configure, that
gives an obvious error message if it's misconfigured, and that is easy to
learn.
In that world, I would lean towards options 2 & 3. I think a file that has
to be maintained/configured separately outside of the repository isn't the
most developer-friendly option. That may be somewhat of a workflow thing.
And if it was super-important to keep the passwords secret, then a file
outside of the repository might be the only option that would work -
however, I think we can simplify the problem and secure our test servers
by keeping them on a private network. (I don't think we are an important
target, so I'm not too worried about defense in depth.)
Between options 2 & 3, I would favor option 2 - I think having defaults
pre-set to work in the Beam CI environment will lead to a lot of potential
confusion on the part of developers, since it would let them run the test
and then only give them errors about connecting to a server. I think it'd
be better to get errors about not having pipeline options set, or errors
about obviously wrong/default pipeline option values, rather than
"correct, but not for your environment" settings. For example, if we have
valid defaults and a user only sets some of the values (i.e., sets IP but
not port), it'd be trickier to detect when the value being used is valid.
I'd want the test to be able to show users an error that the option wasn't
set, rather than have it try to connect and time out b/c the port isn't
correct.
I also don't think that having per-IT options is a good idea. If 2
different ITs can share a data store, we should be able to pass them the
same config - that also makes me strongly favor an option where we have a
shared set of options. Do we *have* to make it part of
TestPipelineOptions? Can we make something like a
BeamIOTestPipelineOptions?
This generally leads me to favor option 2.
S
On Tue, Jan 10, 2017 at 3:39 PM Jason Kuster <[email protected]>
wrote:
Hi all,
Following up on some of the discussions already on-list, I wanted to
solicit some more feedback about some implementation details regarding the
IO Integration Tests.
As it currently stands, we mostly have IO ITs for GCP-based IO, which our
GCP-based Jenkins executors handle natively, but as our integration test
coverage begins to expand, we're going to run into several of the problems
relevant to what Steven is doing with hosting data stores for use by ITs.
I wanted to get people's feedback about how to handle passing credentials
to the ITs. We have a couple options, motivated by some goals.
Goals:
* Has to work in the Apache Beam CI environment.
* Has to run on dev machines (w/o access to the Beam CI environment).
* One way of passing datastore config only.
* An individual IT fails fast if it is run without valid config.
* IO performance tests will have a validation component (this implies we
need to run the IO ITs, not just the IO IT pipelines).
* Devs working on an individual IO transform can run integration & perf
tests without recreating the data stores every time.
* Devs working on a runner's IO can run all the IO integration & perf
tests. They may have to recreate the data stores every time (or possibly
have a manual config that helps with this). It's okay if this world is a
bit awkward; it just needs to be possible.
Option 1: IO Configuration File
The first option is to read all credentials from some file stored on disk.
We can define a location for an (XML, JSON, YAML, etc.) file which we can
read in each IT to find the credentials that IT needs. This method has a
couple of upsides and a couple of downsides. (A minimal sketch of this
approach appears after the list below.)
* Upsides
  * Passing credentials to ITs, and adding new credentials, is relatively
    easy.
  * Individual users can spin up their own data stores, put the
    credentials in the file, run their ITs, and have things just work.
* Downsides
  * Relying on a file, especially a file not checked in to the repository
    (to prevent people from accidentally sharing credentials to their data
    store, etc.), is fragile and can lead to some confusing failure cases.
  * ITs are supposed to be self-contained; using a file on disk makes
    things like running them in CI harder.
  * It seems like datastore location, username, and password are things
    that are a better fit for the IT PipelineOptions anyway.
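As a purely illustrative sketch of the file-based approach (the system
property, path, and property names are invented for the example):

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.Properties;

    public class ITCredentialsFile {
      // Load data-store credentials from a file kept outside the repo;
      // the default path here is invented for the example.
      public static Properties load() throws IOException {
        String path = System.getProperty("beam.it.credentialsFile",
            System.getProperty("user.home") + "/.beam-it-credentials.properties");
        Properties credentials = new Properties();
        try (FileInputStream in = new FileInputStream(path)) {
          credentials.load(in);
        }
        return credentials; // e.g. credentials.getProperty("mystore.host")
      }
    }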
Option 2: TestPipelineOptions
Another option is to specify them as general PipelineOptions on
TestPipelineOptions and then to build the specific IT's options from
there. For example, say we have MyStoreIT1, MyStoreIT2, and MyStoreIT3. We
could specify inside of TestPipelineOptions some options like "MyStoreIP",
"MyStoreUsername", and "MyStorePassword", and then the command for
invoking them would look like (omitting some irrelevant things):
mvn clean verify -DskipITs=false -DbeamTestPipelineOptions='[...,
"--MyStoreIP=1.2.3.4", "--MyStoreUsername=beamuser",
"--MyStorePassword=hunter2"]'
(A sketch of what those options could look like appears after the list
below.)
* Upsides
  * Test is self-contained -- no dependency on an external file, and all
    relevant things can be specified on the command line; easier for users
    and CI.
  * Passing credentials to ITs via PipelineOptions feels better.
* Downsides
  * Harder to pass different credentials to one specific IT; e.g. I want
    MyStoreIT1 and 2 to run against 1.2.3.4, but MyStoreIT3 to run against
    9.8.7.6.
  * Investing in this pattern means a proliferation of
    TestPipelineOptions. Potentially bad, especially for a CI suite
    running a large number of ITs -- the size of command line args may
    become unmanageable with 5+ data stores.
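For what it's worth, a hedged sketch of what those shared options could
look like -- MyStoreTestOptions is invented, and note that Beam would
actually derive the flags as lower-camelCase (e.g. --myStoreIP):

    import org.apache.beam.sdk.options.Description;
    import org.apache.beam.sdk.testing.TestPipelineOptions;

    // Invented interface holding the shared MyStore options; in practice
    // these getters could also live directly on TestPipelineOptions.
    public interface MyStoreTestOptions extends TestPipelineOptions {
      @Description("IP of the MyStore instance the ITs should use")
      String getMyStoreIP();
      void setMyStoreIP(String value);

      @Description("Username for the MyStore instance")
      String getMyStoreUsername();
      void setMyStoreUsername(String value);

      @Description("Password for the MyStore instance")
      String getMyStorePassword();
      void setMyStorePassword(String value);
    }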
Option 3: Individual IT Options
The last option I can think of is to specify the options directly on the
IT's options, e.g. MyStoreIT1Options, and set defaults which work well for
CI. This means that CI could run an entire suite of ITs without specifying
any arguments, trusting that the ITs' defaults will work, but it means an
individual developer is potentially able to run only one IT at a time,
since it will be impossible to override all options from the command line.
(A sketch follows the list below.)
* Upsides
  * Test is still self-contained, and even more so -- possible to specify
    args targeted at one IT in particular.
  * Args are specified right where they're used; way smaller chance of
    confusion or mistakes.
  * Easiest for CI -- as long as defaults for data store auth and location
    are correct from the perspective of the Jenkins executor, it can
    essentially just turn all ITs on and run them as is.
* Downsides
  * Hardest for individual developers to run an entire suite of ITs --
    since defaults are configured for running in the CI environment, they
    will likely fail when running on the user's machine, resulting in
    annoyance for the user.
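A sketch of what the per-IT defaults could look like (the interface name
and default values are invented):

    import org.apache.beam.sdk.options.Default;
    import org.apache.beam.sdk.options.Description;
    import org.apache.beam.sdk.options.PipelineOptions;

    public interface MyStoreIT1Options extends PipelineOptions {
      @Description("MyStore host; the default targets the CI-managed instance")
      @Default.String("mystore.ci.internal")  // invented hostname
      String getMyStoreHost();
      void setMyStoreHost(String value);

      @Description("MyStore port")
      @Default.Integer(5432)  // invented port
      Integer getMyStorePort();
      void setMyStorePort(Integer value);
    }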
If anyone has thoughts on these, please let me know.
Best,
Jason
--
-------
Jason Kuster
Apache Beam (Incubating) / Google Cloud Dataflow
--
-------
Jason Kuster
Apache Beam (Incubating) / Google Cloud Dataflow
--
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com