I also think that at a high level the success of Beam as a
project/community and as a piece of software depends on having
multiple viable runners with healthy set of users and
contributors. The pieces that are missing to me:
*User-focused comparison of runners (and IOs)*
+1 to Jesse's points. Automated capability tests don't really help this.
Benchmarks will be part of the story but are worth very little on
their own. Focusing on these is just choosing to measure things
that are easy to measure instead of addressing what is important,
which is in the end almost always qualitative.
*Automated integration tests on clusters*
We do need to know that runners and IOs "work" in a basic yes/no
manner on every commit/release, beyond unit tests. I am not
really willing to strongly claim to a potential user that
something "works" without this level of automation.
*More uniform operational experiences*
Setting up your Spark/Flink/Apex deployment should be different.
Launching a Beam pipeline on it should not be.
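For concreteness, a minimal sketch of what that uniformity already looks like at the code level with the Java SDK, where only the --runner option changes between engines (the transform here is just a placeholder):

    // Minimal sketch: the same pipeline, submitted to different engines purely
    // by changing --runner (e.g. --runner=SparkRunner or --runner=FlinkRunner).
    // Only the options change; the pipeline code does not.
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class LaunchAnywhere {
      public static void main(String[] args) {
        // e.g. args = {"--runner=FlinkRunner"} or {"--runner=SparkRunner"}
        PipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline p = Pipeline.create(options);
        p.apply(Create.of("beam", "on", "any", "runner"))
         .apply(MapElements.into(TypeDescriptors.strings())
             .via((String s) -> s.toUpperCase()));
        p.run().waitUntilFinish();
      }
    }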
*Portability: Any SDK on any runner*
We have now one SDK on master and one SDK on a dev branch that
both support portable execution somewhat. Unfortunately we have
no major open source runner that supports portability*. "Java on
any runner" is not compelling enough any more, if it ever was.
----
Reviews: I agree our response latency is too slow. I do not agree
that our quality bar is too high; I think we should raise it
*significantly*. Our codebase fails tests for long periods. Our
tests need to be green enough that we are comfortable blocking
merges *even for unrelated failures*. We should be able to cut a
release any time, modulo known blocker-level bugs.
Runner dev: I think Etienne's point about making it more uniform
to add features to all runners actually is quite important, since
the portability framework is a lot harder than "translate a Beam
ParDo to XYZ's FlatMap" where they are both Java. And even the
support code we've been building is not obvious to use and
probably won't be for the foreseeable future. This fits well into
the "Ben thread" on technical ideas so I'll comment there.
Kenn
*We do have a local batch-only portable runner in Python
On Fri, Jan 26, 2018 at 10:09 AM, Lukasz Cwik <lc...@google.com> wrote:
Etienne, regarding cross-runner coherence: the portability framework is attempting to create an API across all runners for job management and job execution. A lot of work still needs to be done to define and implement these APIs and migrate runners and SDKs to support them, since the current set of Java APIs is ad hoc in usage and purpose. In my opinion, development should really be focused on migrating runners and SDKs to use these APIs to get developer coherence. Work is slowly progressing on integrating them into the Java, Python, and Go SDKs and there are several JIRA issues in this regard, but involvement from more people could help.
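To make the "API across all runners for job management" idea concrete, a purely hypothetical Java sketch of the kind of surface such an API could expose; the names below are illustrative only and are not the actual Runner API / Fn API definitions (see the pointers below for those):

    // Hypothetical sketch only: illustrative names, not the real Beam portability APIs.
    // The idea is that every runner implements the same narrow job-management
    // surface, so SDKs and tools can target one contract instead of per-runner code.
    public interface JobService {
      // Validate and stage a portable pipeline; returns a token for the prepared job.
      String prepare(byte[] portablePipeline, java.util.Map<String, String> pipelineOptions);

      // Start execution of a previously prepared job and return its id.
      String run(String preparationToken);

      // Query and control a running job.
      JobState getState(String jobId);
      void cancel(String jobId);

      enum JobState { STARTING, RUNNING, DONE, FAILED, CANCELLED }
    }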
Some helpful pointers are:
https://s.apache.org/beam-runner-api
https://s.apache.org/beam-fn-api
https://issues.apache.org/jira/browse/BEAM-3515?jql=project%20%3D%20BEAM%20AND%20labels%20%3D%20portability
On Fri, Jan 26, 2018 at 7:21 AM, Etienne Chauchot <echauc...@apache.org> wrote:
Hi all,
Does anyone have comments about my point about dev coherence across the runners?
Thanks
Etienne
On 22/01/2018 at 16:16, Etienne Chauchot wrote:
Thanks Davor for bringing this discussion up!
I particularly like that you listed the different areas of improvement and proposed to assign people based on their tastes.
I wanted to add a point about consistency across runners, but from the dev point of view: I've been working on a cross-runner feature lately (metrics push, agnostic of the runners) for which I compared the behavior of the runners and wired up this feature into the Flink and Spark runners themselves. I must admit that I had a hard time figuring out how to wire it up in the different runners and that it was completely different between the runners. Also, their use (or non-use) of runner-core facilities varies. Even the architecture of the tests differs: some runners, like Spark, own their validates-runner tests in the runner module, while other runners run validates-runner tests that are owned by the sdk-core module. I also noticed some differences in the way to do streaming tests: for some runners, triggering streaming mode requires using an equivalent of the direct runner's TestStream in the pipeline, but for others putting streaming=true in the PipelineOptions is enough.
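For concreteness, a minimal sketch of the two mechanisms being contrasted (Java SDK; the element values are placeholders):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.options.StreamingOptions;
    import org.apache.beam.sdk.testing.TestStream;

    public class StreamingModeExamples {
      // (1) Use a TestStream source: the unbounded test input itself puts the
      //     pipeline into streaming semantics, the way the direct runner does it.
      public static Pipeline withTestStream(PipelineOptions options) {
        Pipeline p = Pipeline.create(options);
        p.apply(TestStream.create(StringUtf8Coder.of())
            .addElements("a", "b")
            .advanceWatermarkToInfinity());
        return p;
      }

      // (2) Set the streaming flag in PipelineOptions: for some runners,
      //     --streaming=true (StreamingOptions#setStreaming) alone is enough.
      public static PipelineOptions streamingOptions() {
        PipelineOptions options = PipelineOptionsFactory.create();
        options.as(StreamingOptions.class).setStreaming(true);
        return options;
      }
    }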
=> Long story short, IMHO it could be interesting to enhance the runner API to contain more than run(). This could have the benefit of increasing the coherence between runners. That said, we would need to find the correct balance between too many methods in the runner API, which would reduce the flexibility of the runner implementations, and too few methods, which would reduce the coherence between the runners.
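For reference, the contract runners implement today is essentially a single method; the extra hooks sketched after it are purely hypothetical and only illustrate the kind of surface that could be added:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.PipelineResult;

    // Roughly the contract the Java SDK asks runners to implement today:
    // a single run() method (the real PipelineRunner also has a static
    // fromOptions factory).
    abstract class RunnerContractToday<ResultT extends PipelineResult> {
      public abstract ResultT run(Pipeline pipeline);
    }

    // Hypothetical, illustrative-only extension in the spirit of the suggestion
    // above; none of these methods exist in Beam. They only sketch where shared
    // wiring (metrics push, streaming-mode behavior, ...) could be declared
    // uniformly across runners.
    interface HypotheticalRunnerHooks {
      // Would let shared infrastructure push metrics the same way for every runner.
      void configureMetricsSink(String sinkUrl, long periodMillis);
      // Would let test infrastructure know how to force streaming mode.
      boolean supportsStreamingViaOptionsOnly();
    }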
=> In addition, to enhance the coherence (dev point of view) between the runners, having all the runners run the exact same validates-runner tests in both batch and streaming modes would be awesome!
Another thing: big +1 to have a programmatic way of defining the capability matrix, as Romain suggested. Also agree on Ismaël's point about too-flexible concepts across runners (termination, bundling, ...).
Also big +1 to what Jesse wrote. I was myself in the user/architect position in the past, and I can confirm that all the points he mentioned are accurate.
Best,
Etienne
On 16/01/2018 at 17:39, Ismaël Mejía wrote:
Thanks Davor for opening this discussion and HUGE +1 to do this every year or in cycles. I will fork this thread into a new one for the Culture / Project management issues as suggested.
About the diversity of users across runners, I think this requires more attention to unification and implies work in at least these areas:
* Automated validation and consistent semantics among runners
Users should be confident that moving their code from one runner to the other just works, and the only way to ensure this is by having a runner pass ValidatesRunner/TCK tests and with this 'graduate' such support, as Romain suggested. The capability matrix is really nice but it is not a programmatic way to do this. Also, individual features usually do work, but feature combinations produce issues, so we need more exact semantics to avoid these.
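To make "passing ValidatesRunner" concrete, a minimal sketch in the style of the Java SDK's runner-validation tests (the transform and expected values are placeholders, not an actual test from the codebase):

    import org.apache.beam.sdk.testing.PAssert;
    import org.apache.beam.sdk.testing.TestPipeline;
    import org.apache.beam.sdk.testing.ValidatesRunner;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.junit.Rule;
    import org.junit.Test;
    import org.junit.experimental.categories.Category;

    public class ExampleValidatesRunnerTest {
      @Rule public final transient TestPipeline p = TestPipeline.create();

      // Tests in this JUnit category are meant to be executed against every
      // runner; a runner would "graduate" support by running and passing them all.
      @Test
      @Category(ValidatesRunner.class)
      public void testCountPerElement() {
        PCollection<KV<String, Long>> counts =
            p.apply(Create.of("a", "b", "a")).apply(Count.perElement());
        PAssert.that(counts).containsInAnyOrder(KV.of("a", 2L), KV.of("b", 1L));
        p.run();
      }
    }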
Some parts of Beam's semantics are loose (e.g. bundle partitioning, pipeline termination, etc.). I suppose this has been a design decision to allow flexibility in the runners' implementations, but it becomes inconvenient when users move among runners and get different results. I am not sure the current tradeoff is worth the usability sacrifice for the end user.
* Make user experience across runners a priority
Today runners not only behave in different ways, but the way users publish and package their applications also differs. Of course this is not a trivial problem because deployment is normally an end-user problem, but we can improve in this area, e.g. guaranteeing a consistent deployment mechanism across runners, and making IO integration easier. For example, when using multiple IOs and switching runners it is easy to run into conflicts; we should try to minimize this for the end users.
* Simplify operational tasks among runners
We need to add a minimum degree of consistent observability across runners. Of course Beam has metrics to do this, but it is not enough; an end user that starts on one runner and moves to another has to deal with a totally different set of logs and operational issues. We can try to improve this too, of course without trying to cover the full spectrum, but at least bringing some minimum level of observability. I hope that the current work on portability will bring some improvements in this area. This is crucial for users, who probably spend more time running (and dealing with) issues in their jobs than writing them.
We also need integration tests that simulate common user scenarios and some distribution use cases. For example, probably the most common data store used for streaming is Kafka (at least in open source). We should have an IT that tests common issues that can arise when you use Kafka: what happens if a Kafka broker goes down, does Beam continue to read without issue? What about a new leader election, do we continue to work as expected, etc.? Few projects have something like this, and it would send a clear message that Beam cares about reliability too.
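As a rough illustration, the kind of pipeline such an IT would exercise might look like the minimal KafkaIO read sketch below (the broker address and topic are placeholders; the harness that kills brokers or forces leader election is not shown):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.kafka.KafkaIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.kafka.common.serialization.LongDeserializer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class KafkaResiliencePipeline {
      public static Pipeline build(String bootstrapServers, String topic) {
        PipelineOptions options = PipelineOptionsFactory.create();
        Pipeline p = Pipeline.create(options);
        // The IT would run this unbounded read while the harness disrupts the
        // cluster (broker restart, leader election) and then assert no data loss.
        PCollection<KV<Long, String>> records =
            p.apply(KafkaIO.<Long, String>read()
                .withBootstrapServers(bootstrapServers) // e.g. "broker-1:9092" (placeholder)
                .withTopic(topic)                       // placeholder topic name
                .withKeyDeserializer(LongDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                .withoutMetadata());
        return p;
      }
    }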
Apart from these, I think we also need to work on:
* Simpler APIs + User friendly libraries.
I want to add a big thanks to Jesse for his list of criteria that people look at when they choose a framework for data processing. The first point, 'Will this dramatically improve the problems I'm trying to solve?', is super important. Of course Beam has portability and a rich model as its biggest assets, but I have been consistently asked at conferences whether Beam has libraries for graph processing, CEP, Machine Learning, or a Scala API.
Of course we have had some progress with the recent addition of SQL, and hopefully schema-aware PCollections will help there too, but there is still some way to go. This may not be crucial considering the portability goals of Beam, but these libraries are sometimes what makes users decide whether they adopt a tool or not, so better to have them than not.
These are the most important issues from my point of view. My excuses for the long email, but this was the perfect moment to discuss these.
One extra point: I think we should write and agree on a concise roadmap, and take a look at our progress on it at the middle and the end of the year, as other communities do.
Regards,
Ismaël
On Mon, Jan 15, 2018 at 7:49 PM, Jesse Anderson <je...@bigdatainstitute.io> wrote:
I think a focus on the runners is what's key to Beam's adoption. The runners are the foundation on which Beam sits. If the runners don't work properly, Beam won't work.
A focus on improved unit tests is a good start, but isn't what's needed. Compatibility matrices will help you see how your runner of choice stacks up, but that requires too much knowledge of Beam's internals to be interpretable.
Imagine you're an (enterprise) architect looking at adopting Beam. What do you look at or what do you look for before going deeper? What would make you stick your neck out to adopt Beam? In my experience, there are several pass/fails along the way.
Here are a few of the common ones I've seen:
* Will this dramatically improve the problems I'm trying to solve? (not writing APIs/better programming model/Beam's better handling of windowing)
* Can I get commercial support for Beam? (This is changing soon)
* Are other people using Beam with the same configuration and use case as me? (e.g. I'm using Spark with Beam to process imagery. Are others doing this in production?)
* Is there good documentation and books on the subject? (Tyler's and others' book will improve this)
* Can I get my team trained on this new technology? (I have Beam training and Google has some cursory training)
I think the one the community can improve on the most is the social proof of Beam. I've tried to do this (http://www.jesse-anderson.com/2017/06/beam-2-0-q-and-a/ and http://www.jesse-anderson.com/2016/07/question-and-answers-with-the-apache-beam-team/). We need to get the message out more about people using Beam in production, which configuration they have, and what their results were. I think we have the social proof on Dataflow, but not as much on Spark/Flink/Apex.
I think it's important to note that these checks don't look at the hardcore language or API semantics that we're working on. These are much later stage issues, if they're ever used at all.
In my experience with other open source adoption at enterprises, it starts with architects and works its way around the organization from there.
Thanks,
Jesse
On Mon, Jan 15, 2018 at 8:14 AM Ted Yu <yuzhih...@gmail.com> wrote:
bq. are hard to detect in our unit-test framework
Looks like more integration tests would help discover bugs / regressions more quickly. If the committer reviewing the PR has concerns in this regard, the concerns should be stated on the PR so that the contributor (and reviewer) can spend more time in solidifying the solution.
bq. I've gone and fixed these issues myself when merging
We can make stricter checkstyle rules so that the code wouldn't pass the build without addressing commonly known issues.
Cheers
On Sun, Jan 14, 2018 at 12:37 PM, Reuven Lax <re...@google.com> wrote:
I agree with the sentiment, but I don't completely agree with the criteria.
I think we need to be much better about reviewing PRs. Some PRs languish for too long before the reviewer gets to them (and I've been guilty of this too), which does not send a good message. Also, new PRs sometimes languish because there is no reviewer assigned; maybe we could write a gitbot to automatically assign a reviewer to every new PR?
Also, I think that the bar for merging a PR from a contributor should not be "the PR is perfect." It's perfectly fine to merge a PR that still has some issues (especially if the issues are stylistic). In the past when I've done this, I've gone and fixed these issues myself when merging. It was a bit more work for me to fix these things myself, but it was a small price to pay in order to portray Beam as a welcoming place for contributions.
On the other hand, "the build does
not break" is - in my opinion - too
weak of a criterion for merging. A
few reasons for this:
* Beam is a data-processing
framework, and data integrity is
paramount.
If a reviewer sees an issue that
could lead to data loss (or
duplication, or
corruption), I don't think that PR
should be merged. Historically many such
issues only actually manifest at
scale, and are hard to detect in our
unit-test framework. (we also need to
invest in more at-scale tests to catch
such issues).
* Beam guarantees backwards
compatibility for users (except across
major versions). If a bad API gets
merged and released (and the chances of
"forgetting" about it before the
release is cut is unfortunately high), we
are stuck with it. This is less of an
issue for many other open-source
projects that do not make such a
compatibility guarantee, as they are able
to simply remove or fix the API in
the next version.
I think we still need honest review
of PRs, with the criteria being
stronger than "the build doesn't
break." However reviewers also need to be
reasonable about what they ask for.
Reuven
On Sun, Jan 14, 2018 at 11:19 AM, Ted Yu <yuzhih...@gmail.com> wrote:
bq. if a PR is basically right (it does what it should) without breaking the build, then it has to be merged fast
+1 on above.
This would give contributors positive feedback.
On Sun, Jan 14, 2018 at 8:13 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
Hi Davor,
Thanks a lot for this e-mail.
I would like to emphasize two areas where we have to improve:
1. Apache way and community. We still have to focus on and be dedicated to our communities (both user & dev). Helping, encouraging, and growing our communities is key for the project. Building bridges between communities is also very important. We have to be more "accessible": sometimes simplifying our discussions and showing more interest and open-mindedness toward proposals would help as well. I think we do a good job already: we just have to improve.
2. Execution: a successful project is a project with regular activity in terms of releases, fixes, and improvements.
Regarding PRs, I think today we have PRs open for too long, and I think for three reasons:
- some are not ready or not good enough; no question on these ones
- some need a reviewer and speeding up: we have to keep an eye on the open PRs and review them asap
- some are under review but we have a lot of "ping pong" and long discussion, not always justified. I already said this on the mailing list, but, as for other Apache projects, if a PR is basically right (it does what it should) without breaking the build, then it has to be merged fast. If it requires additional changes (tests, polishing, improvements, ...), then they can be addressed in new PRs.
As already mentioned in the Beam 2.3.0 thread, we have to adopt a regular schedule for releases. It's a best effort to have a release every 2 months, whatever the release will contain. That's essential to maintain good activity in the project and for the third-party projects using Beam.
Again, don't get me wrong: we already do a good job! These are just areas where I think we have to improve.
Anyway, thanks for all the hard work we are doing all together!
Regards
JB
On 13/01/2018 05:12, Davor Bonaci wrote:
Hi everyone --
Apache Beam was established as a top-level project a year ago (on December 21, to be exact). This first anniversary is a great opportunity for us to look back at the past year, celebrate its successes, learn from any mistakes we have made, and plan for the next 1+ years.
I’d like to invite everyone in the community, particularly users and observers on this mailing list, to participate in this discussion. Apache Beam is your project and I, for one, would much appreciate your candid thoughts and comments. Just as some other projects do, I’d like to make this “state of the project” discussion an annual tradition in this community.
In terms of successes, the availability of the first stable release, version 2.0.0, was the biggest and most important milestone last year. Additionally, we have expanded the project’s breadth with new components, including several new runners, SDKs, and DSLs, and interconnected a large number of storage/messaging systems with new Beam IOs. In terms of community growth, crossing 200 lifetime individual contributors and achieving 76 contributors to a single release were other highlights. We have doubled the number of committers, and invited a handful of new PMC members. Thanks to each and every one of you for making all of this possible in our first year.
On the other hand, in such a young project as Beam, there are naturally many areas for improvement. This is the principal purpose of this thread (and any of its forks). To organize the separate discussions, I’d suggest to fork separate threads for different discussion areas:
* Culture and governance (anything related to people and their behavior)
* Community growth (what can we do to further grow a diverse and vibrant community)
* Technical execution (anything related to releases, their frequency, website, infrastructure)
* Feature roadmap for 2018 (what can we do to make the project more attractive to users, Beam 3.0, etc.).
I know many passionate folks who particularly care about each of these areas, but let me call on some folks from the community to get things started: Ismael for culture, Gris for community, JB for technical execution, and Ben for feature roadmap.
Perhaps we can use this thread to discuss project-wide vision. To seed that discussion, I’d start somewhat provocatively -- we aren’t doing so well on the diversity of users across runners, which is very important to the realization of the project’s vision. Would you agree, and would you be willing to make it the project’s #1 priority for the next 1-2 years?
Thanks -- and please join us in what would hopefully be a productive and informative discussion that shapes the future of this project!
Davor