It is excellent to have this discussion and excitement :) I admit I only glanced at the email threads. I apologize if I am repeating some existing ideas. I wanted to share my thoughts:
- Focus on the future: Instead of going back to stuff we have not implemented, we can think about what the users of 2025+ will want. Streaming has become a lot more complex and exciting and is getting used more over time. We can make it easy for users to operate a fleet of hundreds or thousands of pipelines: easy to run, manage, observe, and debug. Maybe we can add concepts for "groups of pipelines" or for "pipelines made of sub-pipelines." These are both real use cases we have seen, with many small pipelines bundled together for efficiency reasons. (I think pipeline update, CI/CD, etc., would follow from here.)
- Be use-case-driven: We have many published use cases that discuss the pros and cons. They are actionable (e.g., double down on the solid parts and fix or remove the weaker parts).
- ML is obviously doing well, and Beam's turnkey transform idea is also doing well; we could expand on both.
- Whatever we do, we need to make it a non-breaking change. Breaking changes turn out poorly for users and for us. We might even get to 3.0 gradually.
- As we get closer, we should think about a way to market 3.0 with a big bang; I am sure there will be many ideas.

Process wish: I hope we can find a structured way to make progress. When there is a lot of excitement, energy, and ideas, we must have a clear process for deciding what to do and how to build it in order to move this forward.

Ahmet

On Thu, Aug 22, 2024 at 3:51 PM XQ Hu via dev <dev@beam.apache.org> wrote:
> Thanks a lot for these discussions so far! I really like all of the thoughts.
> If you have some time, please add these thoughts to this public doc:
> https://docs.google.com/document/d/13r4NvuvFdysqjCTzMHLuUUXjKTIEY3d7oDNIHT6guww/
> Everyone should have write permission. Feel free to add/edit themes as well.
> Again, thanks a lot!
> For any folks who will attend Beam Summit 2024, see you all there and let us have more casual chats during the summit!
>
> On Thu, Aug 22, 2024 at 5:07 PM Valentyn Tymofieiev via dev <dev@beam.apache.org> wrote:
>
>> > Key to this will be a push to producing/consuming structured data (as has been mentioned) and also well-structured, language-agnostic configuration.
>>
>> > Unstructured data (aka "everything is bytes with coders") is overrated and should be an exception, not the default. Structured data everywhere, with specialized bytes columns.
>>
>> +1.
>>
>> I am seeing a tendency in distributed data processing engines to heavily recommend and use relational APIs to express data-processing cases on structured data. For example:
>>
>> Flink has introduced the Table API: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/tableapi/
>>
>> Spark has recently evolved its Dataframe API into a language-agnostic portability layer: https://spark.apache.org/docs/latest/spark-connect-overview.html
>>
>> Some lesser-known and more recent data processing engines also offer a subset of a Dataframe API or SQL, and/or a Dataframe API that is later translated into SQL.
>>
>> In contrast, in Beam, SQL and Dataframe APIs are more limited add-ons, natively available in the Java and Python SDKs respectively. It might be worthwhile to consider whether introducing a first-class relational API would make sense in Beam 3, and how it would impact Beam's cross-runner portability story.
>>
>> On Thu, Aug 22, 2024 at 12:21 PM Robert Bradshaw via dev <dev@beam.apache.org> wrote:
>>
>>> Echoing many of the comments here, but organizing them under a single theme, I would say a good focus for Beam 3.0 could be centering around being more "transform-centric." Specifically:
>>>
>>> - Make it easy to mix and match transforms across pipelines and environments (SDKs). Key to this will be a push to producing/consuming structured data (as has been mentioned) and also well-structured, language-agnostic configuration.
>>> - Better encapsulation for transforms. The main culprit here is update compatibility, but there may be other issues as well. Let's try to actually solve that for both primitives and composites.
>>> - Somewhat related to the above, I would love to actually solve the early/late output issue, and I think retractions and sink triggers are powerful paradigms we could develop to solve this issue in a novel way.
>>> - Continue to refine the idea of "best practices." This includes the points above, as well as things like robust error handling, monitoring, etc.
>>>
>>> Once we have these in place, we are in a position to offer a powerful catalogue of easy-to-use, well-focused transforms, both first- and third-party.
>>>
>>> Note that everything here can be backwards compatible. As a concrete milestone for when we "reach" 3.0, I would say that our core set of transforms has been updated to all reflect best practices (by default?) and we have a way for third parties to also publish such transforms.
>>>
>>> (One more bullet point: I would love to see us complete the migration to 100% portable runners, including local runners, which will help with the testing and development story, but will also be key to making the above vision work.)
>>>
>>> On Thu, Aug 22, 2024 at 8:00 AM Kenneth Knowles <k...@apache.org> wrote:
>>> >
>>> > I think this is a good idea. Fun fact - I think the first time we talked about "3.0" was 2018.
>>> >
>>> > I don't want to break users with 3.0, TBH, despite that being what a major version bump suggests. But I also don't want a triple-digit minor version. I think 3.0 is worthwhile if we have a new emphasis that is very meaningful to users and contributors.
>>> >
>>> > A couple of things I would say from experience with 2.0:
>>> >
>>> > - A lot of new model features are dropped before completion. Can we make it easier to evolve?
>>> > Maybe not, since in a way it is our "instruction set".
>>> >
>>> > - Transforms that provide straightforward functionality have a big impact: RunInference, IOs, etc. I like that these get more discussion now, whereas early in the project a lot of focus was on primitives and runners.
>>> >
>>> > - Integrations like YAML (and there will be plenty more, I'm sure) that rely on transforms as true no-code black boxes with non-UDF configuration seem like the next step in abstraction and ease of use.
>>> >
>>> > - Update-compatibility needs, which break through all our abstractions, have blocked innovative changes and UX improvements, and have had a chilling effect on refactoring and the things that make software continue to approach quality.
>>> >
>>> > And a few ideas I have about the future of the space, agreeing with XQ and Jan:
>>> >
>>> > - Unstructured data (aka "everything is bytes with coders") is overrated and should be an exception, not the default. Structured data everywhere, with specialized bytes columns. We can make small steps in this direction (and we already are).
>>> >
>>> > - Triggers are really not a great construct. "Sink triggers" map better to use cases, but how to implement them is a long adventure. But we really can't live without *something* to manage early output / late input, and the options in all other systems I am aware of are even worse.
>>> >
>>> > And a last thought is that we shouldn't continue to work on last decade's problems, if we can avoid it. Maybe there is a core to Beam that is imperfect but good enough (unification of batch and streaming; integration of many languages; core primitives that apply to any engine capable of handling our use cases) and what we want to do is focus on what we can build on top of it. I think this is implied by everything in this thread so far, but I just wanted to say it explicitly.
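[Editor's note: the "structured data everywhere, with specialized bytes columns" idea discussed above can be made concrete with a minimal sketch. This is plain Python, not Beam API; the `Event` record, its fields, and the toy `encode`/`decode` coder are all hypothetical illustrations.]

```python
from typing import NamedTuple
import json

# Hypothetical schema'd record: most fields are typed columns an engine
# can inspect, project, and optimize over; only the genuinely opaque
# payload stays as a specialized bytes column.
class Event(NamedTuple):
    user_id: str
    event_type: str
    timestamp_ms: int
    raw_payload: bytes  # opaque by design; everything else is structured

event = Event("u42", "click", 1724000000000, b"\x00\x01")

# With a schema, a relational-style projection is trivial and needs no coder:
projected = (event.user_id, event.event_type)

# The "everything is bytes with coders" alternative: the engine sees only an
# opaque blob and must round-trip a coder even to read a single field.
def encode(e: Event) -> bytes:
    return json.dumps(
        [e.user_id, e.event_type, e.timestamp_ms, e.raw_payload.hex()]
    ).encode()

def decode(blob: bytes) -> Event:
    user_id, event_type, ts, payload_hex = json.loads(blob)
    return Event(user_id, event_type, ts, bytes.fromhex(payload_hex))

assert decode(encode(event)) == event
assert projected == ("u42", "click")
```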
>>> >
>>> > Kenn
>>> >
>>> > On Tue, Aug 20, 2024 at 9:03 AM Jan Lukavský <je...@seznam.cz> wrote:
>>> >>
>>> >> Formatting and coloring. :)
>>> >>
>>> >> ----
>>> >>
>>> >> Hi XQ,
>>> >>
>>> >> thanks for starting this discussion!
>>> >>
>>> >> I agree we are getting to a point where discussing a major update of Apache Beam might be a good idea. Because such a window of opportunity happens only once in (quite many) years, I think we should use our current experience with the Beam model itself and check if there is any room for improvement there. First of all, some parts of the model are not implemented in Beam 2.0, e.g. retractions. Second, there are parts that are known to be error-prone, e.g. triggers. Another topic is features that are missing from the current model, e.g. iterations (yes, I know, general iterations might not even be possible, but it seems we can create reasonable constraints under which they work for the cases that really matter). Last but not least, we might want to re-think how we structure transforms, because that has a direct impact on how expensive it is to implement a new runner (GBK/Combine vs stateful ParDo).
>>> >>
>>> >> Having said that, my suggestion would be to take a higher-level look first: define which parts of the model are battle-tested enough that we trust them as a definite part of the 3.0 model, question all the others, and then iterate over this to come up with a new proposition of the model, with a focus on what you emphasize - use cases, user-friendly APIs, and concepts with as little unexpected behavior as possible. A key part of this should be a discussion about how we position Beam in the market - simplicity and correctness should be the key points, because practice shows people tend to misunderstand streaming concepts (which is absolutely understandable!).
>>> >>
>>> >> Best,
>>> >>
>>> >> Jan
>>> >>
>>> >> On 8/20/24 14:38, Jan Lukavský wrote:
>>> >>
>>> >> Best,
>>> >>
>>> >> Jan
>>> >>
>>> >> On 8/19/24 23:17, XQ Hu via dev wrote:
>>> >>
>>> >> Hi Beam Community,
>>> >>
>>> >> Lately, I have been thinking about the future of Beam and the potential roadmap towards Beam 3.0. After discussing this with my colleagues at Google, I would like to open a discussion about the path for us to move towards Beam 3.0. As we continue to enhance Beam 2 with new features and improvements, it's important to look ahead and consider the long-term vision for the project.
>>> >>
>>> >> Why Beam 3.0?
>>> >>
>>> >> I think there are several compelling reasons to start planning for Beam 3.0:
>>> >>
>>> >> Opportunity for Major Enhancements: We can introduce significant improvements and innovations.
>>> >>
>>> >> Mature Beam Primitives: We can re-evaluate and refine the core primitives, ensuring their maturity, stability, and ease of use for developers.
>>> >>
>>> >> Enhanced User Experience: We can introduce new features and APIs that significantly improve the developer experience and cater to evolving use cases, particularly in the machine learning domain.
>>> >>
>>> >> Potential Vision for Beam 3
>>> >>
>>> >> Best-in-Class for ML: Empower machine learning users with intuitive Python interfaces for data processing, model deployment, and evaluation.
>>> >>
>>> >> Rich, Portable Transforms: A cross-language library of standardized transforms, easily configured and managed via YAML.
>>> >>
>>> >> Streamlined Core: Simplified Beam primitives with clear semantics for easier development and maintenance.
>>> >>
>>> >> Turnkey Solutions: A curated set of powerful transforms for common data and ML tasks, including use-case-specific solutions.
>>> >>
>>> >> Simplified Streaming: Intuitive interfaces for streaming data with robust support for time-sorted input, metrics, and notifications.
>>> >>
>>> >> Enhanced Single Runner Capabilities: For use cases where a single large box, kept effectively busy, can meet the user's needs.
>>> >>
>>> >> Key Themes
>>> >>
>>> >> User-Centric Design: Enhance the overall developer experience with simplified APIs and streamlined workflows.
>>> >>
>>> >> Runner Consistency: Ensure identical functionality between local and remote runners for seamless development and deployment.
>>> >>
>>> >> Ubiquitous Data Schema: Standardize data schemas for improved interoperability and robustness.
>>> >>
>>> >> Expanded SDK Capabilities: Enrich SDKs with powerful new features like splittable DataFrames, stable-input guarantees, and time-sorted input processing.
>>> >>
>>> >> Thriving Transform Ecosystem: Foster a rich ecosystem of portable, managed turnkey transforms, available across all SDKs.
>>> >>
>>> >> Minimized Operational Overhead: Reduce complexity and maintenance burden by splitting Beam into smaller, more focused repositories.
>>> >>
>>> >> Next Steps:
>>> >>
>>> >> I propose we start by discussing the following:
>>> >>
>>> >> High-Level Goals/Vision/Themes: What are the most important goals and priorities for Beam 3.0?
>>> >>
>>> >> Potential Challenges: What are the biggest challenges we might face during the transition to Beam 3.0?
>>> >>
>>> >> Timeline: What would be a realistic timeline for planning, developing, and releasing Beam 3.0?
>>> >>
>>> >> This email thread is primarily meant to spark conversations about the anticipated features of Beam 3.0; however, there is currently no official timeline commitment. To facilitate the discussions, I created a public doc that we can collaborate on.
>>> >>
>>> >> I am excited to work with all of you to shape the future of Beam and make it an even more powerful and user-friendly data processing framework!
>>> >>
>>> >> Meanwhile, I hope to see many of you at Beam Summit 2024 (https://beamsummit.org/), where we can have more in-depth conversations about the future of Beam.
>>> >>
>>> >> Thanks,
>>> >>
>>> >> XQ Hu (GitHub: liferoad)
>>> >>
>>> >> Public Doc for gathering feedback: [Public] Beam 3.0: a discussion doc (PTAL)