With regard to the process of approaching Beam 3.0: A lot of what we describe would just be new stuff that goes into Beam 2.XX as well. This is all good as far as I'm concerned. If there were something where we want to change the default, we could release it early under a `--preview-3.0` flag or some such.
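The flag-gated rollout described above could work roughly like this toy sketch (the option name `preview_3_0` and the mode names are hypothetical illustrations, not real Beam pipeline options):

```python
# Toy sketch of releasing a changed default behind a preview flag.
# The option and mode names here are hypothetical, not actual Beam flags.

class PipelineOptions:
    def __init__(self, preview_3_0: bool = False):
        self.preview_3_0 = preview_3_0

def default_grouping_impl(options: PipelineOptions) -> str:
    # Users who pass --preview-3.0 opt into the new default early;
    # everyone else keeps the 2.x behavior until 3.0 ships.
    return "new_default" if options.preview_3_0 else "classic_2x"

print(default_grouping_impl(PipelineOptions()))                  # classic_2x
print(default_grouping_impl(PipelineOptions(preview_3_0=True)))  # new_default
```

The point of the sketch: the preview flag changes only defaults, so existing pipelines keep running unchanged while early adopters can exercise 3.0 behavior.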
So there won't be a bunch of features that drop all at once or anything. That's impractical. Which makes me think that launching 3.0 with a "big bang" might be a combination of reaching a milestone we are happy with + a documentation/marketing adjustment to the webpage, to put our new emphasis in the spotlight. I think it would also be excellent to think of ways we can allow 3.XX to evolve faster. A great example that we recently implemented was the "--updateCompatibilityVersion" flag. This allows us to maintain update compatibility for users who need it without freezing development forever. Especially as we focus on libraries of composites to serve use cases, I think the need to evolve and not be stuck on our first draft will be really important. Kenn On Fri, Aug 23, 2024 at 5:28 AM Danny McCormick via dev <dev@beam.apache.org> wrote: > I'm generally +1 on doing this as well. Things I'm interested in are: > > - Expanded turnkey transform support (especially ML). I think moving Beam > beyond just being a core "here's some pieces, build it yourself" SDK to a > tool that can solve business problems is useful. > --- Corollary - if we're increasingly focusing on use cases/transforms, > there's going to be an increasing number of pieces which are only relevant > to certain users. I'm interested in reorganizing our release so that we > release smaller composable units (e.g. we can version IOs or transforms > independently of Beam core) > - +1 on at the very least keeping breaking changes minimal. I haven't seen > much that would require them anyway in the conversation above. > - Where possible, move towards portability + structured data. I'm a little > less bullish on Kenn's "Unstructured data (aka "everything is bytes with > coders") is overrated and should be an exception not the default." 
(I don't > think this always works well for ML, for example, which doesn't always fit > into Beam schemas), but I'm generally on board with making the structured > experience the best-lit path. > - +1 on a better portable (and local) experience; in general, I think a > goal of Beam 3 should be to not be Java first. I think we've said we're not > Java first, but in practice it is still well ahead of the other languages. > Burning down gaps in the SDKs and making the local experience good > regardless of language(s) would be awesome. > > > Process wish: I hope we can find a structured way to make progress. When > there is a lot of excitement, energy, and ideas, we must have a clear > process for deciding what to do and how to build it to move this forward. > > Agreed - it seems like we have consensus that this is a good idea. It also > seems like there's generally good momentum for a few themes/work items, so > it seems like trying to formalize this a bit makes sense: > > As a first step I'd suggest we start to add items to the proposed work > items in > https://docs.google.com/document/d/13r4NvuvFdysqjCTzMHLuUUXjKTIEY3d7oDNIHT6guww/edit#heading=h.mugv92ccok3l. > I added a few items, and would encourage others to do so as well. > > From there, we can try to get consensus and add priorities, then create > issues/a label to track the issue burndown. > > Thanks, > Danny > > On Fri, Aug 23, 2024 at 6:05 AM Ahmet Altay via dev <dev@beam.apache.org> > wrote: > >> It is excellent to have this discussion and excitement :) >> >> I admit I only glanced at the email threads. I apologize if I am >> repeating some existing ideas. I wanted to share my thoughts: >> >> - Focus on the future: Instead of going back to stuff we have not >> implemented, we can think about what the users of 2025+ would want. >> Streaming has become a lot more complex and exciting and is getting used >> more over time. 
We can make it easy for users to operate a fleet of 100s or >> 1000s+ of pipelines, easy to run, manage, observe, and debug. Maybe we can >> add concepts for "groups of pipelines" or concepts for >> "pipelines-made-of-sub-pipelines." These are both real use cases we have >> seen with bundling many small pipelines for efficiency reasons. (I think >> pipeline update, CI/CD, etc., would follow from here.) >> - Be use case-driven. We have many published use cases. They discuss the >> pros and cons. They could also be actionable (e.g., double down on the solid >> parts and fix or remove the weaker parts). >> - ML is obviously doing well, and Beam's turnkey transform idea is also >> doing well; we could expand on both. >> - Whatever we do, we need to make it a non-breaking change. Breaking >> changes turn out poorly for users and us. We might even gradually get to >> 3.0. >> - As we get closer, we should think about a way to market 3.0 with a big >> bang; I am sure there will be many ideas. >> >> Process wish: I hope we can find a structured way to make progress. When >> there is a lot of excitement, energy, and ideas, we must have a clear >> process for deciding what to do and how to build it to move this forward. >> >> Ahmet >> >> >> >> On Thu, Aug 22, 2024 at 3:51 PM XQ Hu via dev <dev@beam.apache.org> >> wrote: >> >>> Thanks a lot for these discussions so far! I really like all of the >>> thoughts. >>> If you have some time, please add these thoughts to this public doc: >>> https://docs.google.com/document/d/13r4NvuvFdysqjCTzMHLuUUXjKTIEY3d7oDNIHT6guww/ >>> Everyone should have write permission. Feel free to add/edit themes >>> as well. >>> Again, thanks a lot! >>> For any folks who will attend Beam Summit 2024, see you all there and >>> let us have more casual chats during the summit! 
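The "groups of pipelines" idea above could be sketched, purely conceptually, like this (none of these classes exist in Beam; the whole API is a hypothetical illustration):

```python
# Conceptual sketch of a "group of pipelines" abstraction for operating a
# fleet of many small pipelines as one unit. Hypothetical API, not Beam.

class ManagedPipeline:
    def __init__(self, name: str):
        self.name = name
        self.state = "STOPPED"

    def run(self):
        self.state = "RUNNING"

class PipelineGroup:
    """Run, observe, and manage many pipelines together."""
    def __init__(self, pipelines):
        self.pipelines = list(pipelines)

    def run_all(self):
        for p in self.pipelines:
            p.run()

    def status(self):
        # One observability surface for the whole fleet.
        return {p.name: p.state for p in self.pipelines}

group = PipelineGroup(ManagedPipeline(f"ingest-{i}") for i in range(3))
group.run_all()
print(group.status())
```

A real design would also need group-level update, drain, and metrics, but the sketch shows the basic shape: one handle over many pipelines instead of N independent jobs.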
>>> >>> On Thu, Aug 22, 2024 at 5:07 PM Valentyn Tymofieiev via dev < >>> dev@beam.apache.org> wrote: >>> >>>> > Key to this will be a push to producing/consuming structured data >>>> (as has been mentioned) and also well-structured, >>>> language-agnostic configuration. >>>> >>>> > Unstructured data (aka "everything is bytes with coders") is >>>> overrated and should be an exception not the default. Structured data >>>> everywhere, with specialized bytes columns. >>>> >>>> +1. >>>> >>>> I am seeing a tendency in distributed data processing engines to >>>> heavily recommend and use relational APIs to express data-processing cases >>>> on structured data; for example: >>>> >>>> Flink has introduced the Table API: >>>> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/tableapi/ >>>> >>>> Spark has recently evolved its Dataframe API into a language-agnostic >>>> portability layer: >>>> https://spark.apache.org/docs/latest/spark-connect-overview.html >>>> Some lesser-known and more recent data processing engines also offer a subset of >>>> Dataframe or SQL, and/or a Dataframe API that is later translated into >>>> SQL. >>>> >>>> In contrast, in Beam, SQL and Dataframe APIs are more limited add-ons, >>>> natively available in Java and Python SDKs respectively. It might be >>>> worthwhile to consider whether introducing a first-class >>>> relational API would make sense in Beam 3, and how it would impact >>>> Beam's cross-runner portability story. >>>> >>>> On Thu, Aug 22, 2024 at 12:21 PM Robert Bradshaw via dev < >>>> dev@beam.apache.org> wrote: >>>> >>>>> Echoing many of the comments here, but organizing them under a single >>>>> theme, I would say a good focus for Beam 3.0 could be centering around >>>>> being more "transform-centric." Specifically: >>>>> >>>>> - Make it easy to mix and match transforms across pipelines and >>>>> environments (SDKs). 
Key to this will be a push to producing/consuming >>>>> structured data (as has been mentioned) and also well-structured, >>>>> language-agnostic configuration. >>>>> - Better encapsulation for transforms. The main culprit here is update >>>>> compatibility, but there may be other issues as well. Let's try to >>>>> actually solve that for both primitives and composites. >>>>> - Somewhat related to the above, I would love to actually solve the >>>>> early/late output issue, and I think retractions and sink triggers are >>>>> powerful paradigms we could develop to actually solve this issue in a >>>>> novel way. >>>>> - Continue to refine the idea of "best practices." This includes the >>>>> points above, as well as things like robust error handling, >>>>> monitoring, etc. >>>>> >>>>> Once we have these in place, we are in a position to offer a powerful >>>>> catalogue of easy-to-use, well-focused transforms, both first- and >>>>> third-party. >>>>> >>>>> Note that everything here can be backwards compatible. As a concrete >>>>> milestone for when we "reach" 3.0, I would say that our core set of >>>>> transforms has been updated to all reflect best practices (by >>>>> default?) and we have a way for third parties to also publish such >>>>> transforms. >>>>> >>>>> (One more bullet point: I would love to see us complete the migration >>>>> to 100% portable runners, including local runners, which will help >>>>> with the testing and development story, but will also be key to making >>>>> the above vision work.) >>>>> >>>>> On Thu, Aug 22, 2024 at 8:00 AM Kenneth Knowles <k...@apache.org> >>>>> wrote: >>>>> > >>>>> > I think this is a good idea. Fun fact: I think the first time we >>>>> talked about "3.0" was 2018. >>>>> > >>>>> > I don't want to break users with 3.0, TBH, despite that being what a >>>>> major version bump suggests. But I also don't want a triple-digit minor >>>>> version. 
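The "well-structured, language-agnostic configuration" point above can be illustrated with a small sketch: if a transform is described by a type name plus a plain JSON-serializable config, any SDK language can construct or consume it. The `Filter` spec and its fields below are illustrative, not a normative Beam format:

```python
import json

# Sketch: a transform described purely by a type name plus a
# JSON-serializable config, so any SDK language can build or read it.
# The transform/field names here are illustrative, not a real Beam spec.

transform_spec = {
    "type": "Filter",
    "config": {"language": "python", "keep": "amount > 100"},
}

# Because the spec is plain data (no closures, no language-specific
# objects), it round-trips through JSON intact, which is what makes it
# usable across language boundaries.
wire = json.dumps(transform_spec)
assert json.loads(wire) == transform_spec
print(wire)
```

The design choice this sketches: configuration as data rather than code is what lets a Java pipeline instantiate a Python-authored transform (or vice versa) without sharing a runtime.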
I think 3.0 is worthwhile if we have a new emphasis that is very >>>>> meaningful to users and contributors. >>>>> > >>>>> > >>>>> > A couple things I would say from experience with 2.0: >>>>> > >>>>> > - A lot of new model features are dropped before completion. Can we >>>>> make it easier to evolve? Maybe not, since in a way it is our "instruction >>>>> set". >>>>> > >>>>> > - Transforms that provide straightforward functionality have a big >>>>> impact: RunInference, IOs, etc. I like that these get more discussion now, >>>>> whereas early in the project a lot of focus was on primitives and runners. >>>>> > >>>>> > - Integrations like YAML (and there will be plenty more I'm sure) >>>>> that rely on transforms as true no-code black boxes with non-UDF >>>>> configuration seem like the next step in abstraction and ease of use. >>>>> > >>>>> > - Update compatibility needs, which break through all our >>>>> abstractions, have blocked innovative changes and UX improvements, and had >>>>> a chilling effect on refactoring and the things that make software >>>>> continue >>>>> to approach Quality. >>>>> > >>>>> > >>>>> > And a few ideas I have about the future of the space, agreeing with >>>>> XQ and Jan >>>>> > >>>>> > - Unstructured data (aka "everything is bytes with coders") is >>>>> overrated and should be an exception not the default. Structured data >>>>> everywhere, with specialized bytes columns. We can make small steps in >>>>> this >>>>> direction (and we are already). >>>>> > >>>>> > - Triggers are really not a great construct. "Sink triggers" map >>>>> better to use cases but how to implement them is a long adventure. But we >>>>> really can't live without *something* to manage early output / late input, >>>>> and the options in all other systems I am aware of are even worse. >>>>> > >>>>> > And a last thought is that we shouldn't continue to work on last >>>>> decade's problems, if we can avoid it. 
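The "structured data everywhere, with specialized bytes columns" point above can be illustrated with a toy comparison (the row type and hand-rolled coder below are illustrations, not Beam's actual schema machinery):

```python
import json
from typing import NamedTuple

# Toy contrast between "everything is bytes with a coder" and structured
# rows. Illustration only; not Beam's schema or coder APIs.

class TxnRow(NamedTuple):   # structured: named, typed fields
    user: str
    amount: int

row = TxnRow(user="alice", amount=250)

# Unstructured style: the element crossing the engine is opaque bytes;
# only a matching coder in user code can recover the fields.
def coder_encode(r: TxnRow) -> bytes:
    return json.dumps([r.user, r.amount]).encode()

def coder_decode(b: bytes) -> TxnRow:
    return TxnRow(*json.loads(b))

opaque = coder_encode(row)

# Structured style: fields are visible, so the engine itself could
# project, filter, or optimize on them without decoding user bytes.
projected = {"user": row.user, "amount": row.amount}

assert coder_decode(opaque) == row
print(projected)
```

With opaque bytes the engine can only pass elements through; with named fields it can reason about the data, which is the argument for structured data as the default and bytes as the specialized exception.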
Maybe there is a core to Beam that >>>>> is imperfect but good enough (unification of batch & streaming; >>>>> integration >>>>> of many languages; core primitives that apply to any engine capable of >>>>> handling our use cases) and what we want to do is focus on what we can >>>>> build on top of it. I think this is implied by everything in this thread >>>>> so >>>>> far but I just wanted to say it explicitly. >>>>> > >>>>> > Kenn >>>>> > >>>>> > On Tue, Aug 20, 2024 at 9:03 AM Jan Lukavský <je...@seznam.cz> >>>>> wrote: >>>>> >> >>>>> >> Formatting and coloring. :) >>>>> >> >>>>> >> ---- >>>>> >> >>>>> >> Hi XQ, >>>>> >> >>>>> >> thanks for starting this discussion! >>>>> >> >>>>> >> I agree we are getting to a point where discussing a major update of >>>>> Apache Beam might be a good idea. Because such a window of opportunity happens >>>>> only once in (quite many) years, I think we should try to use our current >>>>> experience with the Beam model itself and check if there is any room for >>>>> improvement there. First of all, we have some parts of the model itself >>>>> that are not implemented in Beam 2.0, e.g. retractions. Second, there are >>>>> parts that are known to be error-prone, e.g. triggers. Another topic is >>>>> features that are missing in the current model, e.g. iterations (yes, I >>>>> know, general iterations might not even be possible, but it seems we can >>>>> create reasonable constraints for them to work for the cases that really >>>>> matter). Last but not least, we might want to re-think how we structure >>>>> transforms, because that has a direct impact on how expensive it is to >>>>> implement a new runner (GBK/Combine vs stateful ParDo). 
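For readers unfamiliar with retractions (mentioned several times in this thread), here is a toy model of the idea: when a running aggregate changes, the old value is retracted alongside the new value, so downstream consumers can correct results they already acted on. This is a conceptual sketch, not Beam code:

```python
from collections import defaultdict

# Toy model of retractions: a streaming count that, on each update,
# retracts its previously-emitted value before emitting the new one.
# Conceptual illustration only; not Beam's (unimplemented) retractions.

def count_with_retractions(events):
    counts = defaultdict(int)
    out = []
    for key in events:
        old = counts[key]
        counts[key] += 1
        if old:
            out.append(("retract", key, old))
        out.append(("add", key, counts[key]))
    return out

print(count_with_retractions(["a", "b", "a"]))
# [('add', 'a', 1), ('add', 'b', 1), ('retract', 'a', 1), ('add', 'a', 2)]
```

A downstream sink that applies adds and retractions in order always converges to the correct count, which is what makes retractions attractive for the early/late output problem triggers struggle with.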
>>>>> >> >>>>> >> Having said that, my suggestion would be to take a higher-level >>>>> look first, define which parts of the model are battle-tested enough that we >>>>> trust them as a definite part of the 3.0 model, question all the others, and >>>>> then iterate over this to come up with a new proposal for the model, with a >>>>> focus on what you emphasize - use cases, user-friendly APIs, and concepts >>>>> that contain as little unexpected behavior as possible. A key part of this >>>>> should be a discussion about how we position Beam on the market - simplicity >>>>> and correctness should be the key points, because practice shows people >>>>> tend to misunderstand the streaming concepts (which is absolutely >>>>> understandable!). >>>>> >> >>>>> >> Best, >>>>> >> >>>>> >> Jan >>>>> >> >>>>> >> On 8/19/24 23:17, XQ Hu via dev wrote: >>>>> >> >>>>> >> Hi Beam Community, >>>>> >> >>>>> >> Lately, I have been thinking about the future of Beam and the >>>>> potential roadmap towards Beam 3.0. After discussing this with my >>>>> colleagues at Google, I would like to open a discussion about the path for >>>>> us to move towards Beam 3.0. As we continue to enhance Beam 2 with new >>>>> features and improvements, it's important to look ahead and consider the >>>>> long-term vision for the project. >>>>> >> >>>>> >> Why Beam 3.0? >>>>> >> >>>>> >> I think there are several compelling reasons to start planning for >>>>> Beam 3.0: >>>>> >> >>>>> >> Opportunity for Major Enhancements: We can introduce significant >>>>> improvements and innovations. >>>>> >> >>>>> >> Mature Beam Primitives: We can re-evaluate and refine the core >>>>> primitives, ensuring their maturity, stability, and ease of use for >>>>> developers. >>>>> >> >>>>> >> Enhanced User Experience: We can introduce new features and APIs >>>>> that significantly improve the developer experience and cater to evolving >>>>> use cases, particularly in the machine learning domain. 
>>>>> >> >>>>> >> >>>>> >> Potential Vision for Beam 3 >>>>> >> >>>>> >> Best-in-Class for ML: Empower machine learning users with intuitive >>>>> Python interfaces for data processing, model deployment, and evaluation. >>>>> >> >>>>> >> Rich, Portable Transforms: A cross-language library of standardized >>>>> transforms, easily configured and managed via YAML. >>>>> >> >>>>> >> Streamlined Core: Simplified Beam primitives with clear semantics >>>>> for easier development and maintenance. >>>>> >> >>>>> >> Turnkey Solutions: A curated set of powerful transforms for common >>>>> data and ML tasks, including use-case-specific solutions. >>>>> >> >>>>> >> Simplified Streaming: Intuitive interfaces for streaming data with >>>>> robust support for time-sorted input, metrics, and notifications. >>>>> >> >>>>> >> Enhanced Single-Runner Capabilities: For use cases where a single >>>>> large machine that can be kept effectively busy can meet the user's needs. >>>>> >> >>>>> >> Key Themes >>>>> >> >>>>> >> User-Centric Design: Enhance the overall developer experience with >>>>> simplified APIs and streamlined workflows. >>>>> >> >>>>> >> Runner Consistency: Ensure identical functionality between local >>>>> and remote runners for seamless development and deployment. >>>>> >> >>>>> >> Ubiquitous Data Schema: Standardize data schemas for improved >>>>> interoperability and robustness. >>>>> >> >>>>> >> Expanded SDK Capabilities: Enrich SDKs with powerful new features >>>>> like splittable DataFrames, stable input guarantees, and time-sorted input >>>>> processing. >>>>> >> >>>>> >> Thriving Transform Ecosystem: Foster a rich ecosystem of portable, >>>>> managed turnkey transforms, available across all SDKs. >>>>> >> >>>>> >> Minimized Operational Overhead: Reduce complexity and maintenance >>>>> burden by splitting Beam into smaller, more focused repositories. 
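The "configured and managed via YAML" bullet above refers to Beam YAML; a pipeline there looks roughly like the fragment below. The exact transform names and config fields should be checked against the Beam YAML documentation; this is an illustrative shape, not a verified spec:

```yaml
# Illustrative Beam YAML pipeline: read, filter, write, with no user
# code beyond a small expression. Verify transform names and config
# fields against the Beam YAML docs before use.
pipeline:
  type: chain
  transforms:
    - type: ReadFromCsv
      config:
        path: /data/input/*.csv
    - type: Filter
      config:
        language: python
        keep: "amount > 100"
    - type: WriteToJson
      config:
        path: /data/output/result.json
```

This is the "transforms as true no-code black boxes with non-UDF configuration" idea from earlier in the thread: the whole pipeline is declarative data, so it is SDK-language-agnostic by construction.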
>>>>> >> >>>>> >> Next Steps: >>>>> >> >>>>> >> I propose we start by discussing the following: >>>>> >> >>>>> >> High-Level Goals/Vision/Themes: What are the most important goals >>>>> and priorities for Beam 3.0? >>>>> >> >>>>> >> Potential Challenges: What are the biggest challenges we might face >>>>> during the transition to Beam 3.0? >>>>> >> >>>>> >> Timeline: What would be a realistic timeline for planning, >>>>> developing, and releasing Beam 3.0? >>>>> >> >>>>> >> This email thread is primarily meant to spark conversations about the >>>>> anticipated features of Beam 3.0; however, there is currently no official >>>>> timeline commitment. To facilitate the discussions, I created a public doc >>>>> that we can collaborate on. >>>>> >> >>>>> >> I am excited to work with all of you to shape the future of Beam >>>>> and make it an even more powerful and user-friendly data processing >>>>> framework! >>>>> >> >>>>> >> Meanwhile, I hope to see many of you at Beam Summit 2024 ( >>>>> https://beamsummit.org/), where we can have more in-depth >>>>> conversations about the future of Beam. >>>>> >> >>>>> >> Thanks, >>>>> >> >>>>> >> XQ Hu (GitHub: liferoad) >>>>> >> >>>>> >> Public Doc for gathering feedback: [Public] Beam 3.0: a discussion >>>>> doc (PTAL)