It is excellent to have this discussion and excitement :) I admit I only glanced at the email threads. I apologize if I am repeating some existing ideas. I wanted to share my thoughts:
- Focus on the future: Instead of going back to stuff we have not implemented, we can think about what the users of 2025+ will want. Streaming has become a lot more complex and exciting and is getting used more over time. We can make it easy for users to operate a fleet of hundreds or thousands of pipelines: easy to run, manage, observe, and debug. Maybe we can add concepts for "groups of pipelines" or for "pipelines made of sub-pipelines." These are both real use cases we have seen, with many small pipelines bundled together for efficiency reasons. (I think pipeline update, CI/CD, etc., would follow from here.)
- Be use-case-driven: We have many published use cases that discuss the pros and cons. They are actionable (e.g., double down on the solid parts and fix or remove the weaker parts).
- ML is obviously doing well, and Beam's turnkey transform idea is also doing well; we could expand on both.
- Whatever we do, we need to make it a non-breaking change. Breaking changes turn out poorly for users and for us. We might even get to 3.0 gradually.
- As we get closer, we should think about a way to market 3.0 with a big bang; I am sure there will be many ideas.

Process wish: I hope we can find a structured way to make progress. When there is a lot of excitement, energy, and ideas, we must have a clear process for deciding what to do and how to build it in order to move this forward.

Ahmet

On Thu, Aug 22, 2024 at 3:51 PM XQ Hu via dev <dev@beam.apache.org> wrote:
> Thanks a lot for these discussions so far! I really like all of the thoughts.
> If you have some time, please add these thoughts to this public doc:
> https://docs.google.com/document/d/13r4NvuvFdysqjCTzMHLuUUXjKTIEY3d7oDNIHT6guww/
> Everyone should have write permission. Feel free to add/edit themes as well.
> Again, thanks a lot!
> For any folks who will attend Beam Summit 2024, see you all there and let us have more casual chats during the summit!
>
> On Thu, Aug 22, 2024 at 5:07 PM Valentyn Tymofieiev via dev <dev@beam.apache.org> wrote:
>
>> > Key to this will be a push to producing/consuming structured data (as has been mentioned) and also well-structured, language-agnostic configuration.
>>
>> > Unstructured data (aka "everything is bytes with coders") is overrated and should be an exception, not the default. Structured data everywhere, with specialized bytes columns.
>>
>> +1.
>>
>> I am seeing a tendency in distributed data processing engines to heavily recommend and use relational APIs to express data-processing cases on structured data. For example:
>>
>> Flink has introduced the Table API: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/tableapi/
>>
>> Spark has recently evolved its Dataframe API into a language-agnostic portability layer: https://spark.apache.org/docs/latest/spark-connect-overview.html
>>
>> Some lesser-known and more recent data processing engines also offer a subset of a Dataframe API or SQL, and/or a Dataframe API that is later translated into SQL.
>>
>> In contrast, in Beam, SQL and Dataframe APIs are more limited add-ons, natively available in the Java and Python SDKs respectively. It might be worthwhile to consider whether introducing a first-class relational API would make sense in Beam 3, and how it would impact Beam's cross-runner portability story.
>>
>> On Thu, Aug 22, 2024 at 12:21 PM Robert Bradshaw via dev <dev@beam.apache.org> wrote:
>>
>>> Echoing many of the comments here, but organizing them under a single theme, I would say a good focus for Beam 3.0 could be centering around being more "transform-centric." Specifically:
>>>
>>> - Make it easy to mix and match transforms across pipelines and environments (SDKs). Key to this will be a push to producing/consuming structured data (as has been mentioned) and also well-structured, language-agnostic configuration.
>>> - Better encapsulation for transforms. The main culprit here is update compatibility, but there may be other issues as well. Let's try to actually solve that for both primitives and composites.
>>> - Somewhat related to the above, I would love to actually solve the early/late output issue, and I think retractions and sink triggers are powerful paradigms we could develop to solve this issue in a novel way.
>>> - Continue to refine the idea of "best practices." This includes the points above, as well as things like robust error handling, monitoring, etc.
>>>
>>> Once we have these in place, we are in a position to offer a powerful catalogue of easy-to-use, well-focused transforms, both first- and third-party.
>>>
>>> Note that everything here can be backwards compatible. As a concrete milestone for when we "reach" 3.0, I would say that our core set of transforms has been updated to all reflect best practices (by default?) and we have a way for third parties to also publish such transforms.
>>>
>>> (One more bullet point: I would love to see us complete the migration to 100% portable runners, including local runners, which will help with the testing and development story, but will also be key to making the above vision work.)
>>>
>>> On Thu, Aug 22, 2024 at 8:00 AM Kenneth Knowles <k...@apache.org> wrote:
>>> >
>>> > I think this is a good idea. Fun fact - I think the first time we talked about "3.0" was 2018.
>>> >
>>> > I don't want to break users with 3.0, TBH, despite that being what a major version bump suggests. But I also don't want a triple-digit minor version. I think 3.0 is worthwhile if we have a new emphasis that is very meaningful to users and contributors.
>>> >
>>> > A couple of things I would say from experience with 2.0:
>>> >
>>> > - A lot of new model features are dropped before completion. Can we make it easier to evolve?
>>> > Maybe not, since in a way it is our "instruction set".
>>> >
>>> > - Transforms that provide straightforward functionality have a big impact: RunInference, IOs, etc. I like that these get more discussion now, whereas early in the project a lot of focus was on primitives and runners.
>>> >
>>> > - Integrations like YAML (and there will be plenty more, I'm sure) that rely on transforms as true no-code black boxes with non-UDF configuration seem like the next step in abstraction and ease of use.
>>> >
>>> > - Update-compatibility needs, which break through all our abstractions, have blocked innovative changes and UX improvements, and have had a chilling effect on refactoring and the things that make software continue to approach quality.
>>> >
>>> > And a few ideas I have about the future of the space, agreeing with XQ and Jan:
>>> >
>>> > - Unstructured data (aka "everything is bytes with coders") is overrated and should be an exception, not the default. Structured data everywhere, with specialized bytes columns. We can make small steps in this direction (and we already are).
>>> >
>>> > - Triggers are really not a great construct. "Sink triggers" map better to use cases, but how to implement them is a long adventure. But we really can't live without *something* to manage early output / late input, and the options in all other systems I am aware of are even worse.
>>> >
>>> > And a last thought is that we shouldn't continue to work on last decade's problems, if we can avoid it. Maybe there is a core to Beam that is imperfect but good enough (unification of batch and streaming; integration of many languages; core primitives that apply to any engine capable of handling our use cases) and what we want to do is focus on what we can build on top of it. I think this is implied by everything in this thread so far, but I just wanted to say it explicitly.
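[Editor's note: the "structured data everywhere, with specialized bytes columns" idea discussed above can be made concrete with a minimal sketch. This is plain Python, not Beam API; the `Event` record, its fields, and the toy `encode`/`decode` coder are all hypothetical illustrations.]

```python
from typing import NamedTuple
import json

# Hypothetical schema'd record: most fields are typed columns an engine
# can inspect, project, and optimize over; only the genuinely opaque
# payload stays as a specialized bytes column.
class Event(NamedTuple):
    user_id: str
    event_type: str
    timestamp_ms: int
    raw_payload: bytes  # opaque by design; everything else is structured

event = Event("u42", "click", 1724000000000, b"\x00\x01")

# With a schema, a relational-style projection is trivial and needs no coder:
projected = (event.user_id, event.event_type)

# The "everything is bytes with coders" alternative: the engine sees only an
# opaque blob and must round-trip a coder even to read a single field.
def encode(e: Event) -> bytes:
    return json.dumps(
        [e.user_id, e.event_type, e.timestamp_ms, e.raw_payload.hex()]
    ).encode()

def decode(blob: bytes) -> Event:
    user_id, event_type, ts, payload_hex = json.loads(blob)
    return Event(user_id, event_type, ts, bytes.fromhex(payload_hex))

assert decode(encode(event)) == event
assert projected == ("u42", "click")
```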
>>> >
>>> > Kenn
>>> >
>>> > On Tue, Aug 20, 2024 at 9:03 AM Jan Lukavský <je...@seznam.cz> wrote:
>>> >>
>>> >> Formatting and coloring. :)
>>> >>
>>> >> ----
>>> >>
>>> >> Hi XQ,
>>> >>
>>> >> thanks for starting this discussion!
>>> >>
>>> >> I agree we are getting to a point where discussing a major update of Apache Beam might be a good idea. Because such a window of opportunity happens only once in (quite many) years, I think we should use our current experience with the Beam model itself and check if there is any room for improvement there. First of all, some parts of the model are not implemented in Beam 2.0, e.g. retractions. Second, there are parts that are known to be error-prone, e.g. triggers. Another topic is features that are missing from the current model, e.g. iterations (yes, I know, general iterations might not even be possible, but it seems we can create reasonable constraints under which they work for the cases that really matter). Last but not least, we might want to re-think how we structure transforms, because that has a direct impact on how expensive it is to implement a new runner (GBK/Combine vs stateful ParDo).
>>> >>
>>> >> Having said that, my suggestion would be to take a higher-level look first: define which parts of the model are battle-tested enough that we trust them as a definite part of the 3.0 model, question all the others, and then iterate over this to come up with a new proposition of the model, with a focus on what you emphasize - use cases, user-friendly APIs, and concepts with as little unexpected behavior as possible. A key part of this should be a discussion about how we position Beam in the market - simplicity and correctness should be the key points, because practice shows people tend to misunderstand streaming concepts (which is absolutely understandable!).
>>> >>
>>> >> Best,
>>> >>
>>> >> Jan
>>> >>
>>> >> On 8/20/24 14:38, Jan Lukavský wrote:
>>> >>
>>> >> Best,
>>> >>
>>> >> Jan
>>> >>
>>> >> On 8/19/24 23:17, XQ Hu via dev wrote:
>>> >>
>>> >> Hi Beam Community,
>>> >>
>>> >> Lately, I have been thinking about the future of Beam and the potential roadmap towards Beam 3.0. After discussing this with my colleagues at Google, I would like to open a discussion about the path for us to move towards Beam 3.0. As we continue to enhance Beam 2 with new features and improvements, it's important to look ahead and consider the long-term vision for the project.
>>> >>
>>> >> Why Beam 3.0?
>>> >>
>>> >> I think there are several compelling reasons to start planning for Beam 3.0:
>>> >>
>>> >> Opportunity for Major Enhancements: We can introduce significant improvements and innovations.
>>> >>
>>> >> Mature Beam Primitives: We can re-evaluate and refine the core primitives, ensuring their maturity, stability, and ease of use for developers.
>>> >>
>>> >> Enhanced User Experience: We can introduce new features and APIs that significantly improve the developer experience and cater to evolving use cases, particularly in the machine learning domain.
>>> >>
>>> >> Potential Vision for Beam 3
>>> >>
>>> >> Best-in-Class for ML: Empower machine learning users with intuitive Python interfaces for data processing, model deployment, and evaluation.
>>> >>
>>> >> Rich, Portable Transforms: A cross-language library of standardized transforms, easily configured and managed via YAML.
>>> >>
>>> >> Streamlined Core: Simplified Beam primitives with clear semantics for easier development and maintenance.
>>> >>
>>> >> Turnkey Solutions: A curated set of powerful transforms for common data and ML tasks, including use-case-specific solutions.
>>> >>
>>> >> Simplified Streaming: Intuitive interfaces for streaming data with robust support for time-sorted input, metrics, and notifications.
>>> >>
>>> >> Enhanced Single Runner Capabilities: For use cases where a single large box, kept effectively busy, can meet the user's needs.
>>> >>
>>> >> Key Themes
>>> >>
>>> >> User-Centric Design: Enhance the overall developer experience with simplified APIs and streamlined workflows.
>>> >>
>>> >> Runner Consistency: Ensure identical functionality between local and remote runners for seamless development and deployment.
>>> >>
>>> >> Ubiquitous Data Schema: Standardize data schemas for improved interoperability and robustness.
>>> >>
>>> >> Expanded SDK Capabilities: Enrich SDKs with powerful new features like splittable DataFrames, stable-input guarantees, and time-sorted input processing.
>>> >>
>>> >> Thriving Transform Ecosystem: Foster a rich ecosystem of portable, managed turnkey transforms, available across all SDKs.
>>> >>
>>> >> Minimized Operational Overhead: Reduce complexity and maintenance burden by splitting Beam into smaller, more focused repositories.
>>> >>
>>> >> Next Steps:
>>> >>
>>> >> I propose we start by discussing the following:
>>> >>
>>> >> High-Level Goals/Vision/Themes: What are the most important goals and priorities for Beam 3.0?
>>> >>
>>> >> Potential Challenges: What are the biggest challenges we might face during the transition to Beam 3.0?
>>> >>
>>> >> Timeline: What would be a realistic timeline for planning, developing, and releasing Beam 3.0?
>>> >>
>>> >> This email thread is primarily meant to spark conversations about the anticipated features of Beam 3.0; however, there is currently no official timeline commitment. To facilitate the discussions, I created a public doc that we can collaborate on.
>>> >>
>>> >> I am excited to work with all of you to shape the future of Beam and make it an even more powerful and user-friendly data processing framework!
>>> >>
>>> >> Meanwhile, I hope to see many of you at Beam Summit 2024 (https://beamsummit.org/), where we can have more in-depth conversations about the future of Beam.
>>> >>
>>> >> Thanks,
>>> >>
>>> >> XQ Hu (GitHub: liferoad)
>>> >>
>>> >> Public Doc for gathering feedback: [Public] Beam 3.0: a discussion doc (PTAL)