Hey everyone,

During PR reviews, we saw that there are multiple valid alternatives for a dynamic destinations configuration with regard to naming and positioning.
@Robert Bradshaw <rober...@google.com> put together a doc [1] enumerating some of these options -- we hope to get some feedback on what looks best to you!

[1] https://docs.google.com/document/d/1IIn4cjF9eYASnjSmVmmAt6ymFnpBxHgBKVPgpnQ12G4/edit?usp=sharing

On Thu, Aug 29, 2024 at 4:15 PM Ahmed Abualsaud <ahmedabuals...@google.com> wrote:

> Big thanks to everyone for the ongoing discussions in this thread and on
> the doc!
>
> The implementation to enable portable dynamic destinations is now underway
> - GitHub tracker: https://github.com/apache/beam/issues/32365
>
> Best,
> Ahmed
>
> On Wed, Apr 3, 2024 at 1:00 PM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
>
>> On Wed, Apr 3, 2024 at 4:15 AM Kenneth Knowles <k...@apache.org> wrote:
>>
>>> Let me summarize the most recent proposal on-list to frame my question
>>> about this last suggestion. It looks like this:
>>>
>>> 1. user has an element, call it `data`
>>> 2. user maps `data` to an arbitrary metadata row, call it `dest`
>>> 3. we can do things like shuffle on `dest` because it isn't too big
>>> 4. we map `dest` to a concrete destination (aka URL) to write to by a
>>> string format that uses fields of `dest`
>>>
>>> I believe steps 1-3 are identical in expressivity to non-portable
>>> DynamicDestinations. So Reuven the question is for step 4: what are the
>>> mappings from `dest` to URL that cannot be expressed by string formatting
>>> but need SQL or Lua, etc? That would be a useful guide to consideration of
>>> those possibilities.
>>>
>>
>> I think any non-trivial mapping can be done in step 2. It may be possible
>> to come up with a case where something other than string substitution is
>> needed to be done to make dest small enough to shuffle, but I think that'd
>> be a really rare corner case, and then it's just an optimization rather
>> than a feature completeness question.
>>
>>
>>> FWIW I think that even if we add a mini-language, string formatting has
>>> better ease of use (can easily be displayed in UI, etc) so it would be the
>>> first choice, and more advanced stuff is a fallback for rare cases. So they
>>> are both valuable and I'd be happy to implement the easier-to-use path
>>> right away while we discuss.
>>>
>>
>> +1. Note that this even lets us share the config "path/table/..." field
>> that is a static string for non-dynamic destinations.
>>
>> In light of the above, let's avoid a complex mini-language. I'd start
>> with nothing but plugging things in w/o any formatting options.
>>
>>
>>> On Tue, Apr 2, 2024 at 2:59 PM Reuven Lax via dev <dev@beam.apache.org>
>>> wrote:
>>>
>>>> I do suspect that over time we'll find more and more cases we can't
>>>> express, and will be asked to extend this little templating in more
>>>> directions. To head that off - could we easily just reuse an existing
>>>> language (SQL, Lua, something of the form?) instead of creating something
>>>> new?
>>>>
>>>> On Tue, Apr 2, 2024 at 8:55 AM Kenneth Knowles <k...@apache.org> wrote:
>>>>
>>>>> I really like this proposal. I think it has narrowed down and solved
>>>>> the essential problem of not shuffling excess redundant data, and also
>>>>> provides the vast majority of the functionality that a lambda would, with
>>>>> significantly better debuggability and usability too, since the dynamic
>>>>> destination pattern string can be in display data, etc.
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Wed, Mar 27, 2024 at 1:58 PM Robert Bradshaw via dev <
>>>>> dev@beam.apache.org> wrote:
>>>>>
>>>>>> On Wed, Mar 27, 2024 at 10:20 AM Reuven Lax <re...@google.com> wrote:
>>>>>>
>>>>>>> Can the prefix still be generated programmatically at graph creation
>>>>>>> time?
>>>>>>>
>>>>>>
>>>>>> Yes. It's just a property of the transform passed by the user at
>>>>>> configuration time.
>>>>>>
>>>>>>
>>>>>>> On Wed, Mar 27, 2024 at 9:40 AM Robert Bradshaw <rober...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> On Wed, Mar 27, 2024 at 9:12 AM Reuven Lax <re...@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> This does seem like the best compromise, though I think there will
>>>>>>>>> still end up being performance issues. A common pattern I've seen is
>>>>>>>>> that there is a long common prefix to the dynamic destination,
>>>>>>>>> followed by the dynamic component, e.g. the destination might be
>>>>>>>>> long/common/path/to/destination/files/<per-user-file>. In this case,
>>>>>>>>> the prefix is often much larger than the messages themselves and is
>>>>>>>>> what gets effectively encoded in the lambda.
>>>>>>>>>
>>>>>>>>
>>>>>>>> The idea here is that the destination would be given as a format
>>>>>>>> string, say, "long/common/path/to/destination/files/{dest_info.user}".
>>>>>>>> Another way to put this is that we support (only) "lambdas" that are
>>>>>>>> represented as string substitutions. (The fact that dest_info does not
>>>>>>>> have to be part of the record, and can be the output of an arbitrary
>>>>>>>> map if need be, makes this restriction not so bad.)
>>>>>>>>
>>>>>>>> As well as solving the performance issues, I think this is actually
>>>>>>>> a pretty convenient and natural way for the user to name their
>>>>>>>> destination (for the common use case, even easier than providing a
>>>>>>>> lambda), and has the benefit of being much more transparent than an
>>>>>>>> arbitrary callable for introspection (by both machines and humans
>>>>>>>> that may look at the resulting pipeline).
>>>>>>>>
>>>>>>>>
>>>>>>>>> I'm not entirely sure how to address this in a portable context.
>>>>>>>>> We might simply have to accept the extra overhead when going cross
>>>>>>>>> language.
>>>>>>>>>
>>>>>>>>> Reuven
>>>>>>>>>
>>>>>>>>> On Wed, Mar 27, 2024 at 8:51 AM Robert Bradshaw via dev <
>>>>>>>>> dev@beam.apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for putting this together, it will be a really
>>>>>>>>>> useful feature to have.
>>>>>>>>>>
>>>>>>>>>> I am in favor of the string-pattern approaches. I think we need
>>>>>>>>>> to support both the {record=..., dest_info=...} and the elide-fields
>>>>>>>>>> approaches, as the former is nicer when one has a fixed
>>>>>>>>>> representation for the output record (e.g. a proto or avro schema)
>>>>>>>>>> and the flattened form for ease of use in more free-form contexts
>>>>>>>>>> (e.g. when producing records from YAML and SQL).
>>>>>>>>>>
>>>>>>>>>> Also left some comments on the doc.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 27, 2024 at 6:51 AM Ahmed Abualsaud via dev <
>>>>>>>>>> dev@beam.apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey all,
>>>>>>>>>>>
>>>>>>>>>>> There have been some conversations lately about how best to
>>>>>>>>>>> enable dynamic destinations in a portable context. Usually, this
>>>>>>>>>>> comes up for cross-language transforms and more recently for
>>>>>>>>>>> Beam YAML.
>>>>>>>>>>>
>>>>>>>>>>> I've started a short doc outlining some routes we could take.
>>>>>>>>>>> The purpose is to establish a good standard for supporting dynamic
>>>>>>>>>>> destinations with portability, one that can be applied to most use
>>>>>>>>>>> cases and IOs. Please take a look and add any thoughts!
>>>>>>>>>>>
>>>>>>>>>>> https://s.apache.org/portable-dynamic-destinations
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Ahmed
>>>>>>>>>>>
>>>>>>>>>>
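
For readers following the thread, the string-substitution proposal (steps 1-4 in Kenneth's summary, using Robert's example pattern) can be sketched in plain Python. This is only an illustration under stated assumptions, not Beam's API: `DestInfo`, `to_dest`, `resolve_destination`, and `group_by_destination` are hypothetical names, the in-memory grouping merely stands in for a shuffle, and Python's built-in `str.format` stands in for whatever substitution mechanism the real implementation ends up using.

```python
from collections import defaultdict
from typing import NamedTuple


class DestInfo(NamedTuple):
    """Hypothetical small metadata row (`dest` / `dest_info` in the thread)."""
    user: str


# Pattern quoted from the thread: plain field substitution, no formatting options.
PATTERN = "long/common/path/to/destination/files/{dest_info.user}"


def to_dest(data: dict) -> DestInfo:
    """Step 2: map an element to a small metadata row (arbitrary user logic)."""
    return DestInfo(user=data["user"].lower())


def resolve_destination(pattern: str, dest_info: DestInfo) -> str:
    """Step 4: build the concrete destination by substituting fields of the row."""
    return pattern.format(dest_info=dest_info)


def group_by_destination(elements, pattern=PATTERN):
    """Step 3 stand-in: group on the small row, then resolve once per group."""
    groups = defaultdict(list)
    for data in elements:
        groups[to_dest(data)].append(data)
    return {resolve_destination(pattern, dest): rows for dest, rows in groups.items()}


if __name__ == "__main__":
    elements = [
        {"user": "Alice", "value": 1},
        {"user": "bob", "value": 2},
        {"user": "alice", "value": 3},
    ]
    for destination, rows in group_by_destination(elements).items():
        print(destination, rows)
```

In this sketch only the short `DestInfo` row is used as the grouping key, so the long common prefix lives once in the pattern rather than in every key, which is the performance concern Reuven raised. Any non-trivial logic (here, lower-casing the user name) sits in the step 2 mapping, as Robert suggested.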