Let me summarize the most recent proposal on-list to frame my question about this last suggestion. It looks like this:
1. user has an element, call it `data`
2. user maps `data` to an arbitrary metadata row, call it `dest`
3. we can do things like shuffle on `dest` because it isn't too big
4. we map `dest` to a concrete destination (aka URL) to write to by a string format that uses fields of `dest`

I believe steps 1-3 are identical in expressivity to non-portable DynamicDestinations. So, Reuven, the question is about step 4: what are the mappings from `dest` to URL that cannot be expressed by string formatting but need SQL or Lua, etc? That would be a useful guide to consideration of those possibilities.

FWIW I think that even if we add a mini-language, string formatting has better ease of use (can easily be displayed in UI, etc) so it would be the first choice, and more advanced stuff is a fallback for rare cases. So they are both valuable and I'd be happy to implement the easier-to-use path right away while we discuss.

Kenn

On Tue, Apr 2, 2024 at 2:59 PM Reuven Lax via dev <dev@beam.apache.org> wrote:

> I do suspect that over time we'll find more and more cases we can't
> express, and will be asked to extend this little templating in more
> directions. To head that off - could we easily just reuse an existing
> language (SQL, LUA, something of the form?) instead of creating something
> new?
>
> On Tue, Apr 2, 2024 at 8:55 AM Kenneth Knowles <k...@apache.org> wrote:
>
>> I really like this proposal. I think it has narrowed down and solved the
>> essential problem of not shuffling excess redundant data, and also provides
>> the vast majority of the functionality that a lambda would, with
>> significantly better debugability and usability too, since the dynamic
>> destination pattern string can be in display data, etc.
>>
>> Kenn
>>
>> On Wed, Mar 27, 2024 at 1:58 PM Robert Bradshaw via dev <
>> dev@beam.apache.org> wrote:
>>
>>> On Wed, Mar 27, 2024 at 10:20 AM Reuven Lax <re...@google.com> wrote:
>>>
>>>> Can the prefix still be generated programmatically at graph creation
>>>> time?
>>>>
>>>
>>> Yes. It's just a property of the transform passed by the user at
>>> configuration time.
>>>
>>>
>>>> On Wed, Mar 27, 2024 at 9:40 AM Robert Bradshaw <rober...@google.com>
>>>> wrote:
>>>>
>>>>> On Wed, Mar 27, 2024 at 9:12 AM Reuven Lax <re...@google.com> wrote:
>>>>>
>>>>>> This does seem like the best compromise, though I think there will
>>>>>> still end up being performance issues. A common pattern I've seen is that
>>>>>> there is a long common prefix to the dynamic destination followed the
>>>>>> dynamic component. e.g. the destination might be
>>>>>> long/common/path/to/destination/files/<per-user-file>. In this case, the
>>>>>> prefix is often much larger than messages themselves and is what gets
>>>>>> effectively encoded in the lambda.
>>>>>>
>>>>>
>>>>> The idea here is that the destination would be given as a format
>>>>> string, say, "long/common/path/to/destination/files/{dest_info.user}".
>>>>> Another way to put this is that we support (only) "lambdas" that are
>>>>> represented as string substitutions. (The fact that dest_info does not have
>>>>> to be part of the record, and can be the output of an arbitrary map if need
>>>>> be, makes this restriction not so bad.)
>>>>>
>>>>> As well as solving the performance issues, I think this is actually a
>>>>> pretty convenient and natural way for the user to name their destination
>>>>> (for the common usecase, even easier than providing a lambda), and has the
>>>>> benefit of being much more transparent than an arbitrary callable as well
>>>>> for introspection (for both machine and human that may look at the
>>>>> resulting pipeline).
>>>>>
>>>>>
>>>>>> I'm not entirely sure how to address this in a portable context. We
>>>>>> might simply have to accept the extra overhead when going cross language.
>>>>>>
>>>>>> Reuven
>>>>>>
>>>>>> On Wed, Mar 27, 2024 at 8:51 AM Robert Bradshaw via dev <
>>>>>> dev@beam.apache.org> wrote:
>>>>>>
>>>>>>> Thanks for putting this together, it will be a really useful feature
>>>>>>> to have.
>>>>>>>
>>>>>>> I am in favor of the string-pattern approaches. I think we need to
>>>>>>> support both the {record=..., dest_info=...} and the elide-fields
>>>>>>> approaches, as the former is nicer when one has a fixed representation for
>>>>>>> the output record (e.g. a proto or avro schema) and the flattened form for
>>>>>>> ease of use in more free-form contexts (e.g. when producing records from
>>>>>>> YAML and SQL).
>>>>>>>
>>>>>>> Also left some comments on the doc.
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Mar 27, 2024 at 6:51 AM Ahmed Abualsaud via dev <
>>>>>>> dev@beam.apache.org> wrote:
>>>>>>>
>>>>>>>> Hey all,
>>>>>>>>
>>>>>>>> There have been some conversations lately about how best to enable
>>>>>>>> dynamic destinations in a portable context. Usually, this comes up for
>>>>>>>> cross-language transforms and more recently for Beam YAML.
>>>>>>>>
>>>>>>>> I've started a short doc outlining some routes we could take. The
>>>>>>>> purpose is to establish a good standard for supporting dynamic destinations
>>>>>>>> with portability, one that can be applied to most use cases and IOs. Please
>>>>>>>> take a look and add any thoughts!
>>>>>>>>
>>>>>>>> https://s.apache.org/portable-dynamic-destinations
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Ahmed
>>>>>>>>
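
To make the four steps summarized at the top of this message concrete, here is a minimal Python sketch. It is not Beam API: `Dest`, `to_dest`, `resolve`, `group_by_destination`, and the `region` field are hypothetical names chosen for illustration, and the in-memory grouping only stands in for a real shuffle. The path prefix is borrowed from the example quoted in the thread.

```python
from collections import defaultdict
from typing import NamedTuple


class Dest(NamedTuple):
    """Hypothetical small metadata row (step 2); the fields are made up."""
    user: str
    region: str


# Step 4: the only "lambda" allowed is a string substitution over fields of dest.
DEST_PATTERN = "long/common/path/to/destination/files/{dest.region}/{dest.user}"


def to_dest(data: dict) -> Dest:
    """Step 2: an arbitrary user mapping from an element to its metadata row."""
    return Dest(user=data["user"], region=data["region"].lower())


def resolve(pattern: str, dest: Dest) -> str:
    """Step 4: map dest to a concrete destination by plain string formatting."""
    return pattern.format(dest=dest)


def group_by_destination(elements, pattern=DEST_PATTERN):
    """Run steps 1-4 over an in-memory list, standing in for a pipeline."""
    grouped = defaultdict(list)
    for data in elements:                 # step 1: the element
        dest = to_dest(data)              # step 2: element -> small dest row
        grouped[dest].append(data)        # step 3: group on dest, not the path
    # Step 4 runs once per distinct dest, so the long prefix is never shuffled.
    return {resolve(pattern, dest): rows for dest, rows in grouped.items()}
```

Because the pattern is an ordinary string, it can be shown in display data as-is, while anything it cannot express has to be pushed into the step 2 mapping.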