Hey everyone,

During PR reviews, we saw that there are multiple valid alternatives for a
dynamic destinations configuration with regard to naming and positioning.

@Robert Bradshaw <rober...@google.com> put together a doc [1] enumerating
some of these options -- we hope to get some feedback on what looks best to
you!

[1]
https://docs.google.com/document/d/1IIn4cjF9eYASnjSmVmmAt6ymFnpBxHgBKVPgpnQ12G4/edit?usp=sharing

On Thu, Aug 29, 2024 at 4:15 PM Ahmed Abualsaud <ahmedabuals...@google.com>
wrote:

> Big thanks to everyone for the ongoing discussions in this thread and on
> the doc!
>
> The implementation to enable portable dynamic destinations is now underway
> - GitHub tracker: https://github.com/apache/beam/issues/32365
>
> Best,
> Ahmed
>
> On Wed, Apr 3, 2024 at 1:00 PM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
>
>> On Wed, Apr 3, 2024 at 4:15 AM Kenneth Knowles <k...@apache.org> wrote:
>>
>>> Let me summarize the most recent proposal on-list to frame my question
>>> about this last suggestion. It looks like this:
>>>
>>> 1. user has an element, call it `data`
>>> 2. user maps `data` to an arbitrary metadata row, call it `dest`
>>> 3. we can do things like shuffle on `dest` because it isn't too big
>>> 4. we map `dest` to a concrete destination (aka URL) to write to by a
>>> string format that uses fields of `dest`
>>>
>>> I believe steps 1-3 are identical in expressivity to non-portable
>>> DynamicDestinations. So Reuven the question is for step 4: what are the
>>> mappings from `dest` to URL that cannot be expressed by string formatting
>>> but need SQL or Lua, etc? That would be a useful guide to consideration of
>>> those possibilities.
>>>
>>
>> I think any non-trivial mapping can be done in step 2. It may be possible
>> to come up with a case where something other than string substitution is
>> needed to be done to make dest small enough to shuffle, but I think that'd
>> be a really rare corner case, and then it's just an optimization rather
>> than a feature-completeness question.
>>
>>
>>> FWIW I think that even if we add a mini-language, string formatting has
>>> better ease of use (it can easily be displayed in a UI, etc.), so it would
>>> be the first choice, with more advanced stuff as a fallback for rare
>>> cases. So they are both valuable and I'd be happy to implement the
>>> easier-to-use path right away while we discuss.
>>>
>>
>> +1. Note that this even lets us share the config "path/table/..." field
>> that is a static string for non-dynamic destinations.
>>
>> In light of the above, let's avoid a complex mini-language. I'd start
>> with nothing but plugging things in w/o any formatting options.
>>
>>
>>> On Tue, Apr 2, 2024 at 2:59 PM Reuven Lax via dev <dev@beam.apache.org>
>>> wrote:
>>>
>>>> I do suspect that over time we'll find more and more cases we can't
>>>> express, and will be asked to extend this little templating in more
>>>> directions. To head that off - could we easily just reuse an existing
>>>> language (SQL, Lua, something of that form?) instead of creating
>>>> something new?
>>>>
>>>> On Tue, Apr 2, 2024 at 8:55 AM Kenneth Knowles <k...@apache.org> wrote:
>>>>
>>>>> I really like this proposal. I think it has narrowed down and solved
>>>>> the essential problem of not shuffling excess redundant data, and also
>>>>> provides the vast majority of the functionality that a lambda would, with
>>>>> significantly better debuggability and usability too, since the dynamic
>>>>> destination pattern string can be in display data, etc.
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Wed, Mar 27, 2024 at 1:58 PM Robert Bradshaw via dev <
>>>>> dev@beam.apache.org> wrote:
>>>>>
>>>>>> On Wed, Mar 27, 2024 at 10:20 AM Reuven Lax <re...@google.com> wrote:
>>>>>>
>>>>>>> Can the prefix still be generated programmatically at graph creation
>>>>>>> time?
>>>>>>>
>>>>>>
>>>>>> Yes. It's just a property of the transform passed by the user at
>>>>>> configuration time.
>>>>>>
>>>>>>
>>>>>>> On Wed, Mar 27, 2024 at 9:40 AM Robert Bradshaw <rober...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> On Wed, Mar 27, 2024 at 9:12 AM Reuven Lax <re...@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> This does seem like the best compromise, though I think there will
>>>>>>>>> still end up being performance issues. A common pattern I've seen is
>>>>>>>>> that there is a long common prefix to the dynamic destination followed
>>>>>>>>> by the dynamic component, e.g. the destination might be
>>>>>>>>> long/common/path/to/destination/files/<per-user-file>. In this case,
>>>>>>>>> the prefix is often much larger than the messages themselves and is
>>>>>>>>> what gets effectively encoded in the lambda.
>>>>>>>>>
>>>>>>>>
>>>>>>>> The idea here is that the destination would be given as a format
>>>>>>>> string, say, "long/common/path/to/destination/files/{dest_info.user}".
>>>>>>>> Another way to put this is that we support (only) "lambdas" that are
>>>>>>>> represented as string substitutions. (The fact that dest_info does not
>>>>>>>> have to be part of the record, and can be the output of an arbitrary map
>>>>>>>> if need be, makes this restriction not so bad.)
>>>>>>>>
>>>>>>>> As well as solving the performance issues, I think this is actually
>>>>>>>> a pretty convenient and natural way for the user to name their
>>>>>>>> destination (for the common use case, even easier than providing a
>>>>>>>> lambda), and it has the added benefit of being much more transparent
>>>>>>>> than an arbitrary callable for introspection (by both machines and
>>>>>>>> humans that may look at the resulting pipeline).
>>>>>>>>
>>>>>>>>
>>>>>>>>> I'm not entirely sure how to address this in a portable context.
>>>>>>>>> We might simply have to accept the extra overhead when going
>>>>>>>>> cross-language.
>>>>>>>>>
>>>>>>>>> Reuven
>>>>>>>>>
>>>>>>>>> On Wed, Mar 27, 2024 at 8:51 AM Robert Bradshaw via dev <
>>>>>>>>> dev@beam.apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for putting this together, it will be a really
>>>>>>>>>> useful feature to have.
>>>>>>>>>>
>>>>>>>>>> I am in favor of the string-pattern approaches. I think we need
>>>>>>>>>> to support both the {record=..., dest_info=...} and the elide-fields
>>>>>>>>>> approaches, as the former is nicer when one has a fixed
>>>>>>>>>> representation for the output record (e.g. a proto or Avro schema)
>>>>>>>>>> and the latter, flattened form is easier to use in more free-form
>>>>>>>>>> contexts (e.g. when producing records from YAML and SQL).
>>>>>>>>>>
>>>>>>>>>> Also left some comments on the doc.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 27, 2024 at 6:51 AM Ahmed Abualsaud via dev <
>>>>>>>>>> dev@beam.apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey all,
>>>>>>>>>>>
>>>>>>>>>>> There have been some conversations lately about how best to
>>>>>>>>>>> enable dynamic destinations in a portable context. Usually, this
>>>>>>>>>>> comes up for cross-language transforms and more recently for Beam
>>>>>>>>>>> YAML.
>>>>>>>>>>>
>>>>>>>>>>> I've started a short doc outlining some routes we could take.
>>>>>>>>>>> The purpose is to establish a good standard for supporting dynamic
>>>>>>>>>>> destinations with portability, one that can be applied to most use
>>>>>>>>>>> cases and IOs. Please take a look and add any thoughts!
>>>>>>>>>>>
>>>>>>>>>>> https://s.apache.org/portable-dynamic-destinations
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Ahmed
>>>>>>>>>>>
>>>>>>>>>>
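
For anyone skimming the thread, here is a rough sketch in plain Python of the
four steps Kenn summarized (element -> small `dest` row -> group on `dest` ->
plug `dest` fields into a pattern string). It is only an illustration and not
Beam's API or the implementation tracked in the GitHub issue: the names
`resolve_destination` and `group_by_destination`, the flat `{user}`
placeholder, and the dict-based rows are all assumptions made for this example.
Following Robert's suggestion, it plugs values in without any formatting
options.

```python
import re
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List, Mapping

# Hypothetical placeholder syntax: a bare field name in braces, e.g. {user}.
_FIELD = re.compile(r"\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_destination(template: str, dest: Mapping[str, Any]) -> str:
    """Step 4: substitute fields of `dest` into `template` (no format specs)."""
    return _FIELD.sub(lambda m: str(dest[m.group(1)]), template)


def group_by_destination(
    elements: Iterable[Dict[str, Any]],
    to_dest: Callable[[Dict[str, Any]], Dict[str, Any]],
    template: str,
) -> Dict[str, List[Dict[str, Any]]]:
    """Steps 1-4 on an in-memory list, standing in for a real shuffle."""
    groups = defaultdict(list)
    for data in elements:                      # step 1: the element
        dest = to_dest(data)                   # step 2: small metadata row
        key = tuple(sorted(dest.items()))      # step 3: group on the small key
        groups[key].append(data)
    # Step 4: the long common prefix lives only in the template, so it is
    # never carried alongside each element.
    return {
        resolve_destination(template, dict(key)): rows
        for key, rows in groups.items()
    }
```

Any non-trivial logic would go into `to_dest` (step 2), which is why plain
substitution can be enough for the last step.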
