Re: Supporting Dynamic Destinations in a portable context

Kenneth Knowles Wed, 03 Apr 2024 04:15:49 -0700

Let me summarize the most recent proposal on-list to frame my question
about this last suggestion. It looks like this:


1. user has an element, call it `data`
2. user maps `data` to an arbitrary metadata row, call it `dest`
3. we can do things like shuffle on `dest` because it isn't too big
4. we map `dest` to a concrete destination (aka URL) to write to by a
string format that uses fields of `dest`
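
To make those four steps concrete, here is a rough sketch in plain Python
(not Beam API). The `Dest` row, its `user` and `region` fields, the pattern
string, and all function names are hypothetical; the dict-based grouping only
stands in for a real shuffle.

```python
from collections import defaultdict
from typing import NamedTuple


class Dest(NamedTuple):
    """Hypothetical small metadata row derived from each element (step 2)."""
    user: str
    region: str


# Hypothetical pattern: only fields of `dest` may appear in it (step 4).
PATTERN = "long/common/path/to/destination/files/{dest.region}/{dest.user}"


def to_dest(data: dict) -> Dest:
    """Step 2: arbitrary user mapping from an element to its metadata row."""
    return Dest(user=data["user"], region=data["region"])


def group_by_dest(elements):
    """Step 3: stand-in for a shuffle, keyed on the small `dest` row."""
    groups = defaultdict(list)
    for data in elements:
        groups[to_dest(data)].append(data)
    return groups


def resolve(dest: Dest, pattern: str = PATTERN) -> str:
    """Step 4: plain string formatting over the fields of `dest`."""
    return pattern.format(dest=dest)


def plan_writes(elements):
    """Map each concrete destination to the elements that belong there."""
    return {resolve(dest): rows for dest, rows in group_by_dest(elements).items()}
```

Note that the long common prefix lives only in the pattern, so the key that
gets shuffled stays as small as `dest` itself.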

I believe steps 1-3 are identical in expressivity to non-portable
DynamicDestinations. So, Reuven, the question is about step 4: what are the
mappings from `dest` to URL that cannot be expressed by string formatting
but need SQL or Lua, etc.? Examples would be a useful guide when considering
those possibilities.

FWIW, I think that even if we add a mini-language, string formatting is
easier to use (it can easily be displayed in a UI, etc.), so it would be the
first choice, with more advanced stuff as a fallback for rare cases. Both
are valuable, and I'd be happy to implement the easier-to-use path right
away while we discuss.

Kenn

On Tue, Apr 2, 2024 at 2:59 PM Reuven Lax via dev <[email protected]>
wrote:

> I do suspect that over time we'll find more and more cases we can't
> express, and will be asked to extend this little templating in more
> directions. To head that off - could we easily just reuse an existing
> language (SQL, Lua, something of that form?) instead of creating something
> new?
>
> On Tue, Apr 2, 2024 at 8:55 AM Kenneth Knowles <[email protected]> wrote:
>
>> I really like this proposal. I think it has narrowed down and solved the
>> essential problem of not shuffling excess redundant data, and also provides
>> the vast majority of the functionality that a lambda would, with
>> significantly better debuggability and usability too, since the dynamic
>> destination pattern string can be in display data, etc.
>>
>> Kenn
>>
>> On Wed, Mar 27, 2024 at 1:58 PM Robert Bradshaw via dev <
>> [email protected]> wrote:
>>
>>> On Wed, Mar 27, 2024 at 10:20 AM Reuven Lax <[email protected]> wrote:
>>>
>>>> Can the prefix still be generated programmatically at graph creation
>>>> time?
>>>>
>>>
>>> Yes. It's just a property of the transform passed by the user at
>>> configuration time.
>>>
>>>
>>>> On Wed, Mar 27, 2024 at 9:40 AM Robert Bradshaw <[email protected]>
>>>> wrote:
>>>>
>>>>> On Wed, Mar 27, 2024 at 9:12 AM Reuven Lax <[email protected]> wrote:
>>>>>
>>>>>> This does seem like the best compromise, though I think there will
>>>>>> still end up being performance issues. A common pattern I've seen is that
>>>>>> there is a long common prefix to the dynamic destination followed by the
>>>>>> dynamic component. e.g. the destination might be
>>>>>> long/common/path/to/destination/files/<per-user-file>. In this case, the
>>>>>> prefix is often much larger than messages themselves and is what gets
>>>>>> effectively encoded in the lambda.
>>>>>>
>>>>>
>>>>> The idea here is that the destination would be given as a format
>>>>> string, say, "long/common/path/to/destination/files/{dest_info.user}".
>>>>> Another way to put this is that we support (only) "lambdas" that are
>>>>> represented as string substitutions. (The fact that dest_info does not
>>>>> have to be part of the record, and can be the output of an arbitrary map
>>>>> if need be, makes this restriction not so bad.)
>>>>>
>>>>> As well as solving the performance issues, I think this is actually a
>>>>> pretty convenient and natural way for the user to name their destination
>>>>> (for the common use case, even easier than providing a lambda), and has the
>>>>> benefit of being much more transparent than an arbitrary callable as well
>>>>> for introspection (for both machine and human that may look at the
>>>>> resulting pipeline).
>>>>>
>>>>>
>>>>>> I'm not entirely sure how to address this in a portable context. We
>>>>>> might simply have to accept the extra overhead when going cross language.
>>>>>>
>>>>>> Reuven
>>>>>>
>>>>>> On Wed, Mar 27, 2024 at 8:51 AM Robert Bradshaw via dev <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Thanks for putting this together, it will be a really useful feature
>>>>>>> to have.
>>>>>>>
>>>>>>> I am in favor of the string-pattern approaches. I think we need to
>>>>>>> support both the {record=..., dest_info=...} and the elide-fields
>>>>>>> approaches, as the former is nicer when one has a fixed representation
>>>>>>> for the output record (e.g. a proto or Avro schema) and the flattened
>>>>>>> form is easier to use in more free-form contexts (e.g. when producing
>>>>>>> records from YAML and SQL).
>>>>>>>
>>>>>>> Also left some comments on the doc.
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Mar 27, 2024 at 6:51 AM Ahmed Abualsaud via dev <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hey all,
>>>>>>>>
>>>>>>>> There have been some conversations lately about how best to enable
>>>>>>>> dynamic destinations in a portable context. Usually, this comes up for
>>>>>>>> cross-language transforms and more recently for Beam YAML.
>>>>>>>>
>>>>>>>> I've started a short doc outlining some routes we could take. The
>>>>>>>> purpose is to establish a good standard for supporting dynamic
>>>>>>>> destinations with portability, one that can be applied to most use
>>>>>>>> cases and IOs. Please take a look and add any thoughts!
>>>>>>>>
>>>>>>>> https://s.apache.org/portable-dynamic-destinations
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Ahmed
>>>>>>>>
>>>>>>>
