Re: Re: [DISCUSS] KIP-821: Connect Transforms support for nested structures

2023-01-25 Thread Jorge Esteban Quilcate Otoya
Hi there,

Bumping this thread for visibility.

Cheers,
Jorge

On Fri, 2 Sep 2022, 18:01 Chris Egerton,  wrote:

> Hi Jorge,
>
> One tiny nit, but LGTM otherwise:
>
> The KIP mentions backslashes as "(/)"; shouldn't this be "(\)"?
>
> I'll cast a +1 on the vote thread anyways; I'm sure this won't block us.
>
> Cheers, and thanks for all your hard work on this!
>
> Chris
>
> On Thu, Sep 1, 2022 at 1:33 PM Jorge Esteban Quilcate Otoya <
> quilcate.jo...@gmail.com> wrote:
>
> > Hi Chris,
> >
> > Thanks for your feedback!
> >
> > 1. Yes, it will be context-dependent. I have added rules and scenarios to
> > the nested notation to cover the happy path and edge cases. In short,
> > backticks will be not be considered as part of the field name when they
> are
> > wrapping a field name: first backtick at the beginning of the path or
> after
> > a dot, and closing backtick before a dot or at the end of the path.
> > If backticks happen to be in those positions, use backslash to escape
> them.
> > 2. You're right, that's a typo. Fixing it.
> > 3. I don't think so, I have added a scenario to clarify this.
> >
> > KIP is updated. Hopefully the rules and scenarios help to close any open
> > gap. Let me know if you see any cases that is not considered to address
> it.
> >
> > Cheers,
> > Jorge.
> >
> > On Wed, 31 Aug 2022 at 20:02, Chris Egerton 
> > wrote:
> >
> > > Hi Robert and Jorge,
> > >
> > > I think the backtick/backslash proposal works, but I'm a little unclear
> > on
> > > some of the details:
> > >
> > > 1. Are backticks only given special treatment when they immediately
> > follow
> > > a non-escaped dot? E.g., "foo.b`ar.ba`z" would refer to "foo" ->
> "b`ar"
> > ->
> > > "ba`z" instead of "foo" -> "bar.baz"? Based on the example where the
> name
> > > "a.b`.c" refers to "a" -> "b`" -> "c", it seems like this is the case,
> > but
> > > I'm worried this might cause confusion since the role of the backtick
> and
> > > the need to escape it becomes context-dependent.
> > >
> > > 2. In the example table near the beginning of the KIP, the name
> > "a.`b\`.c`"
> > > refers to "a" -> "b`c". What happened to the dot in the second part of
> > the
> > > name? Should it refer to "a" -> "b`.c" instead?
> > >
> > > 3. Is it ever necessary to escape backslashes themselves? If so, when?
> > >
> > > Overall, I wish we could come up with a prettier/simpler approach, but
> > the
> > > benefits provided by the dual backtick/dot syntax are too great to
> deny:
> > > there are no correctness issues like the ones posed with double-dot
> > > escaping that would lead to ambiguity, the most common cases are still
> > very
> > > simple to work with, and there's no risk of interfering with JSON
> escape
> > > mechanisms (in most cases) or single-quote shell quoting (which may be
> > > relevant when connector configurations are defined on the command
> line).
> > > Thanks for the suggestion, Robert!
> > >
> > > Cheers,
> > >
> > > Chris
> > >
> >
>


Re: Re: [DISCUSS] KIP-821: Connect Transforms support for nested structures

2022-09-02 Thread Chris Egerton
Hi Jorge,

One tiny nit, but LGTM otherwise:

The KIP mentions backslashes as "(/)"; shouldn't this be "(\)"?

I'll cast a +1 on the vote thread anyways; I'm sure this won't block us.

Cheers, and thanks for all your hard work on this!

Chris

On Thu, Sep 1, 2022 at 1:33 PM Jorge Esteban Quilcate Otoya <
quilcate.jo...@gmail.com> wrote:

> Hi Chris,
>
> Thanks for your feedback!
>
> 1. Yes, it will be context-dependent. I have added rules and scenarios to
> the nested notation to cover the happy path and edge cases. In short,
> backticks will be not be considered as part of the field name when they are
> wrapping a field name: first backtick at the beginning of the path or after
> a dot, and closing backtick before a dot or at the end of the path.
> If backticks happen to be in those positions, use backslash to escape them.
> 2. You're right, that's a typo. Fixing it.
> 3. I don't think so, I have added a scenario to clarify this.
>
> KIP is updated. Hopefully the rules and scenarios help to close any open
> gap. Let me know if you see any cases that is not considered to address it.
>
> Cheers,
> Jorge.
>
> On Wed, 31 Aug 2022 at 20:02, Chris Egerton 
> wrote:
>
> > Hi Robert and Jorge,
> >
> > I think the backtick/backslash proposal works, but I'm a little unclear
> on
> > some of the details:
> >
> > 1. Are backticks only given special treatment when they immediately
> follow
> > a non-escaped dot? E.g., "foo.b`ar.ba`z" would refer to "foo" -> "b`ar"
> ->
> > "ba`z" instead of "foo" -> "bar.baz"? Based on the example where the name
> > "a.b`.c" refers to "a" -> "b`" -> "c", it seems like this is the case,
> but
> > I'm worried this might cause confusion since the role of the backtick and
> > the need to escape it becomes context-dependent.
> >
> > 2. In the example table near the beginning of the KIP, the name
> "a.`b\`.c`"
> > refers to "a" -> "b`c". What happened to the dot in the second part of
> the
> > name? Should it refer to "a" -> "b`.c" instead?
> >
> > 3. Is it ever necessary to escape backslashes themselves? If so, when?
> >
> > Overall, I wish we could come up with a prettier/simpler approach, but
> the
> > benefits provided by the dual backtick/dot syntax are too great to deny:
> > there are no correctness issues like the ones posed with double-dot
> > escaping that would lead to ambiguity, the most common cases are still
> very
> > simple to work with, and there's no risk of interfering with JSON escape
> > mechanisms (in most cases) or single-quote shell quoting (which may be
> > relevant when connector configurations are defined on the command line).
> > Thanks for the suggestion, Robert!
> >
> > Cheers,
> >
> > Chris
> >
>


Re: Re: [DISCUSS] KIP-821: Connect Transforms support for nested structures

2022-09-01 Thread Jorge Esteban Quilcate Otoya
Hi Chris,

Thanks for your feedback!

1. Yes, it will be context-dependent. I have added rules and scenarios to
the nested notation to cover the happy path and edge cases. In short,
backticks will be not be considered as part of the field name when they are
wrapping a field name: first backtick at the beginning of the path or after
a dot, and closing backtick before a dot or at the end of the path.
If backticks happen to be in those positions, use backslash to escape them.
2. You're right, that's a typo. Fixing it.
3. I don't think so, I have added a scenario to clarify this.

KIP is updated. Hopefully the rules and scenarios help to close any open
gap. Let me know if you see any cases that is not considered to address it.

Cheers,
Jorge.

On Wed, 31 Aug 2022 at 20:02, Chris Egerton  wrote:

> Hi Robert and Jorge,
>
> I think the backtick/backslash proposal works, but I'm a little unclear on
> some of the details:
>
> 1. Are backticks only given special treatment when they immediately follow
> a non-escaped dot? E.g., "foo.b`ar.ba`z" would refer to "foo" -> "b`ar" ->
> "ba`z" instead of "foo" -> "bar.baz"? Based on the example where the name
> "a.b`.c" refers to "a" -> "b`" -> "c", it seems like this is the case, but
> I'm worried this might cause confusion since the role of the backtick and
> the need to escape it becomes context-dependent.
>
> 2. In the example table near the beginning of the KIP, the name "a.`b\`.c`"
> refers to "a" -> "b`c". What happened to the dot in the second part of the
> name? Should it refer to "a" -> "b`.c" instead?
>
> 3. Is it ever necessary to escape backslashes themselves? If so, when?
>
> Overall, I wish we could come up with a prettier/simpler approach, but the
> benefits provided by the dual backtick/dot syntax are too great to deny:
> there are no correctness issues like the ones posed with double-dot
> escaping that would lead to ambiguity, the most common cases are still very
> simple to work with, and there's no risk of interfering with JSON escape
> mechanisms (in most cases) or single-quote shell quoting (which may be
> relevant when connector configurations are defined on the command line).
> Thanks for the suggestion, Robert!
>
> Cheers,
>
> Chris
>


RE: Re: [DISCUSS] KIP-821: Connect Transforms support for nested structures

2022-08-31 Thread Chris Egerton
Hi Robert and Jorge,

I think the backtick/backslash proposal works, but I'm a little unclear on
some of the details:

1. Are backticks only given special treatment when they immediately follow
a non-escaped dot? E.g., "foo.b`ar.ba`z" would refer to "foo" -> "b`ar" ->
"ba`z" instead of "foo" -> "bar.baz"? Based on the example where the name
"a.b`.c" refers to "a" -> "b`" -> "c", it seems like this is the case, but
I'm worried this might cause confusion since the role of the backtick and
the need to escape it becomes context-dependent.

2. In the example table near the beginning of the KIP, the name "a.`b\`.c`"
refers to "a" -> "b`c". What happened to the dot in the second part of the
name? Should it refer to "a" -> "b`.c" instead?

3. Is it ever necessary to escape backslashes themselves? If so, when?

Overall, I wish we could come up with a prettier/simpler approach, but the
benefits provided by the dual backtick/dot syntax are too great to deny:
there are no correctness issues like the ones posed with double-dot
escaping that would lead to ambiguity, the most common cases are still very
simple to work with, and there's no risk of interfering with JSON escape
mechanisms (in most cases) or single-quote shell quoting (which may be
relevant when connector configurations are defined on the command line).
Thanks for the suggestion, Robert!

Cheers,

Chris


Re: [DISCUSS] KIP-821: Connect Transforms support for nested structures

2022-08-12 Thread Jorge Esteban Quilcate Otoya
Awesome, thanks Robert!
Will work on updated the proposal around this. Also, @Chris Egerton
 if you have any feedback about this, would be
much appreciated.

Cheers,
Jorge.

On Fri, 12 Aug 2022 at 18:45, Robert Yokota  wrote:

> Hi Jorge,
>
> Yes, to escape a backtick I would recommend a backslash.
>
> For example, MySQL allows identifiers to be surrounded with single quotes,
> double quotes, or backticks, and they use backslash to escape.
>
> Jsonata went with primarily backticks as they are less common in strings
> that appear in JSON, and it seems to have worked nicely for them.
>
> Regards,
> Robert
>
> On Fri, Aug 12, 2022 at 10:01 AM Jorge Esteban Quilcate Otoya <
> quilcate.jo...@gmail.com> wrote:
>
> > Thanks Robert! And sorry for the late reply.
> >
> > That's a great catch and will require the current proposal to be updated.
> >
> > Using back-ticks is a good proposal. Though, the challenge we got is that
> > JSON allows really any character into attribute names, including
> > back-ticks.
> > I wonder what's the best approach to escape back-ticks then. Given that
> > those use-cases should be minimal we could fallback to backslash to
> escape
> > back-ticks?
> >
> > What do you think?
> >
> > Many thanks,
> > Jorge.
> >
> >
> > On Wed, 20 Jul 2022 at 23:03, Robert Yokota  wrote:
> >
> > > Hi,
> > >
> > > I'm late to this thread, but would like to comment on the double
> > dot/double
> > > asterisk syntax.
> > >
> > > Unfortunately double dot is often used in JSON Path as a descendant
> > > selector, see https://www.ietf.org/id/draft-ietf-jsonpath-base-05.html
> > >
> > > I think a better notation would be to use backticks to quote any names
> > with
> > > a dot, such as  root.`child.with.dot`.  See Jsonata syntax for example:
> > > http://docs.jsonata.org/simple
> > >
> > > Robert
> > >
> > > On Wed, Jun 29, 2022 at 3:34 AM Jorge Esteban Quilcate Otoya <
> > > quilcate.jo...@gmail.com> wrote:
> > >
> > > > Thanks Chris! I have updated the KIP to include this fix.
> > > >
> > > > I will keep the array as a potential improvement at the moment, and
> out
> > > of
> > > > the scope of this KIP.
> > > >
> > > > Thanks,
> > > > Jorge.
> > > >
> > > > On Tue, 28 Jun 2022 at 23:19, Chris Egerton  >
> > > > wrote:
> > > >
> > > > > Hi Jorge,
> > > > >
> > > > > Apologies for the long delay, had my own KIP-related work to focus
> > on.
> > > > >
> > > > > I think it's fine to include array accesses but it's not a blocker.
> > I'm
> > > > +1
> > > > > either way. On that front though, I think the MaskField section
> might
> > > > need
> > > > > to be updated as it still mentions arrays and deep-scan?
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Chris
> > > > >
> > > > > On Tue, Jun 28, 2022 at 4:38 PM Jorge Esteban Quilcate Otoya <
> > > > > quilcate.jo...@gmail.com> wrote:
> > > > >
> > > > > > Hi there,
> > > > > >
> > > > > > I have update the KIP to the previous state voted, including the
> > > > > > configuration change from `field.style` to
> `field.syntax.version`.
> > > > > >
> > > > > > I'll bump the vote thread as well to see if there's agreement on
> > > adding
> > > > > > this feature to Connect.
> > > > > >
> > > > > > Cheers,
> > > > > > Jorge.
> > > > > >
> > > > > > On Wed, 15 Jun 2022 at 23:02, Jorge Esteban Quilcate Otoya <
> > > > > > quilcate.jo...@gmail.com> wrote:
> > > > > >
> > > > > > > Thanks, Chris. Your feedback is much appreciated!
> > > > > > >
> > > > > > > I see how the current proposal might be underestimating some
> edge
> > > > > cases.
> > > > > > > I'm happy to move the design for deep-scan and multi-values to
> > > future
> > > > > > > developments related with this KIP and reduce its scope, though
> > > open
> > > > > for
> > > > > > > more feedback.
> > > > > > >
> > > > > > > Also, just to be sure, are you proposing also to not include
> > array
> > > > > access
> > > > > > > at this stage?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Jorge.
> > > > > > >
> > > > > > > On Tue, 14 Jun 2022 at 03:20, Chris Egerton <
> > > fearthecel...@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > >> Hi Jorge,
> > > > > > >>
> > > > > > >> I've done some more thinking and I hate to say it, but I think
> > the
> > > > > > syntax
> > > > > > >> does need to be expanded. Right now it's clear what "a.b"
> refers
> > > to
> > > > > and
> > > > > > >> what "a..b" refers to, but what about "a...b"? Is that
> referring
> > > to
> > > > > > >> subfield ".b" of field "a", or subfield "b" of field "a."?
> This
> > > gets
> > > > > > even
> > > > > > >> more complicated when thinking about fields whose names are
> > > > > exclusively
> > > > > > >> made up of dots.
> > > > > > >>
> > > > > > >> I'm also a little hesitant to mix the cases of multi-value
> paths
> > > and
> > > > > > deep
> > > > > > >> scans. What if you only want to access one subfield deep for
> an
> > > SMT,
> > > > > > >> instead of recursing through all the children of a given
> field?
> > > It's

Re: [DISCUSS] KIP-821: Connect Transforms support for nested structures

2022-08-12 Thread Robert Yokota
Hi Jorge,

Yes, to escape a backtick I would recommend a backslash.

For example, MySQL allows identifiers to be surrounded with single quotes,
double quotes, or backticks, and they use backslash to escape.

Jsonata went with primarily backticks as they are less common in strings
that appear in JSON, and it seems to have worked nicely for them.

Regards,
Robert

On Fri, Aug 12, 2022 at 10:01 AM Jorge Esteban Quilcate Otoya <
quilcate.jo...@gmail.com> wrote:

> Thanks Robert! And sorry for the late reply.
>
> That's a great catch and will require the current proposal to be updated.
>
> Using back-ticks is a good proposal. Though, the challenge we got is that
> JSON allows really any character into attribute names, including
> back-ticks.
> I wonder what's the best approach to escape back-ticks then. Given that
> those use-cases should be minimal we could fallback to backslash to escape
> back-ticks?
>
> What do you think?
>
> Many thanks,
> Jorge.
>
>
> On Wed, 20 Jul 2022 at 23:03, Robert Yokota  wrote:
>
> > Hi,
> >
> > I'm late to this thread, but would like to comment on the double
> dot/double
> > asterisk syntax.
> >
> > Unfortunately double dot is often used in JSON Path as a descendant
> > selector, see https://www.ietf.org/id/draft-ietf-jsonpath-base-05.html
> >
> > I think a better notation would be to use backticks to quote any names
> with
> > a dot, such as  root.`child.with.dot`.  See Jsonata syntax for example:
> > http://docs.jsonata.org/simple
> >
> > Robert
> >
> > On Wed, Jun 29, 2022 at 3:34 AM Jorge Esteban Quilcate Otoya <
> > quilcate.jo...@gmail.com> wrote:
> >
> > > Thanks Chris! I have updated the KIP to include this fix.
> > >
> > > I will keep the array as a potential improvement at the moment, and out
> > of
> > > the scope of this KIP.
> > >
> > > Thanks,
> > > Jorge.
> > >
> > > On Tue, 28 Jun 2022 at 23:19, Chris Egerton 
> > > wrote:
> > >
> > > > Hi Jorge,
> > > >
> > > > Apologies for the long delay, had my own KIP-related work to focus
> on.
> > > >
> > > > I think it's fine to include array accesses but it's not a blocker.
> I'm
> > > +1
> > > > either way. On that front though, I think the MaskField section might
> > > need
> > > > to be updated as it still mentions arrays and deep-scan?
> > > >
> > > > Cheers,
> > > >
> > > > Chris
> > > >
> > > > On Tue, Jun 28, 2022 at 4:38 PM Jorge Esteban Quilcate Otoya <
> > > > quilcate.jo...@gmail.com> wrote:
> > > >
> > > > > Hi there,
> > > > >
> > > > > I have update the KIP to the previous state voted, including the
> > > > > configuration change from `field.style` to `field.syntax.version`.
> > > > >
> > > > > I'll bump the vote thread as well to see if there's agreement on
> > adding
> > > > > this feature to Connect.
> > > > >
> > > > > Cheers,
> > > > > Jorge.
> > > > >
> > > > > On Wed, 15 Jun 2022 at 23:02, Jorge Esteban Quilcate Otoya <
> > > > > quilcate.jo...@gmail.com> wrote:
> > > > >
> > > > > > Thanks, Chris. Your feedback is much appreciated!
> > > > > >
> > > > > > I see how the current proposal might be underestimating some edge
> > > > cases.
> > > > > > I'm happy to move the design for deep-scan and multi-values to
> > future
> > > > > > developments related with this KIP and reduce its scope, though
> > open
> > > > for
> > > > > > more feedback.
> > > > > >
> > > > > > Also, just to be sure, are you proposing also to not include
> array
> > > > access
> > > > > > at this stage?
> > > > > >
> > > > > > Thanks,
> > > > > > Jorge.
> > > > > >
> > > > > > On Tue, 14 Jun 2022 at 03:20, Chris Egerton <
> > fearthecel...@gmail.com
> > > >
> > > > > > wrote:
> > > > > >
> > > > > >> Hi Jorge,
> > > > > >>
> > > > > >> I've done some more thinking and I hate to say it, but I think
> the
> > > > > syntax
> > > > > >> does need to be expanded. Right now it's clear what "a.b" refers
> > to
> > > > and
> > > > > >> what "a..b" refers to, but what about "a...b"? Is that referring
> > to
> > > > > >> subfield ".b" of field "a", or subfield "b" of field "a."? This
> > gets
> > > > > even
> > > > > >> more complicated when thinking about fields whose names are
> > > > exclusively
> > > > > >> made up of dots.
> > > > > >>
> > > > > >> I'm also a little hesitant to mix the cases of multi-value paths
> > and
> > > > > deep
> > > > > >> scans. What if you only want to access one subfield deep for an
> > SMT,
> > > > > >> instead of recursing through all the children of a given field?
> > It's
> > > > > akin
> > > > > >> to the distinction between * and ** with file globbing patterns,
> > and
> > > > > there
> > > > > >> could be a substantial performance difference if you have
> > > > heavily-nested
> > > > > >> fields.
> > > > > >>
> > > > > >> Ultimately, I think that if the proposed "field.syntax.version"
> > > > property
> > > > > >> sits well with people, it might be better to reduce the scope of
> > the
> > > > KIP
> > > > > >> back to the original proposal and just focus on adding support
> for
> > > > > >> 

Re: [DISCUSS] KIP-821: Connect Transforms support for nested structures

2022-08-12 Thread Jorge Esteban Quilcate Otoya
Thanks Robert! And sorry for the late reply.

That's a great catch and will require the current proposal to be updated.

Using back-ticks is a good proposal. Though, the challenge we got is that
JSON allows really any character into attribute names, including back-ticks.
I wonder what's the best approach to escape back-ticks then. Given that
those use-cases should be minimal we could fallback to backslash to escape
back-ticks?

What do you think?

Many thanks,
Jorge.


On Wed, 20 Jul 2022 at 23:03, Robert Yokota  wrote:

> Hi,
>
> I'm late to this thread, but would like to comment on the double dot/double
> asterisk syntax.
>
> Unfortunately double dot is often used in JSON Path as a descendant
> selector, see https://www.ietf.org/id/draft-ietf-jsonpath-base-05.html
>
> I think a better notation would be to use backticks to quote any names with
> a dot, such as  root.`child.with.dot`.  See Jsonata syntax for example:
> http://docs.jsonata.org/simple
>
> Robert
>
> On Wed, Jun 29, 2022 at 3:34 AM Jorge Esteban Quilcate Otoya <
> quilcate.jo...@gmail.com> wrote:
>
> > Thanks Chris! I have updated the KIP to include this fix.
> >
> > I will keep the array as a potential improvement at the moment, and out
> of
> > the scope of this KIP.
> >
> > Thanks,
> > Jorge.
> >
> > On Tue, 28 Jun 2022 at 23:19, Chris Egerton 
> > wrote:
> >
> > > Hi Jorge,
> > >
> > > Apologies for the long delay, had my own KIP-related work to focus on.
> > >
> > > I think it's fine to include array accesses but it's not a blocker. I'm
> > +1
> > > either way. On that front though, I think the MaskField section might
> > need
> > > to be updated as it still mentions arrays and deep-scan?
> > >
> > > Cheers,
> > >
> > > Chris
> > >
> > > On Tue, Jun 28, 2022 at 4:38 PM Jorge Esteban Quilcate Otoya <
> > > quilcate.jo...@gmail.com> wrote:
> > >
> > > > Hi there,
> > > >
> > > > I have update the KIP to the previous state voted, including the
> > > > configuration change from `field.style` to `field.syntax.version`.
> > > >
> > > > I'll bump the vote thread as well to see if there's agreement on
> adding
> > > > this feature to Connect.
> > > >
> > > > Cheers,
> > > > Jorge.
> > > >
> > > > On Wed, 15 Jun 2022 at 23:02, Jorge Esteban Quilcate Otoya <
> > > > quilcate.jo...@gmail.com> wrote:
> > > >
> > > > > Thanks, Chris. Your feedback is much appreciated!
> > > > >
> > > > > I see how the current proposal might be underestimating some edge
> > > cases.
> > > > > I'm happy to move the design for deep-scan and multi-values to
> future
> > > > > developments related with this KIP and reduce its scope, though
> open
> > > for
> > > > > more feedback.
> > > > >
> > > > > Also, just to be sure, are you proposing also to not include array
> > > access
> > > > > at this stage?
> > > > >
> > > > > Thanks,
> > > > > Jorge.
> > > > >
> > > > > On Tue, 14 Jun 2022 at 03:20, Chris Egerton <
> fearthecel...@gmail.com
> > >
> > > > > wrote:
> > > > >
> > > > >> Hi Jorge,
> > > > >>
> > > > >> I've done some more thinking and I hate to say it, but I think the
> > > > syntax
> > > > >> does need to be expanded. Right now it's clear what "a.b" refers
> to
> > > and
> > > > >> what "a..b" refers to, but what about "a...b"? Is that referring
> to
> > > > >> subfield ".b" of field "a", or subfield "b" of field "a."? This
> gets
> > > > even
> > > > >> more complicated when thinking about fields whose names are
> > > exclusively
> > > > >> made up of dots.
> > > > >>
> > > > >> I'm also a little hesitant to mix the cases of multi-value paths
> and
> > > > deep
> > > > >> scans. What if you only want to access one subfield deep for an
> SMT,
> > > > >> instead of recursing through all the children of a given field?
> It's
> > > > akin
> > > > >> to the distinction between * and ** with file globbing patterns,
> and
> > > > there
> > > > >> could be a substantial performance difference if you have
> > > heavily-nested
> > > > >> fields.
> > > > >>
> > > > >> Ultimately, I think that if the proposed "field.syntax.version"
> > > property
> > > > >> sits well with people, it might be better to reduce the scope of
> the
> > > KIP
> > > > >> back to the original proposal and just focus on adding support for
> > > > >> explicitly-specified nested values, with no multi-value paths
> > > > whatsoever,
> > > > >> knowing that we have an easy way to introduce new syntax and
> > features
> > > in
> > > > >> the future. (We could probably leave the "a...b" case for that
> next
> > > > >> version
> > > > >> too.)
> > > > >>
> > > > >> I was a huge fan of this KIP before we started trying to address
> > more
> > > > >> complex use cases, and although I don't want to write those off, I
> > > think
> > > > >> we
> > > > >> may have bitten off more than we can chew in time for the 3.3.0
> > > release
> > > > >> and
> > > > >> would hate to see this KIP get delayed as a result.
> > > > >>
> > > > >> I'd be really curious to hear from Joshua and Tom on this front,
> > > though.

Re: [DISCUSS] KIP-821: Connect Transforms support for nested structures

2022-07-20 Thread Robert Yokota
Hi,

I'm late to this thread, but would like to comment on the double dot/double
asterisk syntax.

Unfortunately double dot is often used in JSON Path as a descendant
selector, see https://www.ietf.org/id/draft-ietf-jsonpath-base-05.html

I think a better notation would be to use backticks to quote any names with
a dot, such as  root.`child.with.dot`.  See Jsonata syntax for example:
http://docs.jsonata.org/simple

Robert

On Wed, Jun 29, 2022 at 3:34 AM Jorge Esteban Quilcate Otoya <
quilcate.jo...@gmail.com> wrote:

> Thanks Chris! I have updated the KIP to include this fix.
>
> I will keep the array as a potential improvement at the moment, and out of
> the scope of this KIP.
>
> Thanks,
> Jorge.
>
> On Tue, 28 Jun 2022 at 23:19, Chris Egerton 
> wrote:
>
> > Hi Jorge,
> >
> > Apologies for the long delay, had my own KIP-related work to focus on.
> >
> > I think it's fine to include array accesses but it's not a blocker. I'm
> +1
> > either way. On that front though, I think the MaskField section might
> need
> > to be updated as it still mentions arrays and deep-scan?
> >
> > Cheers,
> >
> > Chris
> >
> > On Tue, Jun 28, 2022 at 4:38 PM Jorge Esteban Quilcate Otoya <
> > quilcate.jo...@gmail.com> wrote:
> >
> > > Hi there,
> > >
> > > I have update the KIP to the previous state voted, including the
> > > configuration change from `field.style` to `field.syntax.version`.
> > >
> > > I'll bump the vote thread as well to see if there's agreement on adding
> > > this feature to Connect.
> > >
> > > Cheers,
> > > Jorge.
> > >
> > > On Wed, 15 Jun 2022 at 23:02, Jorge Esteban Quilcate Otoya <
> > > quilcate.jo...@gmail.com> wrote:
> > >
> > > > Thanks, Chris. Your feedback is much appreciated!
> > > >
> > > > I see how the current proposal might be underestimating some edge
> > cases.
> > > > I'm happy to move the design for deep-scan and multi-values to future
> > > > developments related with this KIP and reduce its scope, though open
> > for
> > > > more feedback.
> > > >
> > > > Also, just to be sure, are you proposing also to not include array
> > access
> > > > at this stage?
> > > >
> > > > Thanks,
> > > > Jorge.
> > > >
> > > > On Tue, 14 Jun 2022 at 03:20, Chris Egerton  >
> > > > wrote:
> > > >
> > > >> Hi Jorge,
> > > >>
> > > >> I've done some more thinking and I hate to say it, but I think the
> > > syntax
> > > >> does need to be expanded. Right now it's clear what "a.b" refers to
> > and
> > > >> what "a..b" refers to, but what about "a...b"? Is that referring to
> > > >> subfield ".b" of field "a", or subfield "b" of field "a."? This gets
> > > even
> > > >> more complicated when thinking about fields whose names are
> > exclusively
> > > >> made up of dots.
> > > >>
> > > >> I'm also a little hesitant to mix the cases of multi-value paths and
> > > deep
> > > >> scans. What if you only want to access one subfield deep for an SMT,
> > > >> instead of recursing through all the children of a given field? It's
> > > akin
> > > >> to the distinction between * and ** with file globbing patterns, and
> > > there
> > > >> could be a substantial performance difference if you have
> > heavily-nested
> > > >> fields.
> > > >>
> > > >> Ultimately, I think that if the proposed "field.syntax.version"
> > property
> > > >> sits well with people, it might be better to reduce the scope of the
> > KIP
> > > >> back to the original proposal and just focus on adding support for
> > > >> explicitly-specified nested values, with no multi-value paths
> > > whatsoever,
> > > >> knowing that we have an easy way to introduce new syntax and
> features
> > in
> > > >> the future. (We could probably leave the "a...b" case for that next
> > > >> version
> > > >> too.)
> > > >>
> > > >> I was a huge fan of this KIP before we started trying to address
> more
> > > >> complex use cases, and although I don't want to write those off, I
> > think
> > > >> we
> > > >> may have bitten off more than we can chew in time for the 3.3.0
> > release
> > > >> and
> > > >> would hate to see this KIP get delayed as a result.
> > > >>
> > > >> I'd be really curious to hear from Joshua and Tom on this front,
> > though.
> > > >> Is
> > > >> it acceptable to move more incrementally here and settle on the
> syntax
> > > >> version property as our means of introducing new features, or is it
> > > >> preferable to implement things monolithically and try to get
> > everything
> > > >> (or
> > > >> at least, as much as possible) right the first time?
> > > >>
> > > >> Thanks again for your continued effort on this KIP!
> > > >>
> > > >> Cheers,
> > > >>
> > > >> Chris
> > > >>
> > > >> On Wed, Jun 8, 2022 at 5:41 PM Jorge Esteban Quilcate Otoya <
> > > >> quilcate.jo...@gmail.com> wrote:
> > > >>
> > > >> > Thanks, Chris!
> > > >> >
> > > >> > Please, find my comments below:
> > > >> >
> > > >> > On Tue, 7 Jun 2022 at 04:39, Chris Egerton <
> fearthecel...@gmail.com
> > >
> > > >> > wrote:
> > > >> >
> > > >> > > Hi Jorge,
> > > >> > >
> > > >> 

Re: [DISCUSS] KIP-821: Connect Transforms support for nested structures

2022-06-29 Thread Jorge Esteban Quilcate Otoya
Thanks Chris! I have updated the KIP to include this fix.

I will keep the array as a potential improvement at the moment, and out of
the scope of this KIP.

Thanks,
Jorge.

On Tue, 28 Jun 2022 at 23:19, Chris Egerton  wrote:

> Hi Jorge,
>
> Apologies for the long delay, had my own KIP-related work to focus on.
>
> I think it's fine to include array accesses but it's not a blocker. I'm +1
> either way. On that front though, I think the MaskField section might need
> to be updated as it still mentions arrays and deep-scan?
>
> Cheers,
>
> Chris
>
> On Tue, Jun 28, 2022 at 4:38 PM Jorge Esteban Quilcate Otoya <
> quilcate.jo...@gmail.com> wrote:
>
> > Hi there,
> >
> > I have update the KIP to the previous state voted, including the
> > configuration change from `field.style` to `field.syntax.version`.
> >
> > I'll bump the vote thread as well to see if there's agreement on adding
> > this feature to Connect.
> >
> > Cheers,
> > Jorge.
> >
> > On Wed, 15 Jun 2022 at 23:02, Jorge Esteban Quilcate Otoya <
> > quilcate.jo...@gmail.com> wrote:
> >
> > > Thanks, Chris. Your feedback is much appreciated!
> > >
> > > I see how the current proposal might be underestimating some edge
> cases.
> > > I'm happy to move the design for deep-scan and multi-values to future
> > > developments related with this KIP and reduce its scope, though open
> for
> > > more feedback.
> > >
> > > Also, just to be sure, are you proposing also to not include array
> access
> > > at this stage?
> > >
> > > Thanks,
> > > Jorge.
> > >
> > > On Tue, 14 Jun 2022 at 03:20, Chris Egerton 
> > > wrote:
> > >
> > >> Hi Jorge,
> > >>
> > >> I've done some more thinking and I hate to say it, but I think the
> > syntax
> > >> does need to be expanded. Right now it's clear what "a.b" refers to
> and
> > >> what "a..b" refers to, but what about "a...b"? Is that referring to
> > >> subfield ".b" of field "a", or subfield "b" of field "a."? This gets
> > even
> > >> more complicated when thinking about fields whose names are
> exclusively
> > >> made up of dots.
> > >>
> > >> I'm also a little hesitant to mix the cases of multi-value paths and
> > deep
> > >> scans. What if you only want to access one subfield deep for an SMT,
> > >> instead of recursing through all the children of a given field? It's
> > akin
> > >> to the distinction between * and ** with file globbing patterns, and
> > there
> > >> could be a substantial performance difference if you have
> heavily-nested
> > >> fields.
> > >>
> > >> Ultimately, I think that if the proposed "field.syntax.version"
> property
> > >> sits well with people, it might be better to reduce the scope of the
> KIP
> > >> back to the original proposal and just focus on adding support for
> > >> explicitly-specified nested values, with no multi-value paths
> > whatsoever,
> > >> knowing that we have an easy way to introduce new syntax and features
> in
> > >> the future. (We could probably leave the "a...b" case for that next
> > >> version
> > >> too.)
> > >>
> > >> I was a huge fan of this KIP before we started trying to address more
> > >> complex use cases, and although I don't want to write those off, I
> think
> > >> we
> > >> may have bitten off more than we can chew in time for the 3.3.0
> release
> > >> and
> > >> would hate to see this KIP get delayed as a result.
> > >>
> > >> I'd be really curious to hear from Joshua and Tom on this front,
> though.
> > >> Is
> > >> it acceptable to move more incrementally here and settle on the syntax
> > >> version property as our means of introducing new features, or is it
> > >> preferable to implement things monolithically and try to get
> everything
> > >> (or
> > >> at least, as much as possible) right the first time?
> > >>
> > >> Thanks again for your continued effort on this KIP!
> > >>
> > >> Cheers,
> > >>
> > >> Chris
> > >>
> > >> On Wed, Jun 8, 2022 at 5:41 PM Jorge Esteban Quilcate Otoya <
> > >> quilcate.jo...@gmail.com> wrote:
> > >>
> > >> > Thanks, Chris!
> > >> >
> > >> > Please, find my comments below:
> > >> >
> > >> > On Tue, 7 Jun 2022 at 04:39, Chris Egerton  >
> > >> > wrote:
> > >> >
> > >> > > Hi Jorge,
> > >> > >
> > >> > > Thanks! Sorry for the delay; here are my thoughts:
> > >> > >
> > >> > > 1. Under the "Accessing multiple values by deep-scan" header it's
> > >> stated
> > >> > > that "If deep-scan is used, it must have only one field after the
> > >> > asterisk
> > >> > > level.". However, in example 3 for the Cast SMT and other examples
> > for
> > >> > > other SMTs, the spec contains a field of "*.child.k2", which
> appears
> > >> to
> > >> > > have two fields after the asterisk level. I may be
> misunderstanding
> > >> the
> > >> > > proposal, but it seems like the two contradict each other.
> > >> > >
> > >> >
> > >> > Thanks for catching this. I have clarified it by removing this
> > >> restriction.
> > >> > Also, have extended the deep-scan scenarios.
> > >> >
> > >> >
> > >> > >
> > >> > > 2. I'm a little unclear on why 

Re: [DISCUSS] KIP-821: Connect Transforms support for nested structures

2022-06-28 Thread Chris Egerton
Hi Jorge,

Apologies for the long delay, had my own KIP-related work to focus on.

I think it's fine to include array accesses but it's not a blocker. I'm +1
either way. On that front though, I think the MaskField section might need
to be updated as it still mentions arrays and deep-scan?

Cheers,

Chris

On Tue, Jun 28, 2022 at 4:38 PM Jorge Esteban Quilcate Otoya <
quilcate.jo...@gmail.com> wrote:

> Hi there,
>
> I have update the KIP to the previous state voted, including the
> configuration change from `field.style` to `field.syntax.version`.
>
> I'll bump the vote thread as well to see if there's agreement on adding
> this feature to Connect.
>
> Cheers,
> Jorge.
>
> On Wed, 15 Jun 2022 at 23:02, Jorge Esteban Quilcate Otoya <
> quilcate.jo...@gmail.com> wrote:
>
> > Thanks, Chris. Your feedback is much appreciated!
> >
> > I see how the current proposal might be underestimating some edge cases.
> > I'm happy to move the design for deep-scan and multi-values to future
> > developments related with this KIP and reduce its scope, though open for
> > more feedback.
> >
> > Also, just to be sure, are you proposing also to not include array access
> > at this stage?
> >
> > Thanks,
> > Jorge.
> >
> > On Tue, 14 Jun 2022 at 03:20, Chris Egerton 
> > wrote:
> >
> >> Hi Jorge,
> >>
> >> I've done some more thinking and I hate to say it, but I think the
> syntax
> >> does need to be expanded. Right now it's clear what "a.b" refers to and
> >> what "a..b" refers to, but what about "a...b"? Is that referring to
> >> subfield ".b" of field "a", or subfield "b" of field "a."? This gets
> even
> >> more complicated when thinking about fields whose names are exclusively
> >> made up of dots.
> >>
> >> I'm also a little hesitant to mix the cases of multi-value paths and
> deep
> >> scans. What if you only want to access one subfield deep for an SMT,
> >> instead of recursing through all the children of a given field? It's
> akin
> >> to the distinction between * and ** with file globbing patterns, and
> there
> >> could be a substantial performance difference if you have heavily-nested
> >> fields.
> >>
> >> Ultimately, I think that if the proposed "field.syntax.version" property
> >> sits well with people, it might be better to reduce the scope of the KIP
> >> back to the original proposal and just focus on adding support for
> >> explicitly-specified nested values, with no multi-value paths
> whatsoever,
> >> knowing that we have an easy way to introduce new syntax and features in
> >> the future. (We could probably leave the "a...b" case for that next
> >> version
> >> too.)
> >>
> >> I was a huge fan of this KIP before we started trying to address more
> >> complex use cases, and although I don't want to write those off, I think
> >> we
> >> may have bitten off more than we can chew in time for the 3.3.0 release
> >> and
> >> would hate to see this KIP get delayed as a result.
> >>
> >> I'd be really curious to hear from Joshua and Tom on this front, though.
> >> Is
> >> it acceptable to move more incrementally here and settle on the syntax
> >> version property as our means of introducing new features, or is it
> >> preferable to implement things monolithically and try to get everything
> >> (or
> >> at least, as much as possible) right the first time?
> >>
> >> Thanks again for your continued effort on this KIP!
> >>
> >> Cheers,
> >>
> >> Chris
> >>
> >> On Wed, Jun 8, 2022 at 5:41 PM Jorge Esteban Quilcate Otoya <
> >> quilcate.jo...@gmail.com> wrote:
> >>
> >> > Thanks, Chris!
> >> >
> >> > Please, find my comments below:
> >> >
> >> > On Tue, 7 Jun 2022 at 04:39, Chris Egerton 
> >> > wrote:
> >> >
> >> > > Hi Jorge,
> >> > >
> >> > > Thanks! Sorry for the delay; here are my thoughts:
> >> > >
> >> > > 1. Under the "Accessing multiple values by deep-scan" header it's
> >> stated
> >> > > that "If deep-scan is used, it must have only one field after the
> >> > asterisk
> >> > > level.". However, in example 3 for the Cast SMT and other examples
> for
> >> > > other SMTs, the spec contains a field of "*.child.k2", which appears
> >> to
> >> > > have two fields after the asterisk level. I may be misunderstanding
> >> the
> >> > > proposal, but it seems like the two contradict each other.
> >> > >
> >> >
> >> > Thanks for catching this. I have clarified it by removing this
> >> restriction.
> >> > Also, have extended the deep-scan scenarios.
> >> >
> >> >
> >> > >
> >> > > 2. I'm a little unclear on why we need the special handling for
> arrays
> >> > > where, for an array field "a", the field name "a" can be treated as
> >> > either
> >> > > the array itself, or every element in the array. Is there a reason
> we
> >> > can't
> >> > > use the field name "a.*" to handle the latter case, and "a" to
> handle
> >> the
> >> > > former?
> >> > >
> >> >
> >> > Agree, this is confusing. I like the `a.*` approach to access array
> >> items.
> >> > I have added this to the proposal.
> >> >
> >> >
> >> > >
> >> > > 3. 

Re: [DISCUSS] KIP-821: Connect Transforms support for nested structures

2022-06-28 Thread Jorge Esteban Quilcate Otoya
Hi there,

I have update the KIP to the previous state voted, including the
configuration change from `field.style` to `field.syntax.version`.

I'll bump the vote thread as well to see if there's agreement on adding
this feature to Connect.

Cheers,
Jorge.

On Wed, 15 Jun 2022 at 23:02, Jorge Esteban Quilcate Otoya <
quilcate.jo...@gmail.com> wrote:

> Thanks, Chris. Your feedback is much appreciated!
>
> I see how the current proposal might be underestimating some edge cases.
> I'm happy to move the design for deep-scan and multi-values to future
> developments related with this KIP and reduce its scope, though open for
> more feedback.
>
> Also, just to be sure, are you proposing also to not include array access
> at this stage?
>
> Thanks,
> Jorge.
>
> On Tue, 14 Jun 2022 at 03:20, Chris Egerton 
> wrote:
>
>> Hi Jorge,
>>
>> I've done some more thinking and I hate to say it, but I think the syntax
>> does need to be expanded. Right now it's clear what "a.b" refers to and
>> what "a..b" refers to, but what about "a...b"? Is that referring to
>> subfield ".b" of field "a", or subfield "b" of field "a."? This gets even
>> more complicated when thinking about fields whose names are exclusively
>> made up of dots.
>>
>> I'm also a little hesitant to mix the cases of multi-value paths and deep
>> scans. What if you only want to access one subfield deep for an SMT,
>> instead of recursing through all the children of a given field? It's akin
>> to the distinction between * and ** with file globbing patterns, and there
>> could be a substantial performance difference if you have heavily-nested
>> fields.
>>
>> Ultimately, I think that if the proposed "field.syntax.version" property
>> sits well with people, it might be better to reduce the scope of the KIP
>> back to the original proposal and just focus on adding support for
>> explicitly-specified nested values, with no multi-value paths whatsoever,
>> knowing that we have an easy way to introduce new syntax and features in
>> the future. (We could probably leave the "a...b" case for that next
>> version
>> too.)
>>
>> I was a huge fan of this KIP before we started trying to address more
>> complex use cases, and although I don't want to write those off, I think
>> we
>> may have bitten off more than we can chew in time for the 3.3.0 release
>> and
>> would hate to see this KIP get delayed as a result.
>>
>> I'd be really curious to hear from Joshua and Tom on this front, though.
>> Is
>> it acceptable to move more incrementally here and settle on the syntax
>> version property as our means of introducing new features, or is it
>> preferable to implement things monolithically and try to get everything
>> (or
>> at least, as much as possible) right the first time?
>>
>> Thanks again for your continued effort on this KIP!
>>
>> Cheers,
>>
>> Chris
>>
>> On Wed, Jun 8, 2022 at 5:41 PM Jorge Esteban Quilcate Otoya <
>> quilcate.jo...@gmail.com> wrote:
>>
>> > Thanks, Chris!
>> >
>> > Please, find my comments below:
>> >
>> > On Tue, 7 Jun 2022 at 04:39, Chris Egerton 
>> > wrote:
>> >
>> > > Hi Jorge,
>> > >
>> > > Thanks! Sorry for the delay; here are my thoughts:
>> > >
>> > > 1. Under the "Accessing multiple values by deep-scan" header it's
>> stated
>> > > that "If deep-scan is used, it must have only one field after the
>> > asterisk
>> > > level.". However, in example 3 for the Cast SMT and other examples for
>> > > other SMTs, the spec contains a field of "*.child.k2", which appears
>> to
>> > > have two fields after the asterisk level. I may be misunderstanding
>> the
>> > > proposal, but it seems like the two contradict each other.
>> > >
>> >
>> > Thanks for catching this. I have clarified it by removing this
>> restriction.
>> > Also, have extended the deep-scan scenarios.
>> >
>> >
>> > >
>> > > 2. I'm a little unclear on why we need the special handling for arrays
>> > > where, for an array field "a", the field name "a" can be treated as
>> > either
>> > > the array itself, or every element in the array. Is there a reason we
>> > can't
>> > > use the field name "a.*" to handle the latter case, and "a" to handle
>> the
>> > > former?
>> > >
>> >
>> > Agree, this is confusing. I like the `a.*` approach to access array
>> items.
>> > I have added this to the proposal.
>> >
>> >
>> > >
>> > > 3. How would a user specify that they'd like to access a field with
>> the
>> > > literal name "*"?
>> > >
>> >
>> > Good one. I'm proposing an approach similar to how it's proposed to
>> escape
>> > dots, with a double-asterisk. Curious on your thoughts around this.
>> >
>> >
>> > >
>> > > 4. For the Cast SMT, do you think it might bite some people if fields
>> > that
>> > > can't be cast correctly are silently ignored? I'm imagining the case
>> > where
>> > > none of the fields in a multi-path expression can be cast correctly
>> and
>> > it
>> > > ends up eating half of someone's day to track down why their SMT isn't
>> > > doing anything.
>> > >
>> >
>> > If I 

Re: [DISCUSS] KIP-821: Connect Transforms support for nested structures

2022-06-15 Thread Jorge Esteban Quilcate Otoya
Thanks, Chris. Your feedback is much appreciated!

I see how the current proposal might be underestimating some edge cases.
I'm happy to move the design for deep-scan and multi-values to future
developments related with this KIP and reduce its scope, though open for
more feedback.

Also, just to be sure, are you proposing also to not include array access
at this stage?

Thanks,
Jorge.

On Tue, 14 Jun 2022 at 03:20, Chris Egerton  wrote:

> Hi Jorge,
>
> I've done some more thinking and I hate to say it, but I think the syntax
> does need to be expanded. Right now it's clear what "a.b" refers to and
> what "a..b" refers to, but what about "a...b"? Is that referring to
> subfield ".b" of field "a", or subfield "b" of field "a."? This gets even
> more complicated when thinking about fields whose names are exclusively
> made up of dots.
>
> I'm also a little hesitant to mix the cases of multi-value paths and deep
> scans. What if you only want to access one subfield deep for an SMT,
> instead of recursing through all the children of a given field? It's akin
> to the distinction between * and ** with file globbing patterns, and there
> could be a substantial performance difference if you have heavily-nested
> fields.
>
> Ultimately, I think that if the proposed "field.syntax.version" property
> sits well with people, it might be better to reduce the scope of the KIP
> back to the original proposal and just focus on adding support for
> explicitly-specified nested values, with no multi-value paths whatsoever,
> knowing that we have an easy way to introduce new syntax and features in
> the future. (We could probably leave the "a...b" case for that next version
> too.)
>
> I was a huge fan of this KIP before we started trying to address more
> complex use cases, and although I don't want to write those off, I think we
> may have bitten off more than we can chew in time for the 3.3.0 release and
> would hate to see this KIP get delayed as a result.
>
> I'd be really curious to hear from Joshua and Tom on this front, though. Is
> it acceptable to move more incrementally here and settle on the syntax
> version property as our means of introducing new features, or is it
> preferable to implement things monolithically and try to get everything (or
> at least, as much as possible) right the first time?
>
> Thanks again for your continued effort on this KIP!
>
> Cheers,
>
> Chris
>
> On Wed, Jun 8, 2022 at 5:41 PM Jorge Esteban Quilcate Otoya <
> quilcate.jo...@gmail.com> wrote:
>
> > Thanks, Chris!
> >
> > Please, find my comments below:
> >
> > On Tue, 7 Jun 2022 at 04:39, Chris Egerton 
> > wrote:
> >
> > > Hi Jorge,
> > >
> > > Thanks! Sorry for the delay; here are my thoughts:
> > >
> > > 1. Under the "Accessing multiple values by deep-scan" header it's
> stated
> > > that "If deep-scan is used, it must have only one field after the
> > asterisk
> > > level.". However, in example 3 for the Cast SMT and other examples for
> > > other SMTs, the spec contains a field of "*.child.k2", which appears to
> > > have two fields after the asterisk level. I may be misunderstanding the
> > > proposal, but it seems like the two contradict each other.
> > >
> >
> > Thanks for catching this. I have clarified it by removing this
> restriction.
> > Also, have extended the deep-scan scenarios.
> >
> >
> > >
> > > 2. I'm a little unclear on why we need the special handling for arrays
> > > where, for an array field "a", the field name "a" can be treated as
> > either
> > > the array itself, or every element in the array. Is there a reason we
> > can't
> > > use the field name "a.*" to handle the latter case, and "a" to handle
> the
> > > former?
> > >
> >
> > Agree, this is confusing. I like the `a.*` approach to access array
> items.
> > I have added this to the proposal.
> >
> >
> > >
> > > 3. How would a user specify that they'd like to access a field with the
> > > literal name "*"?
> > >
> >
> > Good one. I'm proposing an approach similar to how it's proposed to
> escape
> > dots, with a double-asterisk. Curious on your thoughts around this.
> >
> >
> > >
> > > 4. For the Cast SMT, do you think it might bite some people if fields
> > that
> > > can't be cast correctly are silently ignored? I'm imagining the case
> > where
> > > none of the fields in a multi-path expression can be cast correctly and
> > it
> > > ends up eating half of someone's day to track down why their SMT isn't
> > > doing anything.
> > >
> >
> > If I understand correctly, this challenge could be relevant across SMTs.
> > At the moment, most/all? SMTs just silently ignore.
> > Was thinking about adding a flag `field.on.path.not.found` to either
> ignore
> > or fail when no paths are found. What do your think?
> >
> >
> > >
> > > 5. For the ExtractField and ValueToKey SMTs, what happens if a
> deep-scan
> > > field name is used, but only one field is found? Is the resulting field
> > > still an array, or is it just the single field that was found? (FWIW
> I'm

Re: [DISCUSS] KIP-821: Connect Transforms support for nested structures

2022-06-13 Thread Chris Egerton
Hi Jorge,

I've done some more thinking and I hate to say it, but I think the syntax
does need to be expanded. Right now it's clear what "a.b" refers to and
what "a..b" refers to, but what about "a...b"? Is that referring to
subfield ".b" of field "a", or subfield "b" of field "a."? This gets even
more complicated when thinking about fields whose names are exclusively
made up of dots.

I'm also a little hesitant to mix the cases of multi-value paths and deep
scans. What if you only want to access one subfield deep for an SMT,
instead of recursing through all the children of a given field? It's akin
to the distinction between * and ** with file globbing patterns, and there
could be a substantial performance difference if you have heavily-nested
fields.

Ultimately, I think that if the proposed "field.syntax.version" property
sits well with people, it might be better to reduce the scope of the KIP
back to the original proposal and just focus on adding support for
explicitly-specified nested values, with no multi-value paths whatsoever,
knowing that we have an easy way to introduce new syntax and features in
the future. (We could probably leave the "a...b" case for that next version
too.)

I was a huge fan of this KIP before we started trying to address more
complex use cases, and although I don't want to write those off, I think we
may have bitten off more than we can chew in time for the 3.3.0 release and
would hate to see this KIP get delayed as a result.

I'd be really curious to hear from Joshua and Tom on this front, though. Is
it acceptable to move more incrementally here and settle on the syntax
version property as our means of introducing new features, or is it
preferable to implement things monolithically and try to get everything (or
at least, as much as possible) right the first time?

Thanks again for your continued effort on this KIP!

Cheers,

Chris

On Wed, Jun 8, 2022 at 5:41 PM Jorge Esteban Quilcate Otoya <
quilcate.jo...@gmail.com> wrote:

> Thanks, Chris!
>
> Please, find my comments below:
>
> On Tue, 7 Jun 2022 at 04:39, Chris Egerton 
> wrote:
>
> > Hi Jorge,
> >
> > Thanks! Sorry for the delay; here are my thoughts:
> >
> > 1. Under the "Accessing multiple values by deep-scan" header it's stated
> > that "If deep-scan is used, it must have only one field after the
> asterisk
> > level.". However, in example 3 for the Cast SMT and other examples for
> > other SMTs, the spec contains a field of "*.child.k2", which appears to
> > have two fields after the asterisk level. I may be misunderstanding the
> > proposal, but it seems like the two contradict each other.
> >
>
> Thanks for catching this. I have clarified it by removing this restriction.
> Also, have extended the deep-scan scenarios.
>
>
> >
> > 2. I'm a little unclear on why we need the special handling for arrays
> > where, for an array field "a", the field name "a" can be treated as
> either
> > the array itself, or every element in the array. Is there a reason we
> can't
> > use the field name "a.*" to handle the latter case, and "a" to handle the
> > former?
> >
>
> Agree, this is confusing. I like the `a.*` approach to access array items.
> I have added this to the proposal.
>
>
> >
> > 3. How would a user specify that they'd like to access a field with the
> > literal name "*"?
> >
>
> Good one. I'm proposing an approach similar to how it's proposed to escape
> dots, with a double-asterisk. Curious on your thoughts around this.
>
>
> >
> > 4. For the Cast SMT, do you think it might bite some people if fields
> that
> > can't be cast correctly are silently ignored? I'm imagining the case
> where
> > none of the fields in a multi-path expression can be cast correctly and
> it
> > ends up eating half of someone's day to track down why their SMT isn't
> > doing anything.
> >
>
> If I understand correctly, this challenge could be relevant across SMTs.
> At the moment, most/all? SMTs just silently ignore.
> Was thinking about adding a flag `field.on.path.not.found` to either ignore
> or fail when no paths are found. What do your think?
>
>
> >
> > 5. For the ExtractField and ValueToKey SMTs, what happens if a deep-scan
> > field name is used, but only one field is found? Is the resulting field
> > still an array, or is it just the single field that was found? (FWIW I'm
> > leaning towards keeping it an array just to keep schemas consistent in a
> > pipeline in case the number of fields found fluctuates across records.)
> >
> > Agree. Will clarify that an array is always produced even for 1 or 0
> fields are found.
>
>
> > 6. (Nit) For the HeaderFrom SMT, it's stated that "As this SMT affects
> only
> > existing fields, additional configurations will not be required.". Given
> > the new "field.syntax.version" property, this part should probably be
> > removed.
> >
> Agree.
>
>
> >
> > 7. Is recursive descent intentionally excluded? That was an important
> part
> > of Joshua's KIP and his feedback on this KIP; I think it's worth 

Re: [DISCUSS] KIP-821: Connect Transforms support for nested structures

2022-06-08 Thread Jorge Esteban Quilcate Otoya
Thanks, Chris!

Please, find my comments below:

On Tue, 7 Jun 2022 at 04:39, Chris Egerton  wrote:

> Hi Jorge,
>
> Thanks! Sorry for the delay; here are my thoughts:
>
> 1. Under the "Accessing multiple values by deep-scan" header it's stated
> that "If deep-scan is used, it must have only one field after the asterisk
> level.". However, in example 3 for the Cast SMT and other examples for
> other SMTs, the spec contains a field of "*.child.k2", which appears to
> have two fields after the asterisk level. I may be misunderstanding the
> proposal, but it seems like the two contradict each other.
>

Thanks for catching this. I have clarified it by removing this restriction.
Also, have extended the deep-scan scenarios.


>
> 2. I'm a little unclear on why we need the special handling for arrays
> where, for an array field "a", the field name "a" can be treated as either
> the array itself, or every element in the array. Is there a reason we can't
> use the field name "a.*" to handle the latter case, and "a" to handle the
> former?
>

Agree, this is confusing. I like the `a.*` approach to access array items.
I have added this to the proposal.


>
> 3. How would a user specify that they'd like to access a field with the
> literal name "*"?
>

Good one. I'm proposing an approach similar to how it's proposed to escape
dots, with a double-asterisk. Curious on your thoughts around this.


>
> 4. For the Cast SMT, do you think it might bite some people if fields that
> can't be cast correctly are silently ignored? I'm imagining the case where
> none of the fields in a multi-path expression can be cast correctly and it
> ends up eating half of someone's day to track down why their SMT isn't
> doing anything.
>

If I understand correctly, this challenge could be relevant across SMTs.
At the moment, most/all? SMTs just silently ignore.
Was thinking about adding a flag `field.on.path.not.found` to either ignore
or fail when no paths are found. What do your think?


>
> 5. For the ExtractField and ValueToKey SMTs, what happens if a deep-scan
> field name is used, but only one field is found? Is the resulting field
> still an array, or is it just the single field that was found? (FWIW I'm
> leaning towards keeping it an array just to keep schemas consistent in a
> pipeline in case the number of fields found fluctuates across records.)
>
> Agree. Will clarify that an array is always produced even for 1 or 0
fields are found.


> 6. (Nit) For the HeaderFrom SMT, it's stated that "As this SMT affects only
> existing fields, additional configurations will not be required.". Given
> the new "field.syntax.version" property, this part should probably be
> removed.
>
Agree.


>
> 7. Is recursive descent intentionally excluded? That was an important part
> of Joshua's KIP and his feedback on this KIP; I think it's worth pursuing
> if we can.
>

My understanding from Joshua's feedback is that by including support for
deep-scan, we are already covering the recursive functionality. Though, I
may be missing something.

>
>
> Cheers,
>
> Chris
>
> On Tue, May 24, 2022 at 3:49 AM Jorge Esteban Quilcate Otoya <
> quilcate.jo...@gmail.com> wrote:
>
> > Thanks, Chris!
> >
> > I have updated the KIP with the rejected alternatives updated. Also, I
> have
> > drafted the support for arrays and deep scans as part of the proposed
> > notation to make it more complete, even though these can be implemented
> in
> > multiple PRs.
> >
> > Looking forward to your feedback.
> >
> > Cheers,
> > Jorge.
> >
> > On Sat, 21 May 2022 at 17:39, Chris Egerton 
> > wrote:
> >
> > > Hi Jorge,
> > >
> > > I really appreciate the effort you've made to simplify the syntax and
> > > feature set of a JSONPath-based approach as much as possible. I'm still
> > > hesitant to continue with it, though.
> > >
> > > 1. The syntax is much less friendly. Just compare "top.mid.bottom" to
> > > "$['top']['mid']['bottom']"... not everyone uses JSONPath or even JSON,
> > and
> > > the learning curve for the former is going to be steeper. The examples
> in
> > > the new KIP draft you published demonstrate this pretty well, and this
> is
> > > without diving into the details of what escape syntax would look like.
> > >
> > > 2. The three new major features that this syntax adds (array accesses,
> > deep
> > > scans, and multi-value paths) could all be added pretty easily to the
> dot
> > > notation syntax without introducing brackets and dollar signs. Array
> > > accesses can be described using the same syntax as struct/map field
> > access,
> > > deep scans can be described using '*', and multi-value paths can be
> > > described by referencing the name of a field that's expected to have
> > > children. These are all top-of-the-head ideas and can probably be
> > refined,
> > > but hopefully they demonstrate that we can keep the syntax simple
> without
> > > sacrificing features. Of course, the question of leaving room for
> future
> > > features might arise... given that these 

Re: [DISCUSS] KIP-821: Connect Transforms support for nested structures

2022-06-06 Thread Chris Egerton
Hi Jorge,

Thanks! Sorry for the delay; here are my thoughts:

1. Under the "Accessing multiple values by deep-scan" header it's stated
that "If deep-scan is used, it must have only one field after the asterisk
level.". However, in example 3 for the Cast SMT and other examples for
other SMTs, the spec contains a field of "*.child.k2", which appears to
have two fields after the asterisk level. I may be misunderstanding the
proposal, but it seems like the two contradict each other.

2. I'm a little unclear on why we need the special handling for arrays
where, for an array field "a", the field name "a" can be treated as either
the array itself, or every element in the array. Is there a reason we can't
use the field name "a.*" to handle the latter case, and "a" to handle the
former?

3. How would a user specify that they'd like to access a field with the
literal name "*"?

4. For the Cast SMT, do you think it might bite some people if fields that
can't be cast correctly are silently ignored? I'm imagining the case where
none of the fields in a multi-path expression can be cast correctly and it
ends up eating half of someone's day to track down why their SMT isn't
doing anything.

5. For the ExtractField and ValueToKey SMTs, what happens if a deep-scan
field name is used, but only one field is found? Is the resulting field
still an array, or is it just the single field that was found? (FWIW I'm
leaning towards keeping it an array just to keep schemas consistent in a
pipeline in case the number of fields found fluctuates across records.)

6. (Nit) For the HeaderFrom SMT, it's stated that "As this SMT affects only
existing fields, additional configurations will not be required.". Given
the new "field.syntax.version" property, this part should probably be
removed.

7. Is recursive descent intentionally excluded? That was an important part
of Joshua's KIP and his feedback on this KIP; I think it's worth pursuing
if we can.

Cheers,

Chris

On Tue, May 24, 2022 at 3:49 AM Jorge Esteban Quilcate Otoya <
quilcate.jo...@gmail.com> wrote:

> Thanks, Chris!
>
> I have updated the KIP with the rejected alternatives updated. Also, I have
> drafted the support for arrays and deep scans as part of the proposed
> notation to make it more complete, even though these can be implemented in
> multiple PRs.
>
> Looking forward to your feedback.
>
> Cheers,
> Jorge.
>
> On Sat, 21 May 2022 at 17:39, Chris Egerton 
> wrote:
>
> > Hi Jorge,
> >
> > I really appreciate the effort you've made to simplify the syntax and
> > feature set of a JSONPath-based approach as much as possible. I'm still
> > hesitant to continue with it, though.
> >
> > 1. The syntax is much less friendly. Just compare "top.mid.bottom" to
> > "$['top']['mid']['bottom']"... not everyone uses JSONPath or even JSON,
> and
> > the learning curve for the former is going to be steeper. The examples in
> > the new KIP draft you published demonstrate this pretty well, and this is
> > without diving into the details of what escape syntax would look like.
> >
> > 2. The three new major features that this syntax adds (array accesses,
> deep
> > scans, and multi-value paths) could all be added pretty easily to the dot
> > notation syntax without introducing brackets and dollar signs. Array
> > accesses can be described using the same syntax as struct/map field
> access,
> > deep scans can be described using '*', and multi-value paths can be
> > described by referencing the name of a field that's expected to have
> > children. These are all top-of-the-head ideas and can probably be
> refined,
> > but hopefully they demonstrate that we can keep the syntax simple without
> > sacrificing features. Of course, the question of leaving room for future
> > features might arise... given that these are the out-of-the-box SMTs that
> > are likely to be the first that many people encounter, I'd err on the
> side
> > of simplicity and a gentle learning curve; if people need to get more out
> > of transforms, the option to implement their own is still there. If we
> can
> > address 95% of use cases with something easy to use, it's not worth
> making
> > the feature harder for everyone to use just to accommodate the remaining
> > 5%.
> >
> > 3. The advantage of leveraging an existing syntax is twofold: users who
> are
> > already familiar with that syntax don't need to learn a new syntax, and
> > maintainers of the syntax get to leverage existing libraries and
> > documentation for the syntax. With the current proposal, reusing
> libraries
> > is off the table, which means that we're going to have to parse this
> > ourselves (including all the escape syntax edge cases), and we won't be
> > able to automatically leverage new features added to that syntax. And
> given
> > how stripped-down the syntax is in comparison to full JSONPath, there's
> > still going to be a learning curve for users who are already familiar
> with
> > it, and we'll still have to document how Connect's variant of 

Re: [DISCUSS] KIP-821: Connect Transforms support for nested structures

2022-05-24 Thread Jorge Esteban Quilcate Otoya
Thanks, Chris!

I have updated the KIP with the rejected alternatives updated. Also, I have
drafted the support for arrays and deep scans as part of the proposed
notation to make it more complete, even though these can be implemented in
multiple PRs.

Looking forward to your feedback.

Cheers,
Jorge.

On Sat, 21 May 2022 at 17:39, Chris Egerton  wrote:

> Hi Jorge,
>
> I really appreciate the effort you've made to simplify the syntax and
> feature set of a JSONPath-based approach as much as possible. I'm still
> hesitant to continue with it, though.
>
> 1. The syntax is much less friendly. Just compare "top.mid.bottom" to
> "$['top']['mid']['bottom']"... not everyone uses JSONPath or even JSON, and
> the learning curve for the former is going to be steeper. The examples in
> the new KIP draft you published demonstrate this pretty well, and this is
> without diving into the details of what escape syntax would look like.
>
> 2. The three new major features that this syntax adds (array accesses, deep
> scans, and multi-value paths) could all be added pretty easily to the dot
> notation syntax without introducing brackets and dollar signs. Array
> accesses can be described using the same syntax as struct/map field access,
> deep scans can be described using '*', and multi-value paths can be
> described by referencing the name of a field that's expected to have
> children. These are all top-of-the-head ideas and can probably be refined,
> but hopefully they demonstrate that we can keep the syntax simple without
> sacrificing features. Of course, the question of leaving room for future
> features might arise... given that these are the out-of-the-box SMTs that
> are likely to be the first that many people encounter, I'd err on the side
> of simplicity and a gentle learning curve; if people need to get more out
> of transforms, the option to implement their own is still there. If we can
> address 95% of use cases with something easy to use, it's not worth making
> the feature harder for everyone to use just to accommodate the remaining
> 5%.
>
> 3. The advantage of leveraging an existing syntax is twofold: users who are
> already familiar with that syntax don't need to learn a new syntax, and
> maintainers of the syntax get to leverage existing libraries and
> documentation for the syntax. With the current proposal, reusing libraries
> is off the table, which means that we're going to have to parse this
> ourselves (including all the escape syntax edge cases), and we won't be
> able to automatically leverage new features added to that syntax. And given
> how stripped-down the syntax is in comparison to full JSONPath, there's
> still going to be a learning curve for users who are already familiar with
> it, and we'll still have to document how Connect's variant of JSONPath
> works either instead of or in addition to just linking to a well-maintained
> third-party docs site.
>
> On the topic of "field.path" and "include.path" vs. "field.style", I
> actually think that a single property per SMT is cleaner and simpler.
> Allowing users to mix and match styles within the same SMT config seems
> like a recipe for confusion, and with a single property that dictates field
> syntax behavior, we leave the door open to change the default at a later
> date. We could even fully remove support for plain field notation at some
> point and still be able to retain the simple property names of "field",
> "include", etc. instead of forcing people to use "field.path" and the like.
> That said, the term "field.style" and the permitted values might be a
> little ambiguous. One alternative, though a little heavy-handed, is to
> change it to "field.syntax.version" with permitted values of "V1" (default,
> equivalent to "field.style = plain") and "V2" (equivalent to "field.style =
> nested"). This would leave us room in the future to make further changes to
> the syntax without having to come up with new names, although it does
> sacrifice a little bit in that the permitted values are no longer
> self-describing.
>
> Cheers,
>
> Chris
>
> On Sun, May 15, 2022 at 4:51 PM Jorge Esteban Quilcate Otoya <
> quilcate.jo...@gmail.com> wrote:
>
> > Thank you all for your feedback, and sorry for the long wait for a reply.
> >
> > I would like to explore the idea of JSONPath-inspired/subset notation a
> bit
> > further:
> >
> > It will need to be a much-reduced version of JSONPath:
> > - No full support for JsonPath therefore an additional dependency.
> > - All paths must start with '$'
> > - No functions support or other operators allowed.
> > - JsonPath dotted and square-bracket notations can be supported to avoid
> > escaping dots or other characters: `$a.b.c` and `$['a.b']['c']` - or we
> > could only support the second one as it's more complete.
> > - Add support for arrays with `[]`, e.g. `$a.[1].b`
> > - Add support for multiple-value paths using array access `$a.*.b` or
> deep
> > scan `$a..b`.
> > - Some SMTs that will benefit from 

Re: [DISCUSS] KIP-821: Connect Transforms support for nested structures

2022-05-21 Thread Chris Egerton
Hi Jorge,

I really appreciate the effort you've made to simplify the syntax and
feature set of a JSONPath-based approach as much as possible. I'm still
hesitant to continue with it, though.

1. The syntax is much less friendly. Just compare "top.mid.bottom" to
"$['top']['mid']['bottom']"... not everyone uses JSONPath or even JSON, and
the learning curve for the former is going to be steeper. The examples in
the new KIP draft you published demonstrate this pretty well, and this is
without diving into the details of what escape syntax would look like.

2. The three new major features that this syntax adds (array accesses, deep
scans, and multi-value paths) could all be added pretty easily to the dot
notation syntax without introducing brackets and dollar signs. Array
accesses can be described using the same syntax as struct/map field access,
deep scans can be described using '*', and multi-value paths can be
described by referencing the name of a field that's expected to have
children. These are all top-of-the-head ideas and can probably be refined,
but hopefully they demonstrate that we can keep the syntax simple without
sacrificing features. Of course, the question of leaving room for future
features might arise... given that these are the out-of-the-box SMTs that
are likely to be the first that many people encounter, I'd err on the side
of simplicity and a gentle learning curve; if people need to get more out
of transforms, the option to implement their own is still there. If we can
address 95% of use cases with something easy to use, it's not worth making
the feature harder for everyone to use just to accommodate the remaining 5%.

3. The advantage of leveraging an existing syntax is twofold: users who are
already familiar with that syntax don't need to learn a new syntax, and
maintainers of the syntax get to leverage existing libraries and
documentation for the syntax. With the current proposal, reusing libraries
is off the table, which means that we're going to have to parse this
ourselves (including all the escape syntax edge cases), and we won't be
able to automatically leverage new features added to that syntax. And given
how stripped-down the syntax is in comparison to full JSONPath, there's
still going to be a learning curve for users who are already familiar with
it, and we'll still have to document how Connect's variant of JSONPath
works either instead of or in addition to just linking to a well-maintained
third-party docs site.

On the topic of "field.path" and "include.path" vs. "field.style", I
actually think that a single property per SMT is cleaner and simpler.
Allowing users to mix and match styles within the same SMT config seems
like a recipe for confusion, and with a single property that dictates field
syntax behavior, we leave the door open to change the default at a later
date. We could even fully remove support for plain field notation at some
point and still be able to retain the simple property names of "field",
"include", etc. instead of forcing people to use "field.path" and the like.
That said, the term "field.style" and the permitted values might be a
little ambiguous. One alternative, though a little heavy-handed, is to
change it to "field.syntax.version" with permitted values of "V1" (default,
equivalent to "field.style = plain") and "V2" (equivalent to "field.style =
nested"). This would leave us room in the future to make further changes to
the syntax without having to come up with new names, although it does
sacrifice a little bit in that the permitted values are no longer
self-describing.

Cheers,

Chris

On Sun, May 15, 2022 at 4:51 PM Jorge Esteban Quilcate Otoya <
quilcate.jo...@gmail.com> wrote:

> Thank you all for your feedback, and sorry for the long wait for a reply.
>
> I would like to explore the idea of JSONPath-inspired/subset notation a bit
> further:
>
> It will need to be a much-reduced version of JSONPath:
> - No full support for JsonPath therefore an additional dependency.
> - All paths must start with '$'
> - No functions support or other operators allowed.
> - JsonPath dotted and square-bracket notations can be supported to avoid
> escaping dots or other characters: `$a.b.c` and `$['a.b']['c']` - or we
> could only support the second one as it's more complete.
> - Add support for arrays with `[]`, e.g. `$a.[1].b`
> - Add support for multiple-value paths using array access `$a.*.b` or deep
> scan `$a..b`.
> - Some SMTs that will benefit from this: `MaskField`, `Cast`,
> `ReplaceField`.
> - We could introduce a `path[s]` config under the current configurations to
> apply this feature so no compatibility issues are introduced: e.g.
> `field.path`, `fields.paths`, `exclude.path`.
>
> With these,  100, 101, 102, and 103 will be effectively solved.
>
> The main challenge that I see at the moment is that by being JSON
> path-like, there may be some edge cases that I can't foresee at the moment,
> that could make this hard to implement, test, and maintain 

Re: [DISCUSS] KIP-821: Connect Transforms support for nested structures

2022-05-15 Thread Jorge Esteban Quilcate Otoya
Thank you all for your feedback, and sorry for the long wait for a reply.

I would like to explore the idea of JSONPath-inspired/subset notation a bit
further:

It will need to be a much-reduced version of JSONPath:
- No full support for JsonPath therefore an additional dependency.
- All paths must start with '$'
- No functions support or other operators allowed.
- JsonPath dotted and square-bracket notations can be supported to avoid
escaping dots or other characters: `$a.b.c` and `$['a.b']['c']` - or we
could only support the second one as it's more complete.
- Add support for arrays with `[]`, e.g. `$a.[1].b`
- Add support for multiple-value paths using array access `$a.*.b` or deep
scan `$a..b`.
- Some SMTs that will benefit from this: `MaskField`, `Cast`,
`ReplaceField`.
- We could introduce a `path[s]` config under the current configurations to
apply this feature so no compatibility issues are introduced: e.g.
`field.path`, `fields.paths`, `exclude.path`.

With these,  100, 101, 102, and 103 will be effectively solved.

The main challenge that I see at the moment is that by being JSON
path-like, there may be some edge cases that I can't foresee at the moment,
that could make this hard to implement, test, and maintain in the long run.

I'll appreciate your feedback on how this JSONPath-based alternative
compares to the dotted notation initially proposed, and if it matches your
feedback.
I have drafted a copy of the KIP to change the examples to JSONPath
approach and validate some differences:
https://cwiki.apache.org/confluence/x/9BihD

About 104. I agreed with Chris. We can handle this as part of a new KIP.

Thanks,
Jorge.

On Sun, 24 Apr 2022 at 17:27, Chris Egerton  wrote:

> Hi Joshua,
>
> I have a few reservations about using JsonPath notation here.
>
> 1. There's likely to be a substantial performance penalty for converting
> between the Kafka Connect format and something that a JsonPath library
> would understand.
>
> 2. The complexity of the feature will be significantly higher. It will be
> harder to test, document, and implement. There will be many more edge cases
> to consider and support, and we'll be on the hook to handle any bugs or
> inconsistencies that arise, either as a result of our use of the JsonPath
> library we choose, or as a result of bugs in that library.
>
> 3. It's not clear that JsonPath is superior or even suitable for some of
> the SMTs proposed here. What would be the advantages of JsonPath with the
> InsertField or HoistField SMTs?
>
> I also don't think that adding dot notation is unfriendly to users; many
> have proposed this type of syntax in the past, and it's frequently used in
> informal discussions to refer to nested fields. If the proposed syntax was
> not already intuitive then a case against deciding on our own might be more
> convincing, but as things stand, simple dot notation is likely going to be
> easier for users to understand than JsonPath syntax.
>
> Cheers,
>
> Chris
>
> On Fri, Apr 22, 2022 at 9:31 AM Joshua Grisham 
> wrote:
>
> > Hello all!
> >
> > Sorry that I come a bit later to the party here, but I am the one who
> wrote
> > KIP-683 [1] for recursive support (just simply looping through all child
> > non-primitive structures for the same matching name(s)) which is a
> slightly
> > different way to try and solve a similar requirement -- unfortunately at
> > the time the dev community was not quite as active and then I also got
> busy
> > with work and just life in general so wasn't able to follow up or push
> it.
> >
> > I do think it is a very good idea to have some kind of path-like
> expression
> > to be able to specifically address a nested field, as I can see that the
> > simple "recursive" case could potentially result in unwanted or
> unexpected
> > behavior, plus there is the potential to introduce a bit of a performance
> > hit to always loop through everything in cases where the schemas/values
> > might be quite large.
> >
> > One thing I wanted to ask: instead of creating a new "path parser"
> > including some bespoke or "borrowed" syntax, why not just use something
> > that already exists? Specifically here I am thinking about JsonPath (
> > https://github.com/json-path/JsonPath)
> >
> > There is already quite nice support in JsonPath for handling special
> > characters in field names, for handling different non-primitive types
> > (arrays etc), for handling multiple levels of nesting, etc etc.  Would it
> > be possible to instead to re-think this and maybe have some kind of
> > JsonPath-based Schema selector / updater and/or JsonPath-based Value
> > selector / updater? Conceptually this feels like it makes sense to me, as
> > from the top of my head it would be quite a natural fit to map a Json
> data
> > structure to the Connect API data structure (and you could potentially
> even
> > try to leverage the existing Json-to-Connect serializer/deserializer to
> > help out with this even in a more "out of the box"-feeling kind of 

Re: [DISCUSS] KIP-821: Connect Transforms support for nested structures

2022-04-24 Thread Chris Egerton
Hi Joshua,

I have a few reservations about using JsonPath notation here.

1. There's likely to be a substantial performance penalty for converting
between the Kafka Connect format and something that a JsonPath library
would understand.

2. The complexity of the feature will be significantly higher. It will be
harder to test, document, and implement. There will be many more edge cases
to consider and support, and we'll be on the hook to handle any bugs or
inconsistencies that arise, either as a result of our use of the JsonPath
library we choose, or as a result of bugs in that library.

3. It's not clear that JsonPath is superior or even suitable for some of
the SMTs proposed here. What would be the advantages of JsonPath with the
InsertField or HoistField SMTs?

I also don't think that adding dot notation is unfriendly to users; many
have proposed this type of syntax in the past, and it's frequently used in
informal discussions to refer to nested fields. If the proposed syntax was
not already intuitive then a case against deciding on our own might be more
convincing, but as things stand, simple dot notation is likely going to be
easier for users to understand than JsonPath syntax.

Cheers,

Chris

On Fri, Apr 22, 2022 at 9:31 AM Joshua Grisham 
wrote:

> Hello all!
>
> Sorry that I come a bit later to the party here, but I am the one who wrote
> KIP-683 [1] for recursive support (just simply looping through all child
> non-primitive structures for the same matching name(s)) which is a slightly
> different way to try and solve a similar requirement -- unfortunately at
> the time the dev community was not quite as active and then I also got busy
> with work and just life in general so wasn't able to follow up or push it.
>
> I do think it is a very good idea to have some kind of path-like expression
> to be able to specifically address a nested field, as I can see that the
> simple "recursive" case could potentially result in unwanted or unexpected
> behavior, plus there is the potential to introduce a bit of a performance
> hit to always loop through everything in cases where the schemas/values
> might be quite large.
>
> One thing I wanted to ask: instead of creating a new "path parser"
> including some bespoke or "borrowed" syntax, why not just use something
> that already exists? Specifically here I am thinking about JsonPath (
> https://github.com/json-path/JsonPath)
>
> There is already quite nice support in JsonPath for handling special
> characters in field names, for handling different non-primitive types
> (arrays etc), for handling multiple levels of nesting, etc etc.  Would it
> be possible to instead to re-think this and maybe have some kind of
> JsonPath-based Schema selector / updater and/or JsonPath-based Value
> selector / updater? Conceptually this feels like it makes sense to me, as
> from the top of my head it would be quite a natural fit to map a Json data
> structure to the Connect API data structure (and you could potentially even
> try to leverage the existing Json-to-Connect serializer/deserializer to
> help out with this even in a more "out of the box"-feeling kind of way).
>
> Maybe also as Tom mentioned, this part (in my example, this JsonPath-based
> "thing") could even be a generic API that could be used by any SMT,
> including used in custom ones built by the community.  Then I think to use
> a completely separate config property somehow related to "path" (as Tom
> also mentioned) would also make a lot of sense here as well. This way, if
> you select based on "path" then this JsonPath-based API would be used,
> otherwise it could use something similar to the existing get-field based
> approach (which I guess could also be refactored into some kind of utility
> / API as well if it made sense?)
>
> And with that in mind, if this was the kind of direction to go, then a
> "recursive" capability like I pitched in KIP-683 would also become
> unnecessary because you could easily write a JsonPath expression like
> "$..someRecuriveField" and it would do the same thing (on top of anything
> else you would want to do that is already supported by JsonPath). Then we
> could also kill that older KIP and do a bit of clean-up :)
>
> [1] -
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-683%3A+Add+recursive+support+to+Connect+Cast+and+ReplaceField+transforms%2C+and+support+for+casting+complex+types+to+either+a+native+or+JSON+string
>
> Just some extra food for thought. All-in-all I think this is a super great
> initiative!
>
> Best,
>
> Joshua
>
>
> Den fre 22 apr. 2022 kl 14:50 skrev Chris Egerton  >:
>
> > Hi Tom,
> >
> > Thanks for taking a look at this, and for your thoughtful comments. I'll
> > leave it up to Jorge to address most of your comments but I wanted to
> share
> > a couple quick thoughts I had regarding 103 and 104.
> >
> > 103. Like you, I was envisioning a possible syntax for array access that
> > used classic C-style brackets; e.g., `arr[index]`. However, I wonder 

Re: [DISCUSS] KIP-821: Connect Transforms support for nested structures

2022-04-22 Thread Joshua Grisham
Hello all!

Sorry that I come a bit later to the party here, but I am the one who wrote
KIP-683 [1] for recursive support (just simply looping through all child
non-primitive structures for the same matching name(s)) which is a slightly
different way to try and solve a similar requirement -- unfortunately at
the time the dev community was not quite as active and then I also got busy
with work and just life in general so wasn't able to follow up or push it.

I do think it is a very good idea to have some kind of path-like expression
to be able to specifically address a nested field, as I can see that the
simple "recursive" case could potentially result in unwanted or unexpected
behavior, plus there is the potential to introduce a bit of a performance
hit to always loop through everything in cases where the schemas/values
might be quite large.

One thing I wanted to ask: instead of creating a new "path parser"
including some bespoke or "borrowed" syntax, why not just use something
that already exists? Specifically here I am thinking about JsonPath (
https://github.com/json-path/JsonPath)

There is already quite nice support in JsonPath for handling special
characters in field names, for handling different non-primitive types
(arrays etc), for handling multiple levels of nesting, etc etc.  Would it
be possible to instead to re-think this and maybe have some kind of
JsonPath-based Schema selector / updater and/or JsonPath-based Value
selector / updater? Conceptually this feels like it makes sense to me, as
from the top of my head it would be quite a natural fit to map a Json data
structure to the Connect API data structure (and you could potentially even
try to leverage the existing Json-to-Connect serializer/deserializer to
help out with this even in a more "out of the box"-feeling kind of way).

Maybe also as Tom mentioned, this part (in my example, this JsonPath-based
"thing") could even be a generic API that could be used by any SMT,
including used in custom ones built by the community.  Then I think to use
a completely separate config property somehow related to "path" (as Tom
also mentioned) would also make a lot of sense here as well. This way, if
you select based on "path" then this JsonPath-based API would be used,
otherwise it could use something similar to the existing get-field based
approach (which I guess could also be refactored into some kind of utility
/ API as well if it made sense?)

And with that in mind, if this was the kind of direction to go, then a
"recursive" capability like I pitched in KIP-683 would also become
unnecessary because you could easily write a JsonPath expression like
"$..someRecuriveField" and it would do the same thing (on top of anything
else you would want to do that is already supported by JsonPath). Then we
could also kill that older KIP and do a bit of clean-up :)

[1] -
https://cwiki.apache.org/confluence/display/KAFKA/KIP-683%3A+Add+recursive+support+to+Connect+Cast+and+ReplaceField+transforms%2C+and+support+for+casting+complex+types+to+either+a+native+or+JSON+string

Just some extra food for thought. All-in-all I think this is a super great
initiative!

Best,

Joshua


Den fre 22 apr. 2022 kl 14:50 skrev Chris Egerton :

> Hi Tom,
>
> Thanks for taking a look at this, and for your thoughtful comments. I'll
> leave it up to Jorge to address most of your comments but I wanted to share
> a couple quick thoughts I had regarding 103 and 104.
>
> 103. Like you, I was envisioning a possible syntax for array access that
> used classic C-style brackets; e.g., `arr[index]`. However, I wonder if we
> could keep things simple and use the same syntax that we're proposing for
> nested field access? In other words, instead of `arr[index]`, you'd write
> `arr.index`. It'd save us and users the headache of reserving characters
> now that would need to be escaped even if their unescaped brethren aren't
> used for anything, and also avoid the question of what exactly we should do
> when we see a config that uses reserved characters that aren't yet
> supported (throwing an exception seems pretty unfriendly for new users).
>
> 104. This would probably be useful, but it would come with some nasty
> compatibility questions that would need to be addressed if we'd want SMTs
> that leverage this new API to be viable for older versions of Connect. If
> we package and distribute this feature as a library (either via an entirely
> new artifact, or as part of the existing connect-transforms or connect-api
> artifacts), then we'd have to either sidestep the existing plugin isolation
> logic [1] that basically makes it impossible for Connect plugins to ship
> their own versions of Connect artifacts, or issue a big warning to people
> that any SMT that uses this API won't work with any older versions of
> Connect. There's also some other features we might want to add in an
> SMT-utils library such as the existing, internal, utils that Connect uses
> right now [2]. It may be worth exploring 

Re: [DISCUSS] KIP-821: Connect Transforms support for nested structures

2022-04-22 Thread Chris Egerton
Hi Tom,

Thanks for taking a look at this, and for your thoughtful comments. I'll
leave it up to Jorge to address most of your comments but I wanted to share
a couple quick thoughts I had regarding 103 and 104.

103. Like you, I was envisioning a possible syntax for array access that
used classic C-style brackets; e.g., `arr[index]`. However, I wonder if we
could keep things simple and use the same syntax that we're proposing for
nested field access? In other words, instead of `arr[index]`, you'd write
`arr.index`. It'd save us and users the headache of reserving characters
now that would need to be escaped even if their unescaped brethren aren't
used for anything, and also avoid the question of what exactly we should do
when we see a config that uses reserved characters that aren't yet
supported (throwing an exception seems pretty unfriendly for new users).

104. This would probably be useful, but it would come with some nasty
compatibility questions that would need to be addressed if we'd want SMTs
that leverage this new API to be viable for older versions of Connect. If
we package and distribute this feature as a library (either via an entirely
new artifact, or as part of the existing connect-transforms or connect-api
artifacts), then we'd have to either sidestep the existing plugin isolation
logic [1] that basically makes it impossible for Connect plugins to ship
their own versions of Connect artifacts, or issue a big warning to people
that any SMT that uses this API won't work with any older versions of
Connect. There's also some other features we might want to add in an
SMT-utils library such as the existing, internal, utils that Connect uses
right now [2]. It may be worth exploring this in a separate KIP of its own.

[1] -
https://github.com/apache/kafka/blob/d480c4aa6e513e36050d8e067931de2270525d18/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/isolation/PluginUtils.java#L46-L143

[2] -
https://github.com/apache/kafka/tree/d480c4aa6e513e36050d8e067931de2270525d18/connect/transforms/src/main/java/org/apache/kafka/connect/transforms/util

Cheers,

Chris

On Fri, Apr 22, 2022 at 6:55 AM Tom Bentley  wrote:

> Hi Jorge,
>
> Thanks for the KIP, especially for the examples which are super-clear.
>
> 100. The name `field.style` isn't so clear for something like ReplaceField:
> it's not so obvious that field.style applies to `include` and `exclude`.
>
> 101. The permitted values for `field.style` don't seem terribly intuitive
> (to me anyway): the meaning of `plain` isn't very guessable. Why not
> `top-level` or `root` instead? Also `nested` could be misconstrued to mean
> nested-but-not-top-level, so perhaps `recursive` or `cascading` might be
> better?
>
> 102. I'm torn on whether making the interpretation of existing configs like
> `field` be dependent on `field.style` is a good idea. I can see that it's
> the simplest thing to do, but it just feels a bit odd that sometimes the
> `field` would actually be a path and have different escaping rules. An
> alternative would be to come up with a parallel set of config names (e.g.
> as well as "field" an SMT might support "path") which were defined to
> always take paths, thus avoiding the changeable interpretation of the
> existing configs. The SMT's #configure() would need to throw in the case
> that both configs were given. I can see that that would be more work in
> implementation, but it feels cleaner.
>
> 103. I think in order to allow for supporting arrays in a later KIP (which
> certainly seems like it could be useful), we'd want to specify the syntax
> now, even if it wasn't implemented under this KIP. That's because I don't
> think you can't exclude the possibility that characters such as `[` and `]`
> appear in field names. So you'd have a compatibility problem if people
> started using the features of this KIP to access such fields, only for
> those characters to change their meaning under a later KIP.
>
> 104. I also wonder whether making paths into a public Java API, for use by
> 3rd party SMTs, would be valuable.
>
> Thanks again,
>
> Tom
>
>
>
> On Wed, 20 Apr 2022 at 17:53, Chris Egerton 
> wrote:
>
> > ï’¯ Thanks Jorge, LGTM!
> >
> > On Wed, Apr 20, 2022, 12:40 Jorge Esteban Quilcate Otoya <
> > quilcate.jo...@gmail.com> wrote:
> >
> > > Thank you, Chris! Not possible without your feedback.
> > >
> > > On Tue, 19 Apr 2022 at 23:04, Chris Egerton 
> > > wrote:
> > >
> > > > Hi Jorge,
> > > >
> > > > Thank you for sticking through this. I have one small remark and one
> > > small
> > > > clarification; assuming you agree with me on them then I'm ready to
> > vote
> > > on
> > > > the KIP.
> > > >
> > > > 1. InsertField: The "field.on.missing.parent" and
> > > "field.on.existing.field"
> > > > docs both mention a permitted value of "ingore"; this should be
> > "ignore",
> > > > right?
> > > >
> > >
> > > Of course, one more typo :)
> > >
> > >
> > > > 2. InsertField: The examples are still missing the "field.style"
> > property
> 

Re: [DISCUSS] KIP-821: Connect Transforms support for nested structures

2022-04-22 Thread Tom Bentley
Hi Jorge,

Thanks for the KIP, especially for the examples which are super-clear.

100. The name `field.style` isn't so clear for something like ReplaceField:
it's not so obvious that field.style applies to `include` and `exclude`.

101. The permitted values for `field.style` don't seem terribly intuitive
(to me anyway): the meaning of `plain` isn't very guessable. Why not
`top-level` or `root` instead? Also `nested` could be misconstrued to mean
nested-but-not-top-level, so perhaps `recursive` or `cascading` might be
better?

102. I'm torn on whether making the interpretation of existing configs like
`field` be dependent on `field.style` is a good idea. I can see that it's
the simplest thing to do, but it just feels a bit odd that sometimes the
`field` would actually be a path and have different escaping rules. An
alternative would be to come up with a parallel set of config names (e.g.
as well as "field" an SMT might support "path") which were defined to
always take paths, thus avoiding the changeable interpretation of the
existing configs. The SMT's #configure() would need to throw in the case
that both configs were given. I can see that that would be more work in
implementation, but it feels cleaner.

103. I think in order to allow for supporting arrays in a later KIP (which
certainly seems like it could be useful), we'd want to specify the syntax
now, even if it wasn't implemented under this KIP. That's because I don't
think you can't exclude the possibility that characters such as `[` and `]`
appear in field names. So you'd have a compatibility problem if people
started using the features of this KIP to access such fields, only for
those characters to change their meaning under a later KIP.

104. I also wonder whether making paths into a public Java API, for use by
3rd party SMTs, would be valuable.

Thanks again,

Tom



On Wed, 20 Apr 2022 at 17:53, Chris Egerton  wrote:

> ï’¯ Thanks Jorge, LGTM!
>
> On Wed, Apr 20, 2022, 12:40 Jorge Esteban Quilcate Otoya <
> quilcate.jo...@gmail.com> wrote:
>
> > Thank you, Chris! Not possible without your feedback.
> >
> > On Tue, 19 Apr 2022 at 23:04, Chris Egerton 
> > wrote:
> >
> > > Hi Jorge,
> > >
> > > Thank you for sticking through this. I have one small remark and one
> > small
> > > clarification; assuming you agree with me on them then I'm ready to
> vote
> > on
> > > the KIP.
> > >
> > > 1. InsertField: The "field.on.missing.parent" and
> > "field.on.existing.field"
> > > docs both mention a permitted value of "ingore"; this should be
> "ignore",
> > > right?
> > >
> >
> > Of course, one more typo :)
> >
> >
> > > 2. InsertField: The examples are still missing the "field.style"
> property
> > > from the configurations. They should all include the property
> > > "transforms.smt1.field.style": "nested", correct?
> > >
> >
> > Yes, it is there. I think I know what you mean now, seems that Confluence
> > is putting everything in one line when it's in separate lines in the
> > editor.
> > Hopefully, it's fixed now.
> >
> >
> > >
> > > Thanks again for working through this, and congratulations on a
> > > well-written KIP!
> > >
> > > Cheers,
> > >
> > > Chris
> > >
> > > On Tue, Apr 19, 2022 at 2:06 PM Jorge Esteban Quilcate Otoya <
> > > quilcate.jo...@gmail.com> wrote:
> > >
> > > > Thank you, Chris! I apply these improvements to the KIP, let me know
> > how
> > > it
> > > > looks now.
> > > >
> > > > On Mon, 11 Apr 2022 at 23:43, Chris Egerton  >
> > > > wrote:
> > > >
> > > > > Hi Jorge,
> > > > >
> > > > > Wow, those examples are great! A few more remarks, but I think
> we're
> > > > > getting close:
> > > > >
> > > > > 1. The examples differ across SMTs with the name of the
> > > newly-introduced
> > > > > style property; some of them use "field.style", and some use
> > > > > "fields.style". I think for consistency's sake we should stick with
> > > just
> > > > > "field.style"; otherwise it could be painful for users to have to
> > > > remember
> > > > > which to use.
> > > > >
> > > >
> > > > Great catch. Agree, I fixed the config names to `field.style`.
> > > >
> > > >
> > > > >
> > > > > 2. Some of the examples are off:
> > > > > - TimestampConverter: the input in the second example ("when field
> > > names
> > > > > include dots") doesn't contain a field with a dotted name
> > > > > - ValueToKey: the config in the third example ("when field names
> > > include
> > > > > dots") should probably use "parent..child.k2" as the
> > > > > "transforms.smt1.fields" property
> > > > >
> > > >
> > > > Fixed. Thanks!
> > > >
> > > >
> > > > >
> > > > > 3. RE changes to InsertField:
> > > > > - The InsertField SMT should also come with the new "field.style"
> > > > property
> > > > > in order to preserve backwards compatibility, right? I don't see it
> > > > > included in the example configs for that one, just want to make
> sure
> > > > > - I don't know of any cases where we use snake_case for property
> > names
> > > in
> > > > > Kafka; we should probably 

Re: [DISCUSS] KIP-821: Connect Transforms support for nested structures

2022-04-20 Thread Chris Egerton
ï’¯ Thanks Jorge, LGTM!

On Wed, Apr 20, 2022, 12:40 Jorge Esteban Quilcate Otoya <
quilcate.jo...@gmail.com> wrote:

> Thank you, Chris! Not possible without your feedback.
>
> On Tue, 19 Apr 2022 at 23:04, Chris Egerton 
> wrote:
>
> > Hi Jorge,
> >
> > Thank you for sticking through this. I have one small remark and one
> small
> > clarification; assuming you agree with me on them then I'm ready to vote
> on
> > the KIP.
> >
> > 1. InsertField: The "field.on.missing.parent" and
> "field.on.existing.field"
> > docs both mention a permitted value of "ingore"; this should be "ignore",
> > right?
> >
>
> Of course, one more typo :)
>
>
> > 2. InsertField: The examples are still missing the "field.style" property
> > from the configurations. They should all include the property
> > "transforms.smt1.field.style": "nested", correct?
> >
>
> Yes, it is there. I think I know what you mean now, seems that Confluence
> is putting everything in one line when it's in separate lines in the
> editor.
> Hopefully, it's fixed now.
>
>
> >
> > Thanks again for working through this, and congratulations on a
> > well-written KIP!
> >
> > Cheers,
> >
> > Chris
> >
> > On Tue, Apr 19, 2022 at 2:06 PM Jorge Esteban Quilcate Otoya <
> > quilcate.jo...@gmail.com> wrote:
> >
> > > Thank you, Chris! I apply these improvements to the KIP, let me know
> how
> > it
> > > looks now.
> > >
> > > On Mon, 11 Apr 2022 at 23:43, Chris Egerton 
> > > wrote:
> > >
> > > > Hi Jorge,
> > > >
> > > > Wow, those examples are great! A few more remarks, but I think we're
> > > > getting close:
> > > >
> > > > 1. The examples differ across SMTs with the name of the
> > newly-introduced
> > > > style property; some of them use "field.style", and some use
> > > > "fields.style". I think for consistency's sake we should stick with
> > just
> > > > "field.style"; otherwise it could be painful for users to have to
> > > remember
> > > > which to use.
> > > >
> > >
> > > Great catch. Agree, I fixed the config names to `field.style`.
> > >
> > >
> > > >
> > > > 2. Some of the examples are off:
> > > > - TimestampConverter: the input in the second example ("when field
> > names
> > > > include dots") doesn't contain a field with a dotted name
> > > > - ValueToKey: the config in the third example ("when field names
> > include
> > > > dots") should probably use "parent..child.k2" as the
> > > > "transforms.smt1.fields" property
> > > >
> > >
> > > Fixed. Thanks!
> > >
> > >
> > > >
> > > > 3. RE changes to InsertField:
> > > > - The InsertField SMT should also come with the new "field.style"
> > > property
> > > > in order to preserve backwards compatibility, right? I don't see it
> > > > included in the example configs for that one, just want to make sure
> > > > - I don't know of any cases where we use snake_case for property
> names
> > in
> > > > Kafka; we should probably use "on.missing.parent" and
> > "on.existing.field"
> > > > as the new property names for InsertField.
> > > > - Why is the "on_existing_field" (or "on.existing.field") property
> only
> > > > applied when the field style is nested? Couldn't it be useful for
> > > > non-nested fields as well?
> > > >
> > >
> > > Great points! I have applied these suggestions to the KIP.
> > >
> > >
> > > >
> > > > Cheers,
> > > >
> > > > Chris
> > > >
> > > > On Sat, Apr 9, 2022 at 12:40 PM Jorge Esteban Quilcate Otoya <
> > > > quilcate.jo...@gmail.com> wrote:
> > > >
> > > > > Again, great feedback Chris. Much appreciated.
> > > > > Added my comments below:
> > > > >
> > > > > On Tue, 5 Apr 2022 at 20:22, Chris Egerton <
> fearthecel...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Jorge,
> > > > > >
> > > > > > Looking good! I have a few comments left but all but one or two
> are
> > > > > minor.
> > > > > >
> > > > > > 1. The motivation section states "This KIP is aimed to include
> > > support
> > > > > for
> > > > > > nested structures on the existing SMTs... and to include the
> > > > abstractions
> > > > > > to reuse this in future SMTs". A good implementation of this KIP
> > will
> > > > > > definitely isolate reusable logic into a separate abstraction
> that
> > > can
> > > > be
> > > > > > easily pulled in to the SMTs we want to add nested field support
> > to,
> > > > but
> > > > > > unless we plan on making this kind of abstraction publicly
> > available
> > > as
> > > > > > some kind of utility method or class that external SMT developers
> > can
> > > > > > leverage, we can probably leave this part out as it's more of an
> > > > > > implementation detail.
> > > > > >
> > > > >
> > > > > Make sense, will leave this out of the KIP.
> > > > >
> > > > >
> > > > > >
> > > > > > 2. The Cast example is a little misleading, isn't it? It
> > demonstrates
> > > > the
> > > > > > escape syntax for fields with dot literals in their names, but it
> > > > doesn't
> > > > > > demonstrate a way to actually use the Cast (or any other) SMT to
> > > > access a
> > > > > > nested field in a record, 

Re: [DISCUSS] KIP-821: Connect Transforms support for nested structures

2022-04-20 Thread Jorge Esteban Quilcate Otoya
Thank you, Chris! Not possible without your feedback.

On Tue, 19 Apr 2022 at 23:04, Chris Egerton  wrote:

> Hi Jorge,
>
> Thank you for sticking through this. I have one small remark and one small
> clarification; assuming you agree with me on them then I'm ready to vote on
> the KIP.
>
> 1. InsertField: The "field.on.missing.parent" and "field.on.existing.field"
> docs both mention a permitted value of "ingore"; this should be "ignore",
> right?
>

Of course, one more typo :)


> 2. InsertField: The examples are still missing the "field.style" property
> from the configurations. They should all include the property
> "transforms.smt1.field.style": "nested", correct?
>

Yes, it is there. I think I know what you mean now, seems that Confluence
is putting everything in one line when it's in separate lines in the editor.
Hopefully, it's fixed now.


>
> Thanks again for working through this, and congratulations on a
> well-written KIP!
>
> Cheers,
>
> Chris
>
> On Tue, Apr 19, 2022 at 2:06 PM Jorge Esteban Quilcate Otoya <
> quilcate.jo...@gmail.com> wrote:
>
> > Thank you, Chris! I apply these improvements to the KIP, let me know how
> it
> > looks now.
> >
> > On Mon, 11 Apr 2022 at 23:43, Chris Egerton 
> > wrote:
> >
> > > Hi Jorge,
> > >
> > > Wow, those examples are great! A few more remarks, but I think we're
> > > getting close:
> > >
> > > 1. The examples differ across SMTs with the name of the
> newly-introduced
> > > style property; some of them use "field.style", and some use
> > > "fields.style". I think for consistency's sake we should stick with
> just
> > > "field.style"; otherwise it could be painful for users to have to
> > remember
> > > which to use.
> > >
> >
> > Great catch. Agree, I fixed the config names to `field.style`.
> >
> >
> > >
> > > 2. Some of the examples are off:
> > > - TimestampConverter: the input in the second example ("when field
> names
> > > include dots") doesn't contain a field with a dotted name
> > > - ValueToKey: the config in the third example ("when field names
> include
> > > dots") should probably use "parent..child.k2" as the
> > > "transforms.smt1.fields" property
> > >
> >
> > Fixed. Thanks!
> >
> >
> > >
> > > 3. RE changes to InsertField:
> > > - The InsertField SMT should also come with the new "field.style"
> > property
> > > in order to preserve backwards compatibility, right? I don't see it
> > > included in the example configs for that one, just want to make sure
> > > - I don't know of any cases where we use snake_case for property names
> in
> > > Kafka; we should probably use "on.missing.parent" and
> "on.existing.field"
> > > as the new property names for InsertField.
> > > - Why is the "on_existing_field" (or "on.existing.field") property only
> > > applied when the field style is nested? Couldn't it be useful for
> > > non-nested fields as well?
> > >
> >
> > Great points! I have applied these suggestions to the KIP.
> >
> >
> > >
> > > Cheers,
> > >
> > > Chris
> > >
> > > On Sat, Apr 9, 2022 at 12:40 PM Jorge Esteban Quilcate Otoya <
> > > quilcate.jo...@gmail.com> wrote:
> > >
> > > > Again, great feedback Chris. Much appreciated.
> > > > Added my comments below:
> > > >
> > > > On Tue, 5 Apr 2022 at 20:22, Chris Egerton 
> > > > wrote:
> > > >
> > > > > Hi Jorge,
> > > > >
> > > > > Looking good! I have a few comments left but all but one or two are
> > > > minor.
> > > > >
> > > > > 1. The motivation section states "This KIP is aimed to include
> > support
> > > > for
> > > > > nested structures on the existing SMTs... and to include the
> > > abstractions
> > > > > to reuse this in future SMTs". A good implementation of this KIP
> will
> > > > > definitely isolate reusable logic into a separate abstraction that
> > can
> > > be
> > > > > easily pulled in to the SMTs we want to add nested field support
> to,
> > > but
> > > > > unless we plan on making this kind of abstraction publicly
> available
> > as
> > > > > some kind of utility method or class that external SMT developers
> can
> > > > > leverage, we can probably leave this part out as it's more of an
> > > > > implementation detail.
> > > > >
> > > >
> > > > Make sense, will leave this out of the KIP.
> > > >
> > > >
> > > > >
> > > > > 2. The Cast example is a little misleading, isn't it? It
> demonstrates
> > > the
> > > > > escape syntax for fields with dot literals in their names, but it
> > > doesn't
> > > > > demonstrate a way to actually use the Cast (or any other) SMT to
> > > access a
> > > > > nested field in a record, which is the whole point of the KIP. I
> like
> > > the
> > > > > example of escape syntax but we should probably also add one for
> > nested
> > > > > field access.
> > > > >
> > > >
> > > > Agree. I have added examples to each SMT to be more clear about how
> it
> > > > affects each function
> > > > .
> > > >
> > > >
> > > > >
> > > > > 3. With the InsertField SMT, I'm wondering what the specific
> behavior
> > > > will
> > > > > be when some or all 

Re: [DISCUSS] KIP-821: Connect Transforms support for nested structures

2022-04-19 Thread Chris Egerton
Hi Jorge,

Thank you for sticking through this. I have one small remark and one small
clarification; assuming you agree with me on them then I'm ready to vote on
the KIP.

1. InsertField: The "field.on.missing.parent" and "field.on.existing.field"
docs both mention a permitted value of "ingore"; this should be "ignore",
right?
2. InsertField: The examples are still missing the "field.style" property
from the configurations. They should all include the property
"transforms.smt1.field.style": "nested", correct?

Thanks again for working through this, and congratulations on a
well-written KIP!

Cheers,

Chris

On Tue, Apr 19, 2022 at 2:06 PM Jorge Esteban Quilcate Otoya <
quilcate.jo...@gmail.com> wrote:

> Thank you, Chris! I apply these improvements to the KIP, let me know how it
> looks now.
>
> On Mon, 11 Apr 2022 at 23:43, Chris Egerton 
> wrote:
>
> > Hi Jorge,
> >
> > Wow, those examples are great! A few more remarks, but I think we're
> > getting close:
> >
> > 1. The examples differ across SMTs with the name of the newly-introduced
> > style property; some of them use "field.style", and some use
> > "fields.style". I think for consistency's sake we should stick with just
> > "field.style"; otherwise it could be painful for users to have to
> remember
> > which to use.
> >
>
> Great catch. Agree, I fixed the config names to `field.style`.
>
>
> >
> > 2. Some of the examples are off:
> > - TimestampConverter: the input in the second example ("when field names
> > include dots") doesn't contain a field with a dotted name
> > - ValueToKey: the config in the third example ("when field names include
> > dots") should probably use "parent..child.k2" as the
> > "transforms.smt1.fields" property
> >
>
> Fixed. Thanks!
>
>
> >
> > 3. RE changes to InsertField:
> > - The InsertField SMT should also come with the new "field.style"
> property
> > in order to preserve backwards compatibility, right? I don't see it
> > included in the example configs for that one, just want to make sure
> > - I don't know of any cases where we use snake_case for property names in
> > Kafka; we should probably use "on.missing.parent" and "on.existing.field"
> > as the new property names for InsertField.
> > - Why is the "on_existing_field" (or "on.existing.field") property only
> > applied when the field style is nested? Couldn't it be useful for
> > non-nested fields as well?
> >
>
> Great points! I have applied these suggestions to the KIP.
>
>
> >
> > Cheers,
> >
> > Chris
> >
> > On Sat, Apr 9, 2022 at 12:40 PM Jorge Esteban Quilcate Otoya <
> > quilcate.jo...@gmail.com> wrote:
> >
> > > Again, great feedback Chris. Much appreciated.
> > > Added my comments below:
> > >
> > > On Tue, 5 Apr 2022 at 20:22, Chris Egerton 
> > > wrote:
> > >
> > > > Hi Jorge,
> > > >
> > > > Looking good! I have a few comments left but all but one or two are
> > > minor.
> > > >
> > > > 1. The motivation section states "This KIP is aimed to include
> support
> > > for
> > > > nested structures on the existing SMTs... and to include the
> > abstractions
> > > > to reuse this in future SMTs". A good implementation of this KIP will
> > > > definitely isolate reusable logic into a separate abstraction that
> can
> > be
> > > > easily pulled in to the SMTs we want to add nested field support to,
> > but
> > > > unless we plan on making this kind of abstraction publicly available
> as
> > > > some kind of utility method or class that external SMT developers can
> > > > leverage, we can probably leave this part out as it's more of an
> > > > implementation detail.
> > > >
> > >
> > > Make sense, will leave this out of the KIP.
> > >
> > >
> > > >
> > > > 2. The Cast example is a little misleading, isn't it? It demonstrates
> > the
> > > > escape syntax for fields with dot literals in their names, but it
> > doesn't
> > > > demonstrate a way to actually use the Cast (or any other) SMT to
> > access a
> > > > nested field in a record, which is the whole point of the KIP. I like
> > the
> > > > example of escape syntax but we should probably also add one for
> nested
> > > > field access.
> > > >
> > >
> > > Agree. I have added examples to each SMT to be more clear about how it
> > > affects each function
> > > .
> > >
> > >
> > > >
> > > > 3. With the InsertField SMT, I'm wondering what the specific behavior
> > > will
> > > > be when some or all of the "middle layer" nested fields are missing.
> > For
> > > > example, if I have a record with a value of { "k1": "v1 } and I apply
> > > > InsertField with topic.field = n1.n2.n3.topic, what will happen? Will
> > the
> > > > updated value become { "k1": "v1", "n1": { "n2": { "n3": "topic" }}},
> > > will
> > > > an exception be thrown, or something else? This seems analogous to
> the
> > > > command line mkdir command, which (at least on some operating
> systems)
> > > > fails by default if you try to create a new nested directory where
> > > anything
> > > > but the last element in the path doesn't exist, but 

Re: [DISCUSS] KIP-821: Connect Transforms support for nested structures

2022-04-19 Thread Jorge Esteban Quilcate Otoya
Thank you, Chris! I apply these improvements to the KIP, let me know how it
looks now.

On Mon, 11 Apr 2022 at 23:43, Chris Egerton  wrote:

> Hi Jorge,
>
> Wow, those examples are great! A few more remarks, but I think we're
> getting close:
>
> 1. The examples differ across SMTs with the name of the newly-introduced
> style property; some of them use "field.style", and some use
> "fields.style". I think for consistency's sake we should stick with just
> "field.style"; otherwise it could be painful for users to have to remember
> which to use.
>

Great catch. Agree, I fixed the config names to `field.style`.


>
> 2. Some of the examples are off:
> - TimestampConverter: the input in the second example ("when field names
> include dots") doesn't contain a field with a dotted name
> - ValueToKey: the config in the third example ("when field names include
> dots") should probably use "parent..child.k2" as the
> "transforms.smt1.fields" property
>

Fixed. Thanks!


>
> 3. RE changes to InsertField:
> - The InsertField SMT should also come with the new "field.style" property
> in order to preserve backwards compatibility, right? I don't see it
> included in the example configs for that one, just want to make sure
> - I don't know of any cases where we use snake_case for property names in
> Kafka; we should probably use "on.missing.parent" and "on.existing.field"
> as the new property names for InsertField.
> - Why is the "on_existing_field" (or "on.existing.field") property only
> applied when the field style is nested? Couldn't it be useful for
> non-nested fields as well?
>

Great points! I have applied these suggestions to the KIP.


>
> Cheers,
>
> Chris
>
> On Sat, Apr 9, 2022 at 12:40 PM Jorge Esteban Quilcate Otoya <
> quilcate.jo...@gmail.com> wrote:
>
> > Again, great feedback Chris. Much appreciated.
> > Added my comments below:
> >
> > On Tue, 5 Apr 2022 at 20:22, Chris Egerton 
> > wrote:
> >
> > > Hi Jorge,
> > >
> > > Looking good! I have a few comments left but all but one or two are
> > minor.
> > >
> > > 1. The motivation section states "This KIP is aimed to include support
> > for
> > > nested structures on the existing SMTs... and to include the
> abstractions
> > > to reuse this in future SMTs". A good implementation of this KIP will
> > > definitely isolate reusable logic into a separate abstraction that can
> be
> > > easily pulled in to the SMTs we want to add nested field support to,
> but
> > > unless we plan on making this kind of abstraction publicly available as
> > > some kind of utility method or class that external SMT developers can
> > > leverage, we can probably leave this part out as it's more of an
> > > implementation detail.
> > >
> >
> > Make sense, will leave this out of the KIP.
> >
> >
> > >
> > > 2. The Cast example is a little misleading, isn't it? It demonstrates
> the
> > > escape syntax for fields with dot literals in their names, but it
> doesn't
> > > demonstrate a way to actually use the Cast (or any other) SMT to
> access a
> > > nested field in a record, which is the whole point of the KIP. I like
> the
> > > example of escape syntax but we should probably also add one for nested
> > > field access.
> > >
> >
> > Agree. I have added examples to each SMT to be more clear about how it
> > affects each function
> > .
> >
> >
> > >
> > > 3. With the InsertField SMT, I'm wondering what the specific behavior
> > will
> > > be when some or all of the "middle layer" nested fields are missing.
> For
> > > example, if I have a record with a value of { "k1": "v1 } and I apply
> > > InsertField with topic.field = n1.n2.n3.topic, what will happen? Will
> the
> > > updated value become { "k1": "v1", "n1": { "n2": { "n3": "topic" }}},
> > will
> > > an exception be thrown, or something else? This seems analogous to the
> > > command line mkdir command, which (at least on some operating systems)
> > > fails by default if you try to create a new nested directory where
> > anything
> > > but the last element in the path doesn't exist, but can be invoked
> with a
> > > flag that instructs it to go ahead and create all levels of nested
> > > directory instead. I'm leaning on the side of "just create everything"
> > but
> > > would be interested in your thoughts, and either way, we should
> probably
> > > make sure the intended behavior is well-defined before voting.
> > >
> >
> > This is an interesting case, thanks for catching this!
> > The default behavior I see is to create parents if they are missing and
> > overwrite fields
> > if they already exist.
> > I'm planning to include the following two flags if there is a need to
> > overwrite this behavior:
> > - `on_missing_parent` = (CREATE|IGNORE), default=CREATE
> > - `on_existing_field` = (OVERWRITE|IGNORE), default=OVERWRITE
> >
> >
> > >
> > > 4. Similarly, what will the behavior be if any of the field elements
> > > specified with InsertField already exist in the record value? Will we
> > just
> > > overwrite them? 

Re: [DISCUSS] KIP-821: Connect Transforms support for nested structures

2022-04-11 Thread Chris Egerton
Hi Jorge,

Wow, those examples are great! A few more remarks, but I think we're
getting close:

1. The examples differ across SMTs with the name of the newly-introduced
style property; some of them use "field.style", and some use
"fields.style". I think for consistency's sake we should stick with just
"field.style"; otherwise it could be painful for users to have to remember
which to use.

2. Some of the examples are off:
- TimestampConverter: the input in the second example ("when field names
include dots") doesn't contain a field with a dotted name
- ValueToKey: the config in the third example ("when field names include
dots") should probably use "parent..child.k2" as the
"transforms.smt1.fields" property

3. RE changes to InsertField:
- The InsertField SMT should also come with the new "field.style" property
in order to preserve backwards compatibility, right? I don't see it
included in the example configs for that one, just want to make sure
- I don't know of any cases where we use snake_case for property names in
Kafka; we should probably use "on.missing.parent" and "on.existing.field"
as the new property names for InsertField.
- Why is the "on_existing_field" (or "on.existing.field") property only
applied when the field style is nested? Couldn't it be useful for
non-nested fields as well?

Cheers,

Chris

On Sat, Apr 9, 2022 at 12:40 PM Jorge Esteban Quilcate Otoya <
quilcate.jo...@gmail.com> wrote:

> Again, great feedback Chris. Much appreciated.
> Added my comments below:
>
> On Tue, 5 Apr 2022 at 20:22, Chris Egerton 
> wrote:
>
> > Hi Jorge,
> >
> > Looking good! I have a few comments left but all but one or two are
> minor.
> >
> > 1. The motivation section states "This KIP is aimed to include support
> for
> > nested structures on the existing SMTs... and to include the abstractions
> > to reuse this in future SMTs". A good implementation of this KIP will
> > definitely isolate reusable logic into a separate abstraction that can be
> > easily pulled in to the SMTs we want to add nested field support to, but
> > unless we plan on making this kind of abstraction publicly available as
> > some kind of utility method or class that external SMT developers can
> > leverage, we can probably leave this part out as it's more of an
> > implementation detail.
> >
>
> Make sense, will leave this out of the KIP.
>
>
> >
> > 2. The Cast example is a little misleading, isn't it? It demonstrates the
> > escape syntax for fields with dot literals in their names, but it doesn't
> > demonstrate a way to actually use the Cast (or any other) SMT to access a
> > nested field in a record, which is the whole point of the KIP. I like the
> > example of escape syntax but we should probably also add one for nested
> > field access.
> >
>
> Agree. I have added examples to each SMT to be more clear about how it
> affects each function
> .
>
>
> >
> > 3. With the InsertField SMT, I'm wondering what the specific behavior
> will
> > be when some or all of the "middle layer" nested fields are missing. For
> > example, if I have a record with a value of { "k1": "v1 } and I apply
> > InsertField with topic.field = n1.n2.n3.topic, what will happen? Will the
> > updated value become { "k1": "v1", "n1": { "n2": { "n3": "topic" }}},
> will
> > an exception be thrown, or something else? This seems analogous to the
> > command line mkdir command, which (at least on some operating systems)
> > fails by default if you try to create a new nested directory where
> anything
> > but the last element in the path doesn't exist, but can be invoked with a
> > flag that instructs it to go ahead and create all levels of nested
> > directory instead. I'm leaning on the side of "just create everything"
> but
> > would be interested in your thoughts, and either way, we should probably
> > make sure the intended behavior is well-defined before voting.
> >
>
> This is an interesting case, thanks for catching this!
> The default behavior I see is to create parents if they are missing and
> overwrite fields
> if they already exist.
> I'm planning to include the following two flags if there is a need to
> overwrite this behavior:
> - `on_missing_parent` = (CREATE|IGNORE), default=CREATE
> - `on_existing_field` = (OVERWRITE|IGNORE), default=OVERWRITE
>
>
> >
> > 4. Similarly, what will the behavior be if any of the field elements
> > specified with InsertField already exist in the record value? Will we
> just
> > overwrite them? What's the behavior of InsertField today under similar
> > circumstances?
> >
>
> The current behavior is to overwrite the value.
>
>
> >
> > Cheers,
> >
> > Chris
> >
> > On Thu, Mar 31, 2022 at 4:15 PM Jorge Esteban Quilcate Otoya <
> > quilcate.jo...@gmail.com> wrote:
> >
> > > Thanks, Chris! Much appreciated all the feedback here.
> > >
> > > 1. You nailed it setting the design goal here: "it shouldn't be
> > impossible
> > > to use this new feature for any field name, no matter how convoluted.
> > It's
> > > fine if edge 

Re: [DISCUSS] KIP-821: Connect Transforms support for nested structures

2022-04-09 Thread Jorge Esteban Quilcate Otoya
Again, great feedback Chris. Much appreciated.
Added my comments below:

On Tue, 5 Apr 2022 at 20:22, Chris Egerton  wrote:

> Hi Jorge,
>
> Looking good! I have a few comments left but all but one or two are minor.
>
> 1. The motivation section states "This KIP is aimed to include support for
> nested structures on the existing SMTs... and to include the abstractions
> to reuse this in future SMTs". A good implementation of this KIP will
> definitely isolate reusable logic into a separate abstraction that can be
> easily pulled in to the SMTs we want to add nested field support to, but
> unless we plan on making this kind of abstraction publicly available as
> some kind of utility method or class that external SMT developers can
> leverage, we can probably leave this part out as it's more of an
> implementation detail.
>

Make sense, will leave this out of the KIP.


>
> 2. The Cast example is a little misleading, isn't it? It demonstrates the
> escape syntax for fields with dot literals in their names, but it doesn't
> demonstrate a way to actually use the Cast (or any other) SMT to access a
> nested field in a record, which is the whole point of the KIP. I like the
> example of escape syntax but we should probably also add one for nested
> field access.
>

Agree. I have added examples to each SMT to be more clear about how it
affects each function
.


>
> 3. With the InsertField SMT, I'm wondering what the specific behavior will
> be when some or all of the "middle layer" nested fields are missing. For
> example, if I have a record with a value of { "k1": "v1 } and I apply
> InsertField with topic.field = n1.n2.n3.topic, what will happen? Will the
> updated value become { "k1": "v1", "n1": { "n2": { "n3": "topic" }}}, will
> an exception be thrown, or something else? This seems analogous to the
> command line mkdir command, which (at least on some operating systems)
> fails by default if you try to create a new nested directory where anything
> but the last element in the path doesn't exist, but can be invoked with a
> flag that instructs it to go ahead and create all levels of nested
> directory instead. I'm leaning on the side of "just create everything" but
> would be interested in your thoughts, and either way, we should probably
> make sure the intended behavior is well-defined before voting.
>

This is an interesting case, thanks for catching this!
The default behavior I see is to create parents if they are missing and
overwrite fields
if they already exist.
I'm planning to include the following two flags if there is a need to
overwrite this behavior:
- `on_missing_parent` = (CREATE|IGNORE), default=CREATE
- `on_existing_field` = (OVERWRITE|IGNORE), default=OVERWRITE


>
> 4. Similarly, what will the behavior be if any of the field elements
> specified with InsertField already exist in the record value? Will we just
> overwrite them? What's the behavior of InsertField today under similar
> circumstances?
>

The current behavior is to overwrite the value.


>
> Cheers,
>
> Chris
>
> On Thu, Mar 31, 2022 at 4:15 PM Jorge Esteban Quilcate Otoya <
> quilcate.jo...@gmail.com> wrote:
>
> > Thanks, Chris! Much appreciated all the feedback here.
> >
> > 1. You nailed it setting the design goal here: "it shouldn't be
> impossible
> > to use this new feature for any field name, no matter how convoluted.
> It's
> > fine if edge cases introduce difficulty (such as less-readable
> > configurations), but it's not fine if they can't be addressed at all."
> > Back to the previous proposals (using only dots as separators) we have 2
> > alternatives:
> > 1. escaping with backslashes
> > 2. escaping with dots itself
> >
> > I'll lean for alternative 2, as you proposed before. Feels to me that
> > backslashes have more potential to lead to confusion in JSON configs, and
> > CSV seems like a good precedent to use the same character to escape
> itself.
> > KIP is updated to reflect this.
> >
> > 2. Thanks! I'll add an example, and stick with the current approach
> > defining the style per individual transform configuration.
> >
> > 3. Yes, thanks! KIP updated.
> >
> > 4. Of course. KIP updated.
> >
> > On Mon, 28 Mar 2022 at 21:59, Chris Egerton 
> > wrote:
> >
> > > Hi Jorge,
> > >
> > > Thanks for addressing my comments; the KIP looks up-to-date and pretty
> > > readable now, and the rejected alternatives section does a great job of
> > > outlining the discussion so far and providing context for anyone else
> who
> > > might want to join in.
> > >
> > > 1. Thoughts on choice of delimiter:
> > > - I like the optimization for simple cases, but I think the new
> proposal
> > is
> > > a little too restrictive. What if there's a field whose name contains
> all
> > > of the permitted options (currently just ".", ",", and "/")?
> > > - If we expand the set of permitted delimiters to allow for any
> > > single-character string, configuration complexity will increase and
> > > readability may decrease
> > > - Also worth 

Re: [DISCUSS] KIP-821: Connect Transforms support for nested structures

2022-04-05 Thread Chris Egerton
Hi Jorge,

Looking good! I have a few comments left but all but one or two are minor.

1. The motivation section states "This KIP is aimed to include support for
nested structures on the existing SMTs... and to include the abstractions
to reuse this in future SMTs". A good implementation of this KIP will
definitely isolate reusable logic into a separate abstraction that can be
easily pulled in to the SMTs we want to add nested field support to, but
unless we plan on making this kind of abstraction publicly available as
some kind of utility method or class that external SMT developers can
leverage, we can probably leave this part out as it's more of an
implementation detail.

2. The Cast example is a little misleading, isn't it? It demonstrates the
escape syntax for fields with dot literals in their names, but it doesn't
demonstrate a way to actually use the Cast (or any other) SMT to access a
nested field in a record, which is the whole point of the KIP. I like the
example of escape syntax but we should probably also add one for nested
field access.

3. With the InsertField SMT, I'm wondering what the specific behavior will
be when some or all of the "middle layer" nested fields are missing. For
example, if I have a record with a value of { "k1": "v1 } and I apply
InsertField with topic.field = n1.n2.n3.topic, what will happen? Will the
updated value become { "k1": "v1", "n1": { "n2": { "n3": "topic" }}}, will
an exception be thrown, or something else? This seems analogous to the
command line mkdir command, which (at least on some operating systems)
fails by default if you try to create a new nested directory where anything
but the last element in the path doesn't exist, but can be invoked with a
flag that instructs it to go ahead and create all levels of nested
directory instead. I'm leaning on the side of "just create everything" but
would be interested in your thoughts, and either way, we should probably
make sure the intended behavior is well-defined before voting.

4. Similarly, what will the behavior be if any of the field elements
specified with InsertField already exist in the record value? Will we just
overwrite them? What's the behavior of InsertField today under similar
circumstances?

Cheers,

Chris

On Thu, Mar 31, 2022 at 4:15 PM Jorge Esteban Quilcate Otoya <
quilcate.jo...@gmail.com> wrote:

> Thanks, Chris! Much appreciated all the feedback here.
>
> 1. You nailed it setting the design goal here: "it shouldn't be impossible
> to use this new feature for any field name, no matter how convoluted. It's
> fine if edge cases introduce difficulty (such as less-readable
> configurations), but it's not fine if they can't be addressed at all."
> Back to the previous proposals (using only dots as separators) we have 2
> alternatives:
> 1. escaping with backslashes
> 2. escaping with dots itself
>
> I'll lean for alternative 2, as you proposed before. Feels to me that
> backslashes have more potential to lead to confusion in JSON configs, and
> CSV seems like a good precedent to use the same character to escape itself.
> KIP is updated to reflect this.
>
> 2. Thanks! I'll add an example, and stick with the current approach
> defining the style per individual transform configuration.
>
> 3. Yes, thanks! KIP updated.
>
> 4. Of course. KIP updated.
>
> On Mon, 28 Mar 2022 at 21:59, Chris Egerton 
> wrote:
>
> > Hi Jorge,
> >
> > Thanks for addressing my comments; the KIP looks up-to-date and pretty
> > readable now, and the rejected alternatives section does a great job of
> > outlining the discussion so far and providing context for anyone else who
> > might want to join in.
> >
> > 1. Thoughts on choice of delimiter:
> > - I like the optimization for simple cases, but I think the new proposal
> is
> > a little too restrictive. What if there's a field whose name contains all
> > of the permitted options (currently just ".", ",", and "/")?
> > - If we expand the set of permitted delimiters to allow for any
> > single-character string, configuration complexity will increase and
> > readability may decrease
> > - Also worth pointing out that there is some convention for doubling a
> > delimiter character as an escape mechanism with formats like CSV [1]
> > - Overall I think we may be approaching the saturation point for
> productive
> > discussion on delimiter syntax so I don't want to spend too much more of
> > your time on it. I think the one point I'd like to leave for now is that
> it
> > shouldn't be impossible to use this new feature for any field name, no
> > matter how convoluted. It's fine if edge cases introduce difficulty (such
> > as less-readable configurations), but it's not fine if they can't be
> > addressed at all.
> >
> > 2.
> > The configuration style where you define "transforms.field.style" in the
> > connector config, and then this applies to all SMTs for the connector, is
> > very interesting. However, it doesn't follow convention for existing
> SMTs.
> > Right now, if you want to 

Re: [DISCUSS] KIP-821: Connect Transforms support for nested structures

2022-03-31 Thread Jorge Esteban Quilcate Otoya
Thanks, Chris! Much appreciated all the feedback here.

1. You nailed it setting the design goal here: "it shouldn't be impossible
to use this new feature for any field name, no matter how convoluted. It's
fine if edge cases introduce difficulty (such as less-readable
configurations), but it's not fine if they can't be addressed at all."
Back to the previous proposals (using only dots as separators) we have 2
alternatives:
1. escaping with backslashes
2. escaping with dots itself

I'll lean for alternative 2, as you proposed before. Feels to me that
backslashes have more potential to lead to confusion in JSON configs, and
CSV seems like a good precedent to use the same character to escape itself.
KIP is updated to reflect this.

2. Thanks! I'll add an example, and stick with the current approach
defining the style per individual transform configuration.

3. Yes, thanks! KIP updated.

4. Of course. KIP updated.

On Mon, 28 Mar 2022 at 21:59, Chris Egerton  wrote:

> Hi Jorge,
>
> Thanks for addressing my comments; the KIP looks up-to-date and pretty
> readable now, and the rejected alternatives section does a great job of
> outlining the discussion so far and providing context for anyone else who
> might want to join in.
>
> 1. Thoughts on choice of delimiter:
> - I like the optimization for simple cases, but I think the new proposal is
> a little too restrictive. What if there's a field whose name contains all
> of the permitted options (currently just ".", ",", and "/")?
> - If we expand the set of permitted delimiters to allow for any
> single-character string, configuration complexity will increase and
> readability may decrease
> - Also worth pointing out that there is some convention for doubling a
> delimiter character as an escape mechanism with formats like CSV [1]
> - Overall I think we may be approaching the saturation point for productive
> discussion on delimiter syntax so I don't want to spend too much more of
> your time on it. I think the one point I'd like to leave for now is that it
> shouldn't be impossible to use this new feature for any field name, no
> matter how convoluted. It's fine if edge cases introduce difficulty (such
> as less-readable configurations), but it's not fine if they can't be
> addressed at all.
>
> 2.
> The configuration style where you define "transforms.field.style" in the
> connector config, and then this applies to all SMTs for the connector, is
> very interesting. However, it doesn't follow convention for existing SMTs.
> Right now, if you want to configure an SMT, you define its name in the
> connector config (for example, "transforms": "smt1"), and then define all
> of the properties for that SMT in the connector config using a namespacing
> mechanism specific to that SMT (for example, "transforms.smt1.prop1":
> "val1"). That SMT then sees only the properties defined in that namespace,
> with the prefix stripped (for example, "prop1": "val1") in its configure
> [2] [3] method.
> If we want to continue to follow this convention, then instead of
> specifying "transforms.field.style" in a connector config, we would expect
> users to configure "transforms..field.style", for each SMT that they
> want to configure a field style for. This would require more work on the
> part of the user, but would be simpler to reason about and easier to
> implement.
> If we want to explore an alternative where users can specify global
> properties that apply to all transforms in a connector config, then the
> semantics for this need to be defined in the KIP. This would have to
> include whether this will apply only for the special case of the
> "field.style" and possibly "field.separator" properties or if it would be
> available more generally for other properties, whether it will apply only
> for the SMTs outlined in the KIP or if the "field.style" and possibly
> "field.separator" properties would also be passed into custom SMTs so that
> they could choose to act on them if applicable, how edge cases like having
> an SMT named "field" in your connector config would be handled, etc.
> Either way, it might help to have an example in the KIP outlining how one
> of the to-be-augmented SMTs can be configured with this new feature and a
> before/after of how a record value would be transformed with that
> configuration.
>
> 3. The docstring for the "transforms.field.style" property mentions that
> the permitted values are "plain" and "nested", but then describes behavior
> with a value of "root". Should that be "plain" instead?
>
> 4. The docstring for the "transforms.field.separator" property exclusively
> mentions structs, but the feature is intended to work with maps as well.
> Can we update it to reflect this?
>
> References:
>
> [1] - https://stackoverflow.com/a/17808731
> [2] -
>
> https://github.com/apache/kafka/blob/7243facb8d69a7252e6b9556b5eaee13e41bab7f/connect/api/src/main/java/org/apache/kafka/connect/transforms/Transformation.java#L30
> [3] -
>
> 

Re: [DISCUSS] KIP-821: Connect Transforms support for nested structures

2022-03-28 Thread Chris Egerton
Hi Jorge,

Thanks for addressing my comments; the KIP looks up-to-date and pretty
readable now, and the rejected alternatives section does a great job of
outlining the discussion so far and providing context for anyone else who
might want to join in.

1. Thoughts on choice of delimiter:
- I like the optimization for simple cases, but I think the new proposal is
a little too restrictive. What if there's a field whose name contains all
of the permitted options (currently just ".", ",", and "/")?
- If we expand the set of permitted delimiters to allow for any
single-character string, configuration complexity will increase and
readability may decrease
- Also worth pointing out that there is some convention for doubling a
delimiter character as an escape mechanism with formats like CSV [1]
- Overall I think we may be approaching the saturation point for productive
discussion on delimiter syntax so I don't want to spend too much more of
your time on it. I think the one point I'd like to leave for now is that it
shouldn't be impossible to use this new feature for any field name, no
matter how convoluted. It's fine if edge cases introduce difficulty (such
as less-readable configurations), but it's not fine if they can't be
addressed at all.

2.
The configuration style where you define "transforms.field.style" in the
connector config, and then this applies to all SMTs for the connector, is
very interesting. However, it doesn't follow convention for existing SMTs.
Right now, if you want to configure an SMT, you define its name in the
connector config (for example, "transforms": "smt1"), and then define all
of the properties for that SMT in the connector config using a namespacing
mechanism specific to that SMT (for example, "transforms.smt1.prop1":
"val1"). That SMT then sees only the properties defined in that namespace,
with the prefix stripped (for example, "prop1": "val1") in its configure
[2] [3] method.
If we want to continue to follow this convention, then instead of
specifying "transforms.field.style" in a connector config, we would expect
users to configure "transforms..field.style", for each SMT that they
want to configure a field style for. This would require more work on the
part of the user, but would be simpler to reason about and easier to
implement.
If we want to explore an alternative where users can specify global
properties that apply to all transforms in a connector config, then the
semantics for this need to be defined in the KIP. This would have to
include whether this will apply only for the special case of the
"field.style" and possibly "field.separator" properties or if it would be
available more generally for other properties, whether it will apply only
for the SMTs outlined in the KIP or if the "field.style" and possibly
"field.separator" properties would also be passed into custom SMTs so that
they could choose to act on them if applicable, how edge cases like having
an SMT named "field" in your connector config would be handled, etc.
Either way, it might help to have an example in the KIP outlining how one
of the to-be-augmented SMTs can be configured with this new feature and a
before/after of how a record value would be transformed with that
configuration.

3. The docstring for the "transforms.field.style" property mentions that
the permitted values are "plain" and "nested", but then describes behavior
with a value of "root". Should that be "plain" instead?

4. The docstring for the "transforms.field.separator" property exclusively
mentions structs, but the feature is intended to work with maps as well.
Can we update it to reflect this?

References:

[1] - https://stackoverflow.com/a/17808731
[2] -
https://github.com/apache/kafka/blob/7243facb8d69a7252e6b9556b5eaee13e41bab7f/connect/api/src/main/java/org/apache/kafka/connect/transforms/Transformation.java#L30
[3] -
https://github.com/apache/kafka/blob/7243facb8d69a7252e6b9556b5eaee13e41bab7f/clients/src/main/java/org/apache/kafka/common/Configurable.java#L26-L29

Cheers,

Chris

On Mon, Mar 28, 2022 at 1:32 PM Jorge Esteban Quilcate Otoya <
quilcate.jo...@gmail.com> wrote:

> Thanks, Chris!
>
> 1. I'd argue "this..field.child" could be harder to grasp than
> "this.field/child" + separator: "/".
> Even though this represents additional information, it follows a similar
> approach as the "Flatten#delimeter" configuration.
> I want to give the separator approach another try, so I have updated the
> KIP with the separator proposal, sticking to only 2 alternatives that
> should hopefully cover most scenarios.
>
> 2. Agree. KIP has been updated with this improvement.
>
> 3. You're right. I have updated this section accordingly.
>
> 4. Good catch! I've replaced it with `DropHeaders`.
>
> Looking forward to your feedback.
>
> Thanks,
> Jorge.
>
> On Wed, 9 Mar 2022 at 21:33, Chris Egerton 
> wrote:
>
> > Hi Jorge,
> >
> > Looking good! Got a few more thoughts.
> >
> > 1. Sorry to revisit this, but I think we may want to adopt a slightly
> > 

Re: [DISCUSS] KIP-821: Connect Transforms support for nested structures

2022-03-28 Thread Jorge Esteban Quilcate Otoya
Thanks, Chris!

1. I'd argue "this..field.child" could be harder to grasp than
"this.field/child" + separator: "/".
Even though this represents additional information, it follows a similar
approach as the "Flatten#delimeter" configuration.
I want to give the separator approach another try, so I have updated the
KIP with the separator proposal, sticking to only 2 alternatives that
should hopefully cover most scenarios.

2. Agree. KIP has been updated with this improvement.

3. You're right. I have updated this section accordingly.

4. Good catch! I've replaced it with `DropHeaders`.

Looking forward to your feedback.

Thanks,
Jorge.

On Wed, 9 Mar 2022 at 21:33, Chris Egerton  wrote:

> Hi Jorge,
>
> Looking good! Got a few more thoughts.
>
> 1. Sorry to revisit this, but I think we may want to adopt a slightly
> different escape syntax style. Backslashes are great, but since they're
> already used by JSON, using them as an escape sequence in field notation
> would also lead to some pretty ugly connector configs. Anyone who's had to
> write regular expressions with backslashes in Java is probably already
> familiar with this: "this.is.not.very.readable". What do
> you think about using the dot character to escape itself? In other words,
> to access a single field named "this.field", instead of using the syntax
> "this\.field" (which in JSON would have to be expressed as "this\\.field"),
> we could use "this..field", and for a single field named "this\field",
> instead of using the syntax "this\\field" (or, in JSON, "thisfield"),
> we could use "this\field" (or, in JSON, "this\\field").
>
> 2. Could you flesh out the details on the new "field.style" property,
> including the type, default value, importance, and a preliminary docstring?
> See
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-618%3A+Exactly-Once+Support+for+Source+Connectors#KIP618:ExactlyOnceSupportforSourceConnectors-Newproperties
> for an example.
>
> 3. Is the "Compatibility, Deprecation, and Migration Plan" section still
> accurate after the latest update? Seems like it's still written with the
> assumption that nested field syntax will be hardcoded or opt-in, which IIUC
> isn't the case anymore.
>
> 4. Nit: The "These SMTs do not require nested structure support" section
> mentions a "Drop" SMT. I think this may be referring to the Confluent Drop
> SMT, which isn't a part of Apache Kafka. Should we drop (heh) that SMT from
> the list? Or perhaps just replace it with "DropHeaders", which is currently
> missing from the list and shouldn't require any nested-field related
> updates?
>

> Cheers,
>
> Chris
>
> On Mon, Feb 28, 2022 at 2:12 PM Jorge Esteban Quilcate Otoya <
> quilcate.jo...@gmail.com> wrote:
>
> > Thank you, Chris! and sorry for the delayed response.
> >
> > Please, find my comments below:
> >
> > On Mon, 14 Feb 2022 at 17:34, Chris Egerton  >
> > wrote:
> >
> > > Hi Jorge,
> > >
> > > Thanks for the KIP! I'd love to see support for nested fields added to
> > the
> > > out-of-the-box SMTs provided with Connect. Here are my initial
> thoughts:
> > >
> > > 1. I agree that there's a case to be made for expanding HoistField
> with a
> > > new config property for identifying a nested, to-be-hoisted field, but
> > the
> > > example in the KIP doesn't really demonstrate why this would be
> > valuable. I
> > > think it'd be helpful to expand the example to add other fields in
> order
> > to
> > > show how adding nested field support enables users to hoist a nested
> > field
> > > without dropping other fields from the value. Maybe something like
> this:
> > >
> > > source = nested.val
> > > field = line
> > >
> > > value (before):
> > > {
> > > "nested": {
> > > "val": 42,
> > > "other val": 96
> > > }
> > > }
> > >
> > > value (after):
> > > {
> > > "nested": {
> > > "line": {
> > > "val": 42,
> > > }
> > > "other val": 96
> > > }
> > > }
> > >
> > > 2. Nit: I think "source" is a little strange for the new HoistField
> > > property name. Maybe "hoisted" or "hoisted.field" would be more
> > > descriptive?
> > >
> > >
> > About 1. and 2.:
> > Agree. The example for this SMT is updated and have added the `hoisted`
> > configuration.
> >
> >
> > > 3. Is there a reasonable use case for expanding Flatten to be able to
> > > flatten specific fields? My understanding is that it's mostly useful
> for
> > > writing to systems like databases that don't support nested values and
> > > require everything to be a flat list of key-value pairs. Being able to
> > > flatten a nested field wouldn't provide any advantage for that use
> case.
> > > Are there other cases where it would?
> > >
> > > 4. I don't think we should unconditionally change the default delimiter
> > for
> > > Flatten. It's a backwards-incompatible, breaking 

Re: [DISCUSS] KIP-821: Connect Transforms support for nested structures

2022-03-09 Thread Chris Egerton
Hi Jorge,

Looking good! Got a few more thoughts.

1. Sorry to revisit this, but I think we may want to adopt a slightly
different escape syntax style. Backslashes are great, but since they're
already used by JSON, using them as an escape sequence in field notation
would also lead to some pretty ugly connector configs. Anyone who's had to
write regular expressions with backslashes in Java is probably already
familiar with this: "this.is.not.very.readable". What do
you think about using the dot character to escape itself? In other words,
to access a single field named "this.field", instead of using the syntax
"this\.field" (which in JSON would have to be expressed as "this\\.field"),
we could use "this..field", and for a single field named "this\field",
instead of using the syntax "this\\field" (or, in JSON, "thisfield"),
we could use "this\field" (or, in JSON, "this\\field").

2. Could you flesh out the details on the new "field.style" property,
including the type, default value, importance, and a preliminary docstring?
See
https://cwiki.apache.org/confluence/display/KAFKA/KIP-618%3A+Exactly-Once+Support+for+Source+Connectors#KIP618:ExactlyOnceSupportforSourceConnectors-Newproperties
for an example.

3. Is the "Compatibility, Deprecation, and Migration Plan" section still
accurate after the latest update? Seems like it's still written with the
assumption that nested field syntax will be hardcoded or opt-in, which IIUC
isn't the case anymore.

4. Nit: The "These SMTs do not require nested structure support" section
mentions a "Drop" SMT. I think this may be referring to the Confluent Drop
SMT, which isn't a part of Apache Kafka. Should we drop (heh) that SMT from
the list? Or perhaps just replace it with "DropHeaders", which is currently
missing from the list and shouldn't require any nested-field related
updates?

Cheers,

Chris

On Mon, Feb 28, 2022 at 2:12 PM Jorge Esteban Quilcate Otoya <
quilcate.jo...@gmail.com> wrote:

> Thank you, Chris! and sorry for the delayed response.
>
> Please, find my comments below:
>
> On Mon, 14 Feb 2022 at 17:34, Chris Egerton 
> wrote:
>
> > Hi Jorge,
> >
> > Thanks for the KIP! I'd love to see support for nested fields added to
> the
> > out-of-the-box SMTs provided with Connect. Here are my initial thoughts:
> >
> > 1. I agree that there's a case to be made for expanding HoistField with a
> > new config property for identifying a nested, to-be-hoisted field, but
> the
> > example in the KIP doesn't really demonstrate why this would be
> valuable. I
> > think it'd be helpful to expand the example to add other fields in order
> to
> > show how adding nested field support enables users to hoist a nested
> field
> > without dropping other fields from the value. Maybe something like this:
> >
> > source = nested.val
> > field = line
> >
> > value (before):
> > {
> > "nested": {
> > "val": 42,
> > "other val": 96
> > }
> > }
> >
> > value (after):
> > {
> > "nested": {
> > "line": {
> > "val": 42,
> > }
> > "other val": 96
> > }
> > }
> >
> > 2. Nit: I think "source" is a little strange for the new HoistField
> > property name. Maybe "hoisted" or "hoisted.field" would be more
> > descriptive?
> >
> >
> About 1. and 2.:
> Agree. The example for this SMT is updated and have added the `hoisted`
> configuration.
>
>
> > 3. Is there a reasonable use case for expanding Flatten to be able to
> > flatten specific fields? My understanding is that it's mostly useful for
> > writing to systems like databases that don't support nested values and
> > require everything to be a flat list of key-value pairs. Being able to
> > flatten a nested field wouldn't provide any advantage for that use case.
> > Are there other cases where it would?
> >
> > 4. I don't think we should unconditionally change the default delimiter
> for
> > Flatten. It's a backwards-incompatible, breaking change that could cause
> > headaches for users. It might be reasonable to change the default value
> > dynamically based on whether the user has specified a value for the
> "field"
> > property, but considering the motivation for changing the default is that
> > it creates conflicts with the to-be-introduced nested field syntax (which
> > could arise with downstream SMTs regardless of whether the user has
> > explicitly configured Flatten with the "field" property), I don't know
> that
> > this would be too useful either. I have some thoughts below on how to
> > handle possible conflicts between names with dots in their names and
> dotted
> > syntax for nested field references that should hopefully make either
> change
> > unnecessary.
> >
> >
> Fair enough. With the support for nested fields in other SMTs, Flatten
> could stay as it is.
> This removes the need for (4) changing Flatten config as well.
>

Re: [DISCUSS] KIP-821: Connect Transforms support for nested structures

2022-02-28 Thread Jorge Esteban Quilcate Otoya
Thank you, Chris! and sorry for the delayed response.

Please, find my comments below:

On Mon, 14 Feb 2022 at 17:34, Chris Egerton 
wrote:

> Hi Jorge,
>
> Thanks for the KIP! I'd love to see support for nested fields added to the
> out-of-the-box SMTs provided with Connect. Here are my initial thoughts:
>
> 1. I agree that there's a case to be made for expanding HoistField with a
> new config property for identifying a nested, to-be-hoisted field, but the
> example in the KIP doesn't really demonstrate why this would be valuable. I
> think it'd be helpful to expand the example to add other fields in order to
> show how adding nested field support enables users to hoist a nested field
> without dropping other fields from the value. Maybe something like this:
>
> source = nested.val
> field = line
>
> value (before):
> {
> "nested": {
> "val": 42,
> "other val": 96
> }
> }
>
> value (after):
> {
> "nested": {
> "line": {
> "val": 42,
> }
> "other val": 96
> }
> }
>
> 2. Nit: I think "source" is a little strange for the new HoistField
> property name. Maybe "hoisted" or "hoisted.field" would be more
> descriptive?
>
>
About 1. and 2.:
Agree. The example for this SMT is updated and have added the `hoisted`
configuration.


> 3. Is there a reasonable use case for expanding Flatten to be able to
> flatten specific fields? My understanding is that it's mostly useful for
> writing to systems like databases that don't support nested values and
> require everything to be a flat list of key-value pairs. Being able to
> flatten a nested field wouldn't provide any advantage for that use case.
> Are there other cases where it would?
>
> 4. I don't think we should unconditionally change the default delimiter for
> Flatten. It's a backwards-incompatible, breaking change that could cause
> headaches for users. It might be reasonable to change the default value
> dynamically based on whether the user has specified a value for the "field"
> property, but considering the motivation for changing the default is that
> it creates conflicts with the to-be-introduced nested field syntax (which
> could arise with downstream SMTs regardless of whether the user has
> explicitly configured Flatten with the "field" property), I don't know that
> this would be too useful either. I have some thoughts below on how to
> handle possible conflicts between names with dots in their names and dotted
> syntax for nested field references that should hopefully make either change
> unnecessary.
>
>
Fair enough. With the support for nested fields in other SMTs, Flatten
could stay as it is.
This removes the need for (4) changing Flatten config as well.


> 5. I think it's fine to expand ExtractField to support nested notation, but
> it might be worth noting in the rejected alternatives section that this
> isn't strictly necessary since you can replace any single invocation of
> that SMT that uses nested field notation with multiple invocations of it
> that use non-nested notation.
>

Agree. Adding it.


>
> 6. Nit: "RegerRouter" should be "RegexRouter" in the list of SMTs that do
> not require nested structure support.
>
>
Ack. Fixing it.


> 7. It may be rare for dots in field names to occur in the wild (although I
> wouldn't be so certain of this), but unless we want to inflict headaches on
> users of Flatten, I think we're going to have to think about conflicts
> between dotted notation and non-nested fields whose names contain dots. I
> don't think this is actually such a bad thing, though. I agree that dotted
> notation is intuitive and pretty commonplace (in tools like jq, for
> example), so I'd like it if we could stick to it. What about introducing
> escape syntax, using a backslash? That way, users could disambiguate
> between "this.field" (which would refer to the nested field "field" under
> the top-level "this" field), and "this\.field" (which would refer to the
> field named "this.field"). Like with most languages that use the backslash
> for escape sequences, it could also be used to escape itself, in the event
> that a field name contains a backslash. I think this is more intuitive and
> simpler than, e.g., adding a new config property to toggle the delimiter to
> be used when parsing nested field references.
>

I like this approach. Adding to the KIP.


>
> 8. I don't think we can unconditionally turn this feature on. The risk of
> breaking existing pipelines (especially ones that involve, for example, a
> combination of the Flatten and Cast SMTs) is pretty high. I think this
> should be an opt-in feature, at least until the next major release. One way
> we could accomplish this is by introducing a new "field.style" (name
> obviously subject to change) property with values of "plain" (default) and
> "nested". If set to "plain" then the 

Re: [DISCUSS] KIP-821: Connect Transforms support for nested structures

2022-02-14 Thread Chris Egerton
Hi Jorge,

Thanks for the KIP! I'd love to see support for nested fields added to the
out-of-the-box SMTs provided with Connect. Here are my initial thoughts:

1. I agree that there's a case to be made for expanding HoistField with a
new config property for identifying a nested, to-be-hoisted field, but the
example in the KIP doesn't really demonstrate why this would be valuable. I
think it'd be helpful to expand the example to add other fields in order to
show how adding nested field support enables users to hoist a nested field
without dropping other fields from the value. Maybe something like this:

source = nested.val
field = line

value (before):
{
"nested": {
"val": 42,
"other val": 96
}
}

value (after):
{
"nested": {
"line": {
"val": 42,
}
"other val": 96
}
}

2. Nit: I think "source" is a little strange for the new HoistField
property name. Maybe "hoisted" or "hoisted.field" would be more descriptive?

3. Is there a reasonable use case for expanding Flatten to be able to
flatten specific fields? My understanding is that it's mostly useful for
writing to systems like databases that don't support nested values and
require everything to be a flat list of key-value pairs. Being able to
flatten a nested field wouldn't provide any advantage for that use case.
Are there other cases where it would?

4. I don't think we should unconditionally change the default delimiter for
Flatten. It's a backwards-incompatible, breaking change that could cause
headaches for users. It might be reasonable to change the default value
dynamically based on whether the user has specified a value for the "field"
property, but considering the motivation for changing the default is that
it creates conflicts with the to-be-introduced nested field syntax (which
could arise with downstream SMTs regardless of whether the user has
explicitly configured Flatten with the "field" property), I don't know that
this would be too useful either. I have some thoughts below on how to
handle possible conflicts between names with dots in their names and dotted
syntax for nested field references that should hopefully make either change
unnecessary.

5. I think it's fine to expand ExtractField to support nested notation, but
it might be worth noting in the rejected alternatives section that this
isn't strictly necessary since you can replace any single invocation of
that SMT that uses nested field notation with multiple invocations of it
that use non-nested notation.

6. Nit: "RegerRouter" should be "RegexRouter" in the list of SMTs that do
not require nested structure support.

7. It may be rare for dots in field names to occur in the wild (although I
wouldn't be so certain of this), but unless we want to inflict headaches on
users of Flatten, I think we're going to have to think about conflicts
between dotted notation and non-nested fields whose names contain dots. I
don't think this is actually such a bad thing, though. I agree that dotted
notation is intuitive and pretty commonplace (in tools like jq, for
example), so I'd like it if we could stick to it. What about introducing
escape syntax, using a backslash? That way, users could disambiguate
between "this.field" (which would refer to the nested field "field" under
the top-level "this" field), and "this\.field" (which would refer to the
field named "this.field"). Like with most languages that use the backslash
for escape sequences, it could also be used to escape itself, in the event
that a field name contains a backslash. I think this is more intuitive and
simpler than, e.g., adding a new config property to toggle the delimiter to
be used when parsing nested field references.

8. I don't think we can unconditionally turn this feature on. The risk of
breaking existing pipelines (especially ones that involve, for example, a
combination of the Flatten and Cast SMTs) is pretty high. I think this
should be an opt-in feature, at least until the next major release. One way
we could accomplish this is by introducing a new "field.style" (name
obviously subject to change) property with values of "plain" (default) and
"nested". If set to "plain" then the current non-nested behavior is used,
and if set to "nested", then the proposed nested behavior is. We can
consider updating the default value to "nested" in a future major release
(or even codify that plan in this KIP, if there's enough support for it).
This would also leave the door open for adding recursive support to SMTs in
the future by adding a permitted value of "recursive".

9. One of the linked tickets in the "Motivation" section,
https://issues.apache.org/jira/browse/KAFKA-10640, has an open PR and KIP
that propose adding recursive support to some SMTs. Have you considered
this type of functionality for your KIP? Or is your aim to stick solely to