Thanks Almog for preparing this KIP!
I think it will improve usability and troubleshooting with JSON data a lot.

The finalized plan seems quite concrete now. I also liked that some
implementation specific implications (such as setting the ObjectMapper to
deserialize floating point as BigDecimal) are highlighted in the KIP.

Still, as I was reading the KIP, the main obstacle I encountered was around
terminology. I couldn't get used to reading "producer" and "consumer" and
not thinking in terms of Kafka producers and consumers - which are not
relevant to what this KIP proposes. Thus, I'd suggest replacing
"Producer(s)" with "Source Converter(s)" and "Consumer(s)" with "Sink
Converter(s)" (even if "Converter used by Source Connector" or "Converter
used by Sink Connector" would be even more accurate - maybe this could be
an explanation in a footnote). Terminology around converters has been
tricky in the past and adding producers/consumers in the mix might add to
the confusion.

Another example where I'd apply this different terminology would be to a
phrase such as the following:
"Because of this, users must take care to first ensure that all consumers
have upgraded to the new code before upgrading producers to make use of the
NUMERIC serialization format."
which I'd write
"Because of this, users must take care to first ensure that all sink
connectors have upgraded to the new converter code before upgrading source
connectors to make use of the NUMERIC serialization format in
JsonConverter."

Let me know if you think this suggestion makes the KIP easier to follow.
Otherwise I think it's a solid proposal.

I'm concluding with a couple of nits:

- "Upgraded Producer with BASE64 serialization, Legacy Consumer: this
scenario is okay as the upgraded ~producer~ consumer will be able to read
binary as today" (again according to my suggestion above, it could be as
the upgraded source converter ...)

- "consumers cannot consumer NUMERIC data. " -> "consumers cannot read
NUMERIC data"

Best,
Konstantine

On Fri, Aug 9, 2019 at 6:37 PM Almog Gavra <al...@confluent.io> wrote:

> Good catches! Fixed :)
>
> On Thu, Aug 8, 2019 at 10:36 PM Arjun Satish <arjun.sat...@gmail.com>
> wrote:
>
> > Cool!
> >
> > Couple of nits:
> >
> > - In public interfaces, typo: *json.decimal.serialization.fromat*
> > - In public interfaces, you use the term "HEX" instead of "BASE64".
> >
> >
> >
> > On Wed, Aug 7, 2019 at 9:51 AM Almog Gavra <al...@confluent.io> wrote:
> >
> > > EDIT: everywhere I've been using "HEX" I meant to be using "BASE64". I
> > will
> > > update the KIP to reflect this.
> > >
> > > On Wed, Aug 7, 2019 at 9:44 AM Almog Gavra <al...@confluent.io> wrote:
> > >
> > > > Thanks for the feedback Arjun! I'm happy changing the default config
> to
> > > > HEX instead of BINARY, no strong feelings there.
> > > >
> > > > I'll also clarify the example in the KIP to be clearer:
> > > >
> > > > - serialize the decimal field "foo" with value "10.2345" with the HEX
> > > > setting: {"foo": "D3J5"}
> > > > - serialize the decimal field "foo" with value "10.2345" with the
> > NUMERIC
> > > > setting: {"foo": 10.2345}
> > > >
> > > > With regards to the precision issue, that was my original concern as
> > well
> > > > (and why I originally suggested a TEXT format). Many JSON
> deserializers
> > > > (e.g. Jackson with
> DeserializationFeature.USE_BIG_DECIMAL_FOR_FLOATS),
> > > > however, have the ability to deserialize decimals correctly so I will
> > > > configure that as the default for Connect's JsonDeserializer. It's
> > > probably
> > > > a good idea to call out that using other deserializers must be done
> > with
> > > > care - I will add that documentation to the serialization config.
> > > >
> > > > Note that there would not be an issue on the _serialization_ side of
> > > > things as Jackson respects BigDecimal.
> > > >
> > > > Almog
> > > >
> > > > On Tue, Aug 6, 2019 at 11:23 PM Arjun Satish <arjun.sat...@gmail.com
> >
> > > > wrote:
> > > >
> > > >> hey Almog, nice work! couple of thoughts (hope I'm not late since
> you
> > > >> started the voting thread already):
> > > >>
> > > >> can you please add some examples to show the changes that you are
> > > >> proposing. makes me think that for a given decimal number, we will
> > have
> > > >> two
> > > >> encodings: “asHex” and “asNumber”.
> > > >>
> > > >> should we call the default config value “HEX” instead of “BINARY”?
> > > >>
> > > >> Should we call out the fact that JS systems might be susceptible to
> > > double
> > > >> precision round offs with the new numeric format? here are some
> people
> > > >> discussing a similar problem
> > > >> https://github.com/EventStore/EventStore/issues/1541
> > > >>
> > > >> On Tue, Aug 6, 2019 at 1:40 PM Almog Gavra <al...@confluent.io>
> > wrote:
> > > >>
> > > >> > Hello Everyone,
> > > >> >
> > > >> > Summarizing an in-person discussion with Randall (this is copied
> > from
> > > >> the
> > > >> > KIP):
> > > >> >
> > > >> > The original KIP suggested supporting an additional
> representation -
> > > >> base10
> > > >> > encoded text (e.g. `{"asText":"10.2345"}`). This causes issues
> > because
> > > >> it
> > > >> > is impossible to disambiguate between TEXT and BINARY without an
> > > >> additional
> > > >> > config - furthermore, this makes the migration from one to the
> other
> > > >> nearly
> > > >> > impossible because it would require that all consumers stop
> > consuming
> > > >> and
> > > >> > producers stop producing and atomically updating the config on all
> > of
> > > >> them
> > > >> > after deploying the new code, or waiting for the full retention
> > period
> > > >> to
> > > >> > pass - neither option is viable. The suggestion in the KIP is
> > strictly
> > > >> an
> > > >> > improvement over the existing behavior, even if it doesn't support
> > all
> > > >> > combinations.
> > > >> >
> > > >> > It seems that since most real-world use cases actually use the
> > numeric
> > > >> > representation (not string) we can consider this an improvement.
> > With
> > > >> the
> > > >> > new suggestion, we don't need a deserialization configuration
> (only
> > a
> > > >> > serialization option) and the consumers will be able to always
> > > >> > automatically determine the serialization format.
> > > >> >
> > > >> > Based on this, I'll be opening up the simplified version of the
> KIP
> > > to a
> > > >> > vote.
> > > >> >
> > > >> > Almog
> > > >> >
> > > >> > On Mon, Jul 29, 2019 at 9:29 AM Almog Gavra <al...@confluent.io>
> > > wrote:
> > > >> >
> > > >> > > I'm mostly happy with your current suggestion (two configs, one
> > for
> > > >> > > serialization and one for deserialization) and your
> implementation
> > > >> > > suggestion. One thing to note:
> > > >> > >
> > > >> > > > We should _always_ be able to deserialize a standard JSON
> > > >> > > > number as a decimal
> > > >> > >
> > > >> > > I was doing some research into decimals and JSON, and I can
> > imagine
> > > a
> > > >> > > compelling reason to require string representations to avoid
> > losing
> > > >> > > precision and to be certain that whomever is sending the data
> > isn't
> > > >> > losing
> > > >> > > precision (e.g. https://stackoverflow.com/a/38357877/2258040).
> > > >> > >
> > > >> > > I'm okay with always allowing numerics, but thought it's worth
> > > raising
> > > >> > the
> > > >> > > thought.
> > > >> > >
> > > >> > > On Mon, Jul 29, 2019 at 4:57 AM Andy Coates <a...@confluent.io>
> > > >> wrote:
> > > >> > >
> > > >> > >> The way I see it, we need to control two seperate things:
> > > >> > >>
> > > >> > >> 1. How do we _deserialize_ a decimal type if we encounter a
> text
> > > >> node in
> > > >> > >> the JSON?    (We should _always_ be able to deserialize a
> > standard
> > > >> JSON
> > > >> > >> number as a decimal).
> > > >> > >> 2. How do we chose how we want decimals to be _serialized_.
> > > >> > >>
> > > >> > >> This looks to fits well with your second suggestion of slightly
> > > >> > different
> > > >> > >> configs names for serialization vs deserialization.
> > > >> > >> a, For deserialization we only care about how to handle text
> > > nodes: `
> > > >> > >> deserialization.decimal.*text*.format`, which should only have
> > two
> > > >> valid
> > > >> > >> values BINARY | TEXT.
> > > >> > >> b. For serialization we need all three:
> > > >> `serialization.decimal.format`,
> > > >> > >> which should support all three options: BINARY | TEXT |
> NUMERIC.
> > > >> > >>
> > > >> > >> Implementation wise, I think these should be two separate
> enums,
> > > >> rather
> > > >> > >> than one shared enum and throwing an error if the deserializer
> is
> > > >> set to
> > > >> > >> NUMERIC.  Mainly as this means the enums reflect the options
> > > >> available,
> > > >> > >> rather than this being hidden in config checking code.  But
> > that's
> > > a
> > > >> > minor
> > > >> > >> implementation detail.
> > > >> > >>
> > > >> > >> Personally, I'd be tempted to have the BINARY value named
> > something
> > > >> like
> > > >> > >> `LEGACY` or `LEGACY_BINARY` as a way of encouraging users to
> move
> > > >> away
> > > >> > >> from
> > > >> > >> it.
> > > >> > >>
> > > >> > >> It's a real shame that both of these settings require a default
> > of
> > > >> > BINARY
> > > >> > >> for backwards compatibility, but I agree that discussions /
> plans
> > > >> around
> > > >> > >> switching the defaults should not block this KIP.
> > > >> > >>
> > > >> > >> Andy
> > > >> > >>
> > > >> > >>
> > > >> > >> On Thu, 25 Jul 2019 at 18:26, Almog Gavra <al...@confluent.io>
> > > >> wrote:
> > > >> > >>
> > > >> > >> > Thanks for the replies Andy and Andrew (2x Andy?)!
> > > >> > >> >
> > > >> > >> > > Is the text decimal a base16 encoded number, or is it
> base16
> > > >> encoded
> > > >> > >> > binary
> > > >> > >> > > form of the number?
> > > >> > >> >
> > > >> > >> > The conversion happens as
> decimal.unscaledValue().toByteArray()
> > > and
> > > >> > then
> > > >> > >> > the byte array is converted to a hex string, so it's
> definitely
> > > the
> > > >> > >> binary
> > > >> > >> > form of the number converted to base16. Whether or not that's
> > the
> > > >> same
> > > >> > >> as
> > > >> > >> > the base16 encoded number is a good question (toByteArray
> > > returns a
> > > >> > byte
> > > >> > >> > array containing a signed, big-endian, two's complement
> > > >> representation
> > > >> > >> of
> > > >> > >> > the big integer).
> > > >> > >> >
> > > >> > >> > > One suggestion I have is to change the proposed new config
> to
> > > >> only
> > > >> > >> affect
> > > >> > >> > > decimals stored as text, i.e. to switch between the current
> > > >> base16
> > > >> > and
> > > >> > >> > the
> > > >> > >> > > more common base10.   Then add another config to the
> > serializer
> > > >> only
> > > >> > >> that
> > > >> > >> > > controls if decimals should be serialized as text or
> numeric.
> > > >> > >> >
> > > >> > >> > I think we need to be able to handle all mappings from
> > > >> serialization
> > > >> > >> format
> > > >> > >> > to deserialization format (e.g. read in BINARY and output
> > TEXT),
> > > >> > which I
> > > >> > >> > think would be impossible with the alternative suggestion. I
> > > agree
> > > >> > that
> > > >> > >> > automatically deserializing numerics is valuable. I see two
> > other
> > > >> ways
> > > >> > >> to
> > > >> > >> > get this, both keeping the serialization.format config the
> > same:
> > > >> > >> >
> > > >> > >> > - have json.decimal.deserialization.format accept all three
> > > >> formats.
> > > >> > if
> > > >> > >> set
> > > >> > >> > to BINARY/TEXT, numerics would be automatically supported. If
> > set
> > > >> to
> > > >> > >> > NUMERIC, then any string coming in would result in
> > > deserialization
> > > >> > error
> > > >> > >> > (defaults to BINARY for backwards compatibility)
> > > >> > >> > - change json.decimal.deserialization.format to
> > > >> > >> > json.decimal.deserialization.string.format which accepts only
> > > >> > >> BINARY/TEXT
> > > >> > >> > (defaults to BINARY for backwards compatibility)
> > > >> > >> >
> > > >> > >> > > would be a breaking change in that things that previously
> > > failed
> > > >> > would
> > > >> > >> > > suddenly start deserializing.  This is a price I'm willing
> to
> > > >> pay.
> > > >> > >> >
> > > >> > >> > I agree. I'm willing to pay this price too.
> > > >> > >> >
> > > >> > >> > > IMHO, we should then plan to switch the default of decimal
> > > >> > >> serialization
> > > >> > >> > to
> > > >> > >> > > numeric, and text serialization to base 10 in the next
> major
> > > >> > release.
> > > >> > >> >
> > > >> > >> > I think that can be a separate discussion, I don't want to
> > block
> > > >> this
> > > >> > >> KIP
> > > >> > >> > on it.
> > > >> > >> >
> > > >> > >> > Thoughts?
> > > >> > >> >
> > > >> > >> > On Thu, Jul 25, 2019 at 6:35 AM Andrew Otto <
> > o...@wikimedia.org>
> > > >> > wrote:
> > > >> > >> >
> > > >> > >> > > This is a bit orthogonal, but in JsonSchemaConverter I use
> > > >> > >> JSONSchemas to
> > > >> > >> > > indicate whether a JSON number should be deserialized as an
> > > >> integer
> > > >> > >> or a
> > > >> > >> > > decimal
> > > >> > >> > > <
> > > >> > >> > >
> > > >> > >> >
> > > >> > >>
> > > >> >
> > > >>
> > >
> >
> https://github.com/ottomata/kafka-connect-jsonschema/blob/master/src/main/java/org/wikimedia/kafka/connect/jsonschema/JsonSchemaConverter.java#L251-L261
> > > >> > >> > > >.
> > > >> > >> > > Not everyone is going to have JSONSchemas available when
> > > >> converting,
> > > >> > >> but
> > > >> > >> > if
> > > >> > >> > > you do, it is an easy way to support JSON numbers as
> > decimals.
> > > >> > >> > >
> > > >> > >> > > Carry on! :)
> > > >> > >> > >
> > > >> > >> > > On Thu, Jul 25, 2019 at 9:12 AM Andy Coates <
> > a...@confluent.io
> > > >
> > > >> > >> wrote:
> > > >> > >> > >
> > > >> > >> > > > Hi Almog,
> > > >> > >> > > >
> > > >> > >> > > > Like the KIP - I think being able to support decimals in
> > JSON
> > > >> in
> > > >> > the
> > > >> > >> > same
> > > >> > >> > > > way most other systems do is a great improvement.
> > > >> > >> > > >
> > > >> > >> > > > It's not 100% clear to me from the KIP what the current
> > > format
> > > >> is.
> > > >> > >> Is
> > > >> > >> > > the
> > > >> > >> > > > text decimal a base16 encoded number, or is it base16
> > encoded
> > > >> > binary
> > > >> > >> > form
> > > >> > >> > > > of the number? (I've not tried to get my head around if
> > these
> > > >> two
> > > >> > >> are
> > > >> > >> > > even
> > > >> > >> > > > different!)
> > > >> > >> > > >
> > > >> > >> > > > One suggestion I have is to change the proposed new
> config
> > to
> > > >> only
> > > >> > >> > affect
> > > >> > >> > > > decimals stored as text, i.e. to switch between the
> current
> > > >> base16
> > > >> > >> and
> > > >> > >> > > the
> > > >> > >> > > > more common base10.   Then add another config to the
> > > serialzier
> > > >> > only
> > > >> > >> > that
> > > >> > >> > > > controls if decimals should be serialized as text or
> > numeric.
> > > >> The
> > > >> > >> > > benefit
> > > >> > >> > > > of this approach is it allows us to enhance the
> > deserializer
> > > to
> > > >> > >> > > > automatically handle numeric decimals even without any
> > config
> > > >> > >> having to
> > > >> > >> > > be
> > > >> > >> > > > set, i.e. default config in the deserializer would be
> able
> > to
> > > >> > handle
> > > >> > >> > > > numeric decimals.  Of course, this is a two edged sword:
> > this
> > > >> > would
> > > >> > >> > make
> > > >> > >> > > > the deserializer work out of the box with numeric
> decimals,
> > > >> > (yay!),
> > > >> > >> but
> > > >> > >> > > > would be a breaking change in that things that previously
> > > >> failed
> > > >> > >> would
> > > >> > >> > > > suddenly start deserializing.  This is a price I'm
> willing
> > to
> > > >> pay.
> > > >> > >> > > >
> > > >> > >> > > > IMHO, we should then plan to switch the default of
> decimal
> > > >> > >> > serialization
> > > >> > >> > > to
> > > >> > >> > > > numeric, and text serialization to base 10 in the next
> > major
> > > >> > >> release.
> > > >> > >> > > > (With upgrade notes to match). Though I know this is more
> > > >> > >> contentious,
> > > >> > >> > I
> > > >> > >> > > > think it moves us forward in a much more standard way
> that
> > > the
> > > >> > >> current
> > > >> > >> > > > encoding of decimals.
> > > >> > >> > > >
> > > >> > >> > > > On Tue, 25 Jun 2019 at 01:03, Almog Gavra <
> > > al...@confluent.io>
> > > >> > >> wrote:
> > > >> > >> > > >
> > > >> > >> > > > > Hi Everyone!
> > > >> > >> > > > >
> > > >> > >> > > > > Kicking off discussion for a new KIP:
> > > >> > >> > > > >
> > > >> > >> > > > >
> > > >> > >> > > >
> > > >> > >> > >
> > > >> > >> >
> > > >> > >>
> > > >> >
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-481%3A+SerDe+Improvements+for+Connect+Decimal+type+in+JSON
> > > >> > >> > > > >
> > > >> > >> > > > > For those who are interested, I have a prototype
> > > >> implementation
> > > >> > >> that
> > > >> > >> > > > helped
> > > >> > >> > > > > guide my design:
> https://github.com/agavra/kafka/pull/1
> > > >> > >> > > > >
> > > >> > >> > > > > Cheers,
> > > >> > >> > > > > Almog
> > > >> > >> > > > >
> > > >> > >> > > >
> > > >> > >> > >
> > > >> > >> >
> > > >> > >>
> > > >> > >
> > > >> >
> > > >>
> > > >
> > >
> >
>

Reply via email to