I share your experience with the Avro JSON encoding. I use a lot of
decimals with different precisions and many date representations. We have
tried to use the JSON encoding to write test Avro documents, but it is
inefficient and error prone to write:

   - timestamps of (xxx) seconds from the Unix epoch, 1 January 1970
   00:00:00.(xxx) UTC
   - decimal numbers as an unscaled integer value in big-endian byte order,
   correctly encoded in Unicode (see the sketch below)
   - fields in the exact order defined in the schema
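To make the decimal case concrete, here is a minimal Python sketch of what
the encoding demands; the helper name and the "price" field are
illustrative only, not from any library:

import json

def decimal_to_avro_json(unscaled, size=None):
    """Render a decimal's unscaled integer as the Avro JSON 'bytes' value:
    two's-complement big-endian bytes, each byte mapped to the Unicode
    code point of the same value."""
    length = size or max(1, (unscaled.bit_length() + 8) // 8)
    raw = unscaled.to_bytes(length, byteorder="big", signed=True)
    return "".join(chr(b) for b in raw)

# 123.45 with scale 2 -> unscaled 12345 -> bytes 0x30 0x39 -> string "09"
print(json.dumps({"price": decimal_to_avro_json(12345)}))  # {"price": "09"}

Nothing in "09" hints at 123.45, which is exactly why hand-writing such
test documents is error prone.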

I agree that the JSON encoding should evolve toward a more human-centric
approach. Integrating into the JSON encoding specification the ability to
serialize/deserialize logical types either as their underlying Avro type
or as a better JSON representation would be appreciated.


On Thu, Jan 9, 2025 at 10:36, Clemens Vasters
<cleme...@microsoft.com.invalid> wrote:

> The JSON number grammar allows arbitrarily long numbers, but RFC 8259
> Section 6 gives everyone a free pass to back JSON numbers with IEEE 754
> doubles. "This specification allows" is effectively a normative limit:
> such implementations are not wrong; they are simply doing what the spec
> explicitly lets them do.
>
> " This specification allows implementations to set limits on the range
>    and precision of numbers accepted.  Since software that implements
>    IEEE 754 binary64 (double precision) numbers [IEEE754] is generally
>    available and widely used, good interoperability can be achieved by
>    implementations that expect no more precision or range than these
>    provide, in the sense that implementations will approximate JSON
>    numbers within the expected precision."
>
> You are correct that the entire mapping of long is problematic, but for
> nano-timestamps, it goes as far as every _value_ around the present day
> being outside the spec limit.
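> A quick Python check makes the nano-timestamp point concrete (the value
> is an arbitrary present-day instant, chosen for illustration):
>
> now_ns = 1_736_418_975_123_456_789   # ~2025-01-09T10:36:15.123456789Z in ns
> max_safe = 2**53 - 1                 # exact-integer limit of an IEEE 754 double
> print(now_ns > max_safe)             # True: outside the interoperable range
> print(int(float(now_ns)) == now_ns)  # False: the value does not survive a double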
>
> You are also correct that the spec mandates for the decoding party to be
> in possession of the schema. That requirement severely limits the
> usefulness of the JSON encoding and is actively causing problems since
> developers approach JSON features with the expectation that the output is
> interoperable and can be used wherever JSON can be handled.
>
> It is my assessment that Avro Schema is very well suited as a
> general-purpose schema language for defining data structures. My Avrotize
> tool (https://github.com/clemensv/avrotize/) proves that Avro Schema's
> structure and extensibility are a great "middle ground" for conversions
> between all sorts of different schema models, with the extra benefit of the
> schemas being usable with the Avro serialization framework. At Microsoft,
> we are using Avro Schema with a handful of annotation extensions (see
> https://github.com/clemensv/avrotize/blob/master/specs/avrotize-schema.md)
> as the canonical schema model inside Microsoft Fabric's data streaming
> features, since we can't build tooling for a dozen different schema formats
> and the popular JSON Schema is absolutely awful to write tooling around.
>
> It is also my assessment that the JSON encoding defined for the Avro
> serialization framework is unusable for interoperability scenarios, and
> not only because of the issue at hand. If you give any developer who has
> ever written a JSON document an Avro schema to look at and ask them to
> craft a JSON document that conforms to that schema, they will create a
> document that any other developer who looks at the document and the schema
> will nod at and say "looks right". Yet that document will be vastly
> different from the structure the Avro spec asks for.
>
> We've done this exercise with quite a few folks inside the company, but
> just to underline that point, I just asked ChatGPT (o1 model) as one of
> those "developers":
>
> "create a JSON document conformant with this schema"
>
> {
>   "type": "record",
>   "namespace": "com.example.recipes",
>   "name": "Recipe",
>   "doc": "Avro schema for describing a cooking recipe.",
>   "fields": [
>     {
>       "name": "name",
>       "type": "string",
>       "doc": "Name of the recipe."
>     },
>     {
>       "name": "ingredients",
>       "type": {
>         "type": "array",
>         "items": {
>           "type": "record",
>           "name": "Ingredient",
>           "doc": "Describes an ingredient and its quantity.",
>           "fields": [
>             {
>               "name": "item",
>               "type": "string",
>               "doc": "Ingredient name."
>             },
>             {
>               "name": "quantity",
>               "type": "string",
>               "doc": "Amount of the ingredient."
>             }
>           ]
>         }
>       },
>       "doc": "List of ingredients."
>     },
>     {
>       "name": "instructions",
>       "type": {
>         "type": "array",
>         "items": "string"
>       },
>       "doc": "Cooking steps."
>     },
>     {
>       "name": "servings",
>       "type": "int",
>       "doc": "Number of servings produced."
>     },
>     {
>       "name": "prepTimeMinutes",
>       "type": "int",
>       "doc": "Minutes of preparation time."
>     },
>     {
>       "name": "cookTimeMinutes",
>       "type": "int",
>       "doc": "Minutes of cooking time."
>     }
>   ]
> }
>
> The answer is unsurprisingly miles away from how the Avro spec wants it:
>
> {
>   "name": "Chocolate Cake",
>   "ingredients": [
>     {
>       "item": "Flour",
>       "quantity": "2 cups"
>     },
>     {
>       "item": "Sugar",
>       "quantity": "1.5 cups"
>     },
>     {
>       "item": "Cocoa Powder",
>       "quantity": "3/4 cup"
>     },
>     {
>       "item": "Eggs",
>       "quantity": "2"
>     }
>   ],
>   "instructions": [
>     "Preheat the oven to 350°F",
>     "Grease a round cake pan",
>     "Combine dry ingredients and mix well",
>     "Add eggs and stir to form a batter",
>     "Pour batter into the pan and bake for 30 minutes"
>   ],
>   "servings": 8,
>   "prepTimeMinutes": 15,
>   "cookTimeMinutes": 30
> }
>
> Now, for fun, I also asked it "create a JSON document conformant with this
> schema per the rules of the Avro JSON encoding" and it came up with
> literally the same document.
>
>
>
> -----Original Message-----
> From: glywk <glywk.cont...@gmail.com>
> Sent: Thursday, January 9, 2025 6:40 AM
> To: dev@avro.apache.org
> Subject: Re: Add support of time logical type with nanoseconds precision
>
> About your timestamp remarks, the current Avro JSON encoding specification
> makes JSON:
>
>
>    - not deserializable without the schema, as described in the "JSON
>    Encoding" section [1]:
>
>
> *"Note that the original schema is still required to correctly process
> JSON-encoded data."*
>
>    - not easily human readable, partly due to logical type
>    serialization [1]:
>
> *"A logical type is always serialized using its underlying Avro type so
> that values are encoded in exactly the same way as the equivalent Avro type
> that does not have a logicalType attribute."*
>
> So the interoperability problem you mention is not specific to timestamps;
> it affects all fields based on the long type, because they are stored in
> memory as 64-bit signed integers. The hypothetical record below
> illustrates both issues.
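>
> Per the spec, a union's non-null value is wrapped in a JSON object keyed
> by the branch's type name, and a logical type is emitted as its underlying
> type. For a hypothetical record with a nullable string field and a
> timestamp-micros field, the Avro JSON encoding therefore produces:
>
> {"nickname": {"string": "Ada"}, "ts": 1736418975123456}
>
> while the document a developer would naturally read and write is:
>
> {"nickname": "Ada", "ts": "2025-01-09T10:36:15.123456Z"}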
>
> As I interpret the grammar of RFC 8259 Section 6 [2], number range and
> precision are not limited, so the Avro long type does not break the RFC.
> But, as suggested, some implementations that limit integers to the IEEE
> 754 range may be wrong.
>
> [1] https://avro.apache.org/docs/1.12.0/specification
> [2] https://www.rfc-editor.org/rfc/rfc8259#section-6
>
> Regards
>
> On Wed, Jan 8, 2025 at 14:54, Clemens Vasters <cleme...@microsoft.com.invalid>
> wrote:
>
> > I agree with your proposal, which stays within the range. However, you
> > propose this to align with the nanosecond timestamps, and those are
> > broken for JSON.
> >
> > Your proposal called that to my attention.
> >
> > Clemens
> >
> > -----Original Message-----
> > From: glywk <glywk.cont...@gmail.com>
> > Sent: Wednesday, January 8, 2025 12:15 AM
> > To: dev@avro.apache.org
> > Subject: Re: Add support of time logical type with nanoseconds
> > precision
> >
> > Hi,
> >
> > Your analysis is interesting, but it is about timestamps. My proposal is
> > about adding nanosecond support to the time logical type. As described
> > in AVRO-4043 [1], the maximum value of a time is 8.64E13. This value
> > does not exceed the upper bound of 2^53-1 recommended for common
> > interoperability with the IEEE 754 floating point representation, as
> > the quick check below shows.
> >
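> > A minimal Python check of that bound (the values are computed here, not
> > taken from the ticket):
> >
> > >>> 24 * 60 * 60 * 1_000_000_000   # nanoseconds in a day, the maximum time-of-day value
> > 86400000000000
> > >>> 2**53 - 1                      # largest integer a double represents exactly
> > 9007199254740991
> >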
> > [1] https://issues.apache.org/jira/browse/AVRO-4043
> >
> > Regards
> >
>
