The JSON number grammar allows arbitrarily long numbers, but RFC 8259 Section 6
gives everyone a free pass to back JSON numbers with IEEE 754 doubles in their
implementations. "This specification allows" is, in effect, a normative limit:
such implementations are not wrong, they are simply doing what the spec
explicitly lets them do.

" This specification allows implementations to set limits on the range
   and precision of numbers accepted.  Since software that implements
   IEEE 754 binary64 (double precision) numbers [IEEE754] is generally
   available and widely used, good interoperability can be achieved by
   implementations that expect no more precision or range than these
   provide, in the sense that implementations will approximate JSON
   numbers within the expected precision."

You are correct that the entire mapping of long is problematic, but for
nanosecond timestamps it goes further: every _value_ anywhere near the present
day falls outside that interoperable range.
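
To make that concrete, here is a quick sketch. The example value is an
arbitrary nanosecond timestamp from early January 2025, made up for this
illustration:

   JSON number as written:                 1736431200123456789
   Nearest IEEE 754 binary64 value:        1736431200123456768
   Exact-integer limit (2^53 - 1):            9007199254740991

Any parser that backs numbers with doubles (JavaScript's JSON.parse, for
example) silently substitutes the second value for the first, and the original
timestamp cannot be recovered. Nanosecond timestamps have exceeded 2^53 since
roughly April 1970.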

You are also correct that the spec requires the decoding party to be in
possession of the original schema. That requirement severely limits the
usefulness of the JSON encoding and actively causes problems, because
developers approach JSON features expecting the output to be interoperable and
usable wherever JSON can be handled.
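
As a hypothetical illustration of why the schema is indispensable (the field
name below is made up): a timestamp-nanos field is serialized as its
underlying long, so the fragment

   {"eventTime": 1736431200123456789}

carries no hint that the value is a timestamp at all, let alone whether it is
in milliseconds, microseconds, or nanoseconds; only the original schema can
tell the reader how to interpret it.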

It is my assessment that Avro Schema is very well suited as a general-purpose
schema language for defining data structures. My Avrotize tool
(https://github.com/clemensv/avrotize/) proves that Avro Schema's structure
and extensibility make it a great "middle ground" for conversions between all
sorts of different schema models, with the extra benefit that the schemas
remain usable with the Avro serialization framework. At Microsoft, we are
using Avro Schema with a handful of annotation extensions (see
https://github.com/clemensv/avrotize/blob/master/specs/avrotize-schema.md) as
the canonical schema model inside Microsoft Fabric's data streaming features,
since we cannot build tooling for a dozen different schema formats and the
popular JSON Schema is absolutely awful to write tooling around.

It is also my assessment that the JSON encoding defined for the Avro
serialization framework is unusable for interoperability scenarios, and not
only because of the issue at hand. Give any developer who has ever written a
JSON document an Avro schema and ask them to craft a JSON document that
conforms to that schema: they will produce a document that any other developer
who looks at the document and the schema will nod at and say "looks right".
Yet that document will be vastly different from the structure the Avro spec
asks for.
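
To sketch the kind of divergence, take a hypothetical optional field declared
as ["null", "string"] (not part of the schema below). What practically
everyone writes:

   {"note": "optional garnish"}

What the Avro JSON encoding prescribes, because non-null union values must be
wrapped in a single-member object named after the branch type:

   {"note": {"string": "optional garnish"}}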

We've done this exercise with quite a few folks inside the company, but to
underline the point, I asked ChatGPT (o1 model) to act as one of those
"developers":

"create a JSON document conformant with this schema"

{
  "type": "record",
  "namespace": "com.example.recipes",
  "name": "Recipe",
  "doc": "Avro schema for describing a cooking recipe.",
  "fields": [
    {
      "name": "name",
      "type": "string",
      "doc": "Name of the recipe."
    },
    {
      "name": "ingredients",
      "type": {
        "type": "array",
        "items": {
          "type": "record",
          "name": "Ingredient",
          "doc": "Describes an ingredient and its quantity.",
          "fields": [
            {
              "name": "item",
              "type": "string",
              "doc": "Ingredient name."
            },
            {
              "name": "quantity",
              "type": "string",
              "doc": "Amount of the ingredient."
            }
          ]
        }
      },
      "doc": "List of ingredients."
    },
    {
      "name": "instructions",
      "type": {
        "type": "array",
        "items": "string"
      },
      "doc": "Cooking steps."
    },
    {
      "name": "servings",
      "type": "int",
      "doc": "Number of servings produced."
    },
    {
      "name": "prepTimeMinutes",
      "type": "int",
      "doc": "Minutes of preparation time."
    },
    {
      "name": "cookTimeMinutes",
      "type": "int",
      "doc": "Minutes of cooking time."
    }
  ]
}

The answer is unsurprisingly miles away from how the Avro spec wants it:

{
  "name": "Chocolate Cake",
  "ingredients": [
    {
      "item": "Flour",
      "quantity": "2 cups"
    },
    {
      "item": "Sugar",
      "quantity": "1.5 cups"
    },
    {
      "item": "Cocoa Powder",
      "quantity": "3/4 cup"
    },
    {
      "item": "Eggs",
      "quantity": "2"
    }
  ],
  "instructions": [
    "Preheat the oven to 350°F",
    "Grease a round cake pan",
    "Combine dry ingredients and mix well",
    "Add eggs and stir to form a batter",
    "Pour batter into the pan and bake for 30 minutes"
  ],
  "servings": 8,
  "prepTimeMinutes": 15,
  "cookTimeMinutes": 30
}

Now, for fun, I also asked it "create a JSON document conformant with this 
schema per the rules of the Avro JSON encoding" and it came up with literally 
the same document.



-----Original Message-----
From: glywk <glywk.cont...@gmail.com> 
Sent: Thursday, January 9, 2025 6:40 AM
To: dev@avro.apache.org
Subject: Re: Add support of time logical type with nanoseconds precision


About your timestamp remarks, the current Avro JSON encoding specification 
makes JSON:


   - not deserializable without the schema, as described in the "JSON
   Encoding" part [1].


*"Note that the original schema is still required to correctly process 
JSON-encoded data."*

   - not easily human readable, partly due to logical type
   serialisation [1]

*"A logical type is always serialized using its underlying Avro type so that 
values are encoded in exactly the same way as the equivalent Avro type that 
does not have a logicalType attribute."*

So, the interoperability problem you mentioned is not about timestamps
specifically but about all fields based on the long type, because they are
stored in memory as 64-bit signed integers.

As I interpret the BNF grammar of RFC 8259 Section 6 [2], number range and
precision are not limited, so the Avro long type does not break the RFC. But,
as suggested, implementations that limit integers to the IEEE 754 range may be
wrong.

[1] https://avro.apache.org/docs/1.12.0/specification
[2] https://www.rfc-editor.org/rfc/rfc8259#section-6

Regards

Le mer. 8 janv. 2025 à 14:54, Clemens Vasters <cleme...@microsoft.com.invalid> 
a écrit :

> I agree with your proposal staying within the range. However, you
> propose this to align with the nanosecond timestamps, and those are
> broken for JSON.
>
> Your proposal called that to my attention.
>
> Clemens
>
> -----Original Message-----
> From: glywk <glywk.cont...@gmail.com>
> Sent: Wednesday, January 8, 2025 12:15 AM
> To: dev@avro.apache.org
> Subject: Re: Add support of time logical type with nanoseconds 
> precision
>
>
> Hi,
>
> Your analysis is interesting, but it is about timestamps. My proposal is
> about adding nanosecond support to the time logical type. As described in
> AVRO-4043 [1], the maximum value of a time is 8.64E13. This value does not
> exceed the upper bound of 2^53-1 recommended for common interoperability
> with the IEEE 754 floating point representation.
>
> [1] https://issues.apache.org/jira/browse/AVRO-4043
>
> Regards
>
