[GitHub] [avro] RyanSkraba commented on a change in pull request #805: AVRO-2299: Normalize Avro Standard Canonical Schema updated latest rebase.

GitBox Thu, 18 Jun 2020 01:32:23 -0700


RyanSkraba commented on a change in pull request #805:
URL: https://github.com/apache/avro/pull/805#discussion_r442039489




##########
File path: doc/src/content/xdocs/spec.xml
##########
@@ -1310,6 +1310,92 @@
         </ul>
       </section>
 
+      <section>
+        <title>Standard Canonical Form for Schemas</title>
+
+        <p>One of defined way to normalize the avro schema using
+          <em>Standard Canonical Form Transformation</em>. This involves
+          stripping unwanted properties and maintain same canonical
+          ordering. The canonical ordering involves ordering avro
+          reserved properties followed by custom properties if mentioned while
+          transforming. Normalization schema which helps to reduce the
+          total memory size of schema (removed unwanted properties and whitespace)
+          while transfer avro schema between two system and also reduce the parsing
+          time for compatibility check and schema evolution.
+        </p>
+
+        <p><em>Standard Canonical Form</em> is a transformation of a schema
+          into standard canonical ordered. It contains only avro reserved
+          properties <code>"name", "type", "fields", "symbols", "items", "values",
+            "logicalType", "size", "order", "doc", "aliases", "default"</code>
+          and <em>other (custom properties)</em> schema properties.
+        </p>
+
+        <section>
+          <title>Transforming into Standard Canonical Form</title>
+
+          <p>Assuming an input schema (in JSON form) that's already
+            UTF-8 text for a <em>valid</em> Avro schema (including all
+            quotes as required by JSON), the following transformations
+            will produce its Standard Canonical Form:</p>
+          <ul>
+            <li> [PRIMITIVES] Convert primitive schemas to their simple
+              form (e.g., <code>int</code> instead of
+              <code>{"type":"int"}</code>).</li>
+
+            <li> [FULLNAMES] Replace short names with fullnames, using
+              applicable namespaces to do so.  Then eliminate
+              <code>namespace</code> attributes, which are now redundant.</li>
+
+            <li> [STRIP] Keep only attributes that are relevant to
+              reserved properties, which are:
+              <code>type</code>, <code>name</code>,
+              <code>fields</code>, <code>symbols</code>,
+              <code>items</code>, <code>values</code>,
+              <code>logicalType</code>, <code>size</code>,
+              <code>order</code>, <code>doc</code>
+              <code>aliases</code> and <code>default</code>.
+              Strip all others user defined properties (e.g., <code>format</code>).</li>
+
+            <li> [ORDER] Order the appearance of fields of JSON objects
+              as follows: <code>name</code>, <code>type</code>,
+              <code>fields</code>, <code>symbols</code>,
+              <code>items</code>, <code>values</code>,
+              <code>logicalType</code>, <code>size</code>,

Review comment:
       Do we need to define the order for user/custom properties?  I'm tempted
   to say keep the 7 attributes from Parsing Canonical Form in their order,
   followed by all other kept attributes in alphabetical order...
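   
   A minimal sketch of that ordering, assuming a plain string comparator is enough. The class and method names here are hypothetical and are not part of `SchemaNormalization`:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class CanonicalAttributeOrder {
  // The seven attributes of Parsing Canonical Form, in their specified order.
  private static final List<String> PCF_ORDER =
      Arrays.asList("name", "type", "fields", "symbols", "items", "values", "size");

  // Hypothetical comparator: Parsing Canonical Form attributes come first
  // (in that form's order), every other kept attribute follows alphabetically.
  static final Comparator<String> ORDER = (a, b) -> {
    int ia = PCF_ORDER.indexOf(a);
    int ib = PCF_ORDER.indexOf(b);
    if (ia >= 0 && ib >= 0) return Integer.compare(ia, ib);
    if (ia >= 0) return -1;
    if (ib >= 0) return 1;
    return a.compareTo(b);
  };

  // Returns a sorted copy so the caller's list is left untouched.
  static List<String> sorted(List<String> attributes) {
    List<String> copy = new ArrayList<>(attributes);
    copy.sort(ORDER);
    return copy;
  }

  public static void main(String[] args) {
    // prints [name, type, size, aliases, default, doc, logicalType]
    System.out.println(sorted(
        Arrays.asList("doc", "size", "type", "aliases", "name", "logicalType", "default")));
  }
}
```

   With this rule, custom properties need no separate ordering list: they simply sort alongside `doc`, `aliases`, and the rest.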

##########
File path: doc/src/content/xdocs/spec.xml
##########
@@ -1310,6 +1310,92 @@
         </ul>
       </section>
 
+      <section>
+        <title>Standard Canonical Form for Schemas</title>

Review comment:
       Note: I proposed in the JIRA that we **don't** create a new
   specification for this form, and instead treat getting the "plain" schema
   as an SDK tool issue.
   
   If we go that way, you can ignore the comments on this file :D

##########
File path: doc/src/content/xdocs/spec.xml
##########
@@ -1310,6 +1310,92 @@
         </ul>
       </section>
 
+      <section>
+        <title>Standard Canonical Form for Schemas</title>
+
+        <p>One of defined way to normalize the avro schema using
+          <em>Standard Canonical Form Transformation</em>. This involves
+          stripping unwanted properties and maintain same canonical
+          ordering. The canonical ordering involves ordering avro
+          reserved properties followed by custom properties if mentioned while
+          transforming. Normalization schema which helps to reduce the
+          total memory size of schema (removed unwanted properties and whitespace)
+          while transfer avro schema between two system and also reduce the parsing
+          time for compatibility check and schema evolution.
+        </p>
+
+        <p><em>Standard Canonical Form</em> is a transformation of a schema
+          into standard canonical ordered. It contains only avro reserved
+          properties <code>"name", "type", "fields", "symbols", "items", "values",
+            "logicalType", "size", "order", "doc", "aliases", "default"</code>
+          and <em>other (custom properties)</em> schema properties.
+        </p>
+
+        <section>
+          <title>Transforming into Standard Canonical Form</title>
+
+          <p>Assuming an input schema (in JSON form) that's already
+            UTF-8 text for a <em>valid</em> Avro schema (including all
+            quotes as required by JSON), the following transformations
+            will produce its Standard Canonical Form:</p>
+          <ul>
+            <li> [PRIMITIVES] Convert primitive schemas to their simple
+              form (e.g., <code>int</code> instead of
+              <code>{"type":"int"}</code>).</li>
+
+            <li> [FULLNAMES] Replace short names with fullnames, using
+              applicable namespaces to do so.  Then eliminate
+              <code>namespace</code> attributes, which are now redundant.</li>
+
+            <li> [STRIP] Keep only attributes that are relevant to
+              reserved properties, which are:
+              <code>type</code>, <code>name</code>,
+              <code>fields</code>, <code>symbols</code>,
+              <code>items</code>, <code>values</code>,
+              <code>logicalType</code>, <code>size</code>,
+              <code>order</code>, <code>doc</code>
+              <code>aliases</code> and <code>default</code>.
+              Strip all others user defined properties (e.g., <code>format</code>).</li>
+
+            <li> [ORDER] Order the appearance of fields of JSON objects
+              as follows: <code>name</code>, <code>type</code>,
+              <code>fields</code>, <code>symbols</code>,
+              <code>items</code>, <code>values</code>,
+              <code>logicalType</code>, <code>size</code>,

Review comment:
       Can you switch these to keep the initial attributes the same as for 
Parsing Canonical Form: name, type, fields, symbols, items, values, size ?
   

##########
File path: doc/src/content/xdocs/spec.xml
##########
@@ -1310,6 +1310,92 @@
         </ul>
       </section>
 
+      <section>
+        <title>Standard Canonical Form for Schemas</title>
+
+        <p>One of defined way to normalize the avro schema using
+          <em>Standard Canonical Form Transformation</em>. This involves
+          stripping unwanted properties and maintain same canonical
+          ordering. The canonical ordering involves ordering avro
+          reserved properties followed by custom properties if mentioned while
+          transforming. Normalization schema which helps to reduce the
+          total memory size of schema (removed unwanted properties and whitespace)
+          while transfer avro schema between two system and also reduce the parsing
+          time for compatibility check and schema evolution.
+        </p>
+
+        <p><em>Standard Canonical Form</em> is a transformation of a schema
+          into standard canonical ordered. It contains only avro reserved
+          properties <code>"name", "type", "fields", "symbols", "items", "values",
+            "logicalType", "size", "order", "doc", "aliases", "default"</code>

Review comment:
       It's probably worth mentioning that any properties that are used to
   configure a logical type should also be kept ("scale", "precision", and the
   properties of user-defined logical types).
   
   As a consequence, when generating a canonical form that includes a
   user-defined LogicalType, all languages should have the same defined
   attributes for that logical type.

##########
File path: lang/java/avro/src/main/java/org/apache/avro/SchemaNormalization.java
##########
@@ -160,14 +200,73 @@ private static Appendable build(Map<String, String> env, Schema s, Appendable o)
           else
             firstTime = false;
           o.append("{\"name\":\"").append(f.name()).append("\"");
-          build(env, f.schema(), o.append(",\"type\":")).append("}");
+          build(env, f.schema(), o.append(",\"type\":"), ps, aps);
+          if (!ps)
+            setFieldProps(o, f, aps); // if standard canonical form then add reserved properties
+          o.append("}");
         }
         o.append("]");
       }
+      if (!ps) {
+        setComplexProps(o, s);
+        setSimpleProps(o, s.getObjectProps(), aps);
+      } // adding the reserved property if not parser canonical schema
       return o.append("}");
     }
   }
 
+  private static Appendable writeLogicalType(Schema s, LogicalType lt, Appendable o, LinkedHashSet<String> aps)
+      throws IOException {
+    o.append("{\"type\":\"").append(s.getType().getName()).append("\"");
+    // adding the logical property
+    setLogicalProps(o, lt);
+    // adding the reserved property
+    setSimpleProps(o, s.getObjectProps(), aps);
+    return o.append("}");
+  }
+
+  private static void setLogicalProps(Appendable o, LogicalType lt) throws IOException {
+    o.append(",\"").append(LogicalType.LOGICAL_TYPE_PROP).append("\":\"").append(lt.getName()).append("\"");

Review comment:
       I would expect user-defined logical type properties to work here!  (But 
that could be left as a known issue as well...)

##########
File path: doc/src/content/xdocs/spec.xml
##########
@@ -1310,6 +1310,92 @@
         </ul>
       </section>
 
+      <section>
+        <title>Standard Canonical Form for Schemas</title>
+
+        <p>One of defined way to normalize the avro schema using
+          <em>Standard Canonical Form Transformation</em>. This involves
+          stripping unwanted properties and maintain same canonical
+          ordering. The canonical ordering involves ordering avro
+          reserved properties followed by custom properties if mentioned while
+          transforming. Normalization schema which helps to reduce the
+          total memory size of schema (removed unwanted properties and whitespace)
+          while transfer avro schema between two system and also reduce the parsing
+          time for compatibility check and schema evolution.
+        </p>
+
+        <p><em>Standard Canonical Form</em> is a transformation of a schema
+          into standard canonical ordered. It contains only avro reserved
+          properties <code>"name", "type", "fields", "symbols", "items", "values",
+            "logicalType", "size", "order", "doc", "aliases", "default"</code>
+          and <em>other (custom properties)</em> schema properties.
+        </p>
+
+        <section>
+          <title>Transforming into Standard Canonical Form</title>
+
+          <p>Assuming an input schema (in JSON form) that's already
+            UTF-8 text for a <em>valid</em> Avro schema (including all
+            quotes as required by JSON), the following transformations
+            will produce its Standard Canonical Form:</p>
+          <ul>
+            <li> [PRIMITIVES] Convert primitive schemas to their simple
+              form (e.g., <code>int</code> instead of
+              <code>{"type":"int"}</code>).</li>
+
+            <li> [FULLNAMES] Replace short names with fullnames, using
+              applicable namespaces to do so.  Then eliminate
+              <code>namespace</code> attributes, which are now redundant.</li>
+
+            <li> [STRIP] Keep only attributes that are relevant to
+              reserved properties, which are:
+              <code>type</code>, <code>name</code>,
+              <code>fields</code>, <code>symbols</code>,
+              <code>items</code>, <code>values</code>,
+              <code>logicalType</code>, <code>size</code>,
+              <code>order</code>, <code>doc</code>
+              <code>aliases</code> and <code>default</code>.
+              Strip all others user defined properties (e.g., <code>format</code>).</li>
+
+            <li> [ORDER] Order the appearance of fields of JSON objects
+              as follows: <code>name</code>, <code>type</code>,
+              <code>fields</code>, <code>symbols</code>,
+              <code>items</code>, <code>values</code>,
+              <code>logicalType</code>, <code>size</code>,
+              <code>order</code>, <code>doc</code>,
+              <code>aliases</code>, <code>default</code>.
+              For example, if an object has <code>type</code>,
+              <code>name</code>, and <code>size</code> fields, then the
+              <code>name</code> field should appear first, followed by the
+              <code>type</code> and then the <code>size</code> fields.</li>
+
+            <li> [STRINGS] For all JSON string literals in the schema
+              text, replace any escaped characters (e.g., \uXXXX escapes)
+              with their UTF-8 equivalents.</li>
+
+            <li> [INTEGERS] Eliminate quotes around and any leading
+              zeros in front of JSON integer literals (which appear in the
+              <code>size</code> attributes of <code>fixed</code> schemas).</li>

Review comment:
       Integer literals also appear in default attributes now.  We also have
   floating point numbers that should be normalized.
   
   JSON arrays should be fine, but JSON objects in defaults should probably
   have their fields ordered alphabetically.
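   
   A rough sketch of what that could look like for integer literals and for object-valued defaults. It is written by hand rather than with a JSON library, the names are hypothetical, and it assumes a flat object whose keys need no escaping and whose values are already canonical JSON text. Floating point normalization is left out because the rule for it is still open:

```java
import java.math.BigInteger;
import java.util.Map;
import java.util.TreeMap;

final class DefaultValueNormalization {
  // Strips surrounding quotes (if any) and leading zeros from an integer literal.
  static String normalizeInteger(String literal) {
    String s = literal.trim();
    if (s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"")) {
      s = s.substring(1, s.length() - 1);
    }
    // BigInteger parsing drops leading zeros and handles values beyond long range.
    return new BigInteger(s).toString();
  }

  // Writes a flat JSON object with its keys in String natural (alphabetical) order.
  // Assumption: keys contain no characters that need escaping, and each value is
  // already valid, canonical JSON text.
  static String orderedObject(Map<String, String> fields) {
    StringBuilder o = new StringBuilder("{");
    boolean first = true;
    for (Map.Entry<String, String> e : new TreeMap<>(fields).entrySet()) {
      if (!first) {
        o.append(',');
      }
      first = false;
      o.append('"').append(e.getKey()).append("\":").append(e.getValue());
    }
    return o.append('}').toString();
  }
}
```

   For example, `normalizeInteger("\"0016\"")` yields `16`, and an object default given as `b` then `a` is written with `a` first.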

##########
File path: doc/src/content/xdocs/spec.xml
##########
@@ -1310,6 +1310,92 @@
         </ul>
       </section>
 
+      <section>
+        <title>Standard Canonical Form for Schemas</title>
+
+        <p>One of defined way to normalize the avro schema using
+          <em>Standard Canonical Form Transformation</em>. This involves
+          stripping unwanted properties and maintain same canonical
+          ordering. The canonical ordering involves ordering avro
+          reserved properties followed by custom properties if mentioned while
+          transforming. Normalization schema which helps to reduce the
+          total memory size of schema (removed unwanted properties and whitespace)
+          while transfer avro schema between two system and also reduce the parsing
+          time for compatibility check and schema evolution.
+        </p>
+
+        <p><em>Standard Canonical Form</em> is a transformation of a schema
+          into standard canonical ordered. It contains only avro reserved
+          properties <code>"name", "type", "fields", "symbols", "items", "values",
+            "logicalType", "size", "order", "doc", "aliases", "default"</code>
+          and <em>other (custom properties)</em> schema properties.
+        </p>
+
+        <section>
+          <title>Transforming into Standard Canonical Form</title>
+
+          <p>Assuming an input schema (in JSON form) that's already
+            UTF-8 text for a <em>valid</em> Avro schema (including all
+            quotes as required by JSON), the following transformations
+            will produce its Standard Canonical Form:</p>
+          <ul>
+            <li> [PRIMITIVES] Convert primitive schemas to their simple
+              form (e.g., <code>int</code> instead of
+              <code>{"type":"int"}</code>).</li>
+
+            <li> [FULLNAMES] Replace short names with fullnames, using
+              applicable namespaces to do so.  Then eliminate
+              <code>namespace</code> attributes, which are now redundant.</li>
+
+            <li> [STRIP] Keep only attributes that are relevant to
+              reserved properties, which are:
+              <code>type</code>, <code>name</code>,

Review comment:
       It's odd that the order specified here is different from the one below.
   That's also the case for the Parsing Canonical Form documentation, though...
   
   This raises a good question: size is _sometimes_ relevant and _sometimes_
   ignored.  In the Java SDK, for example, we can add it as an attribute to a
   field but not to an enum, and it's stripped from the field despite being a
   "kept" attribute.  Is this a bug with canonical form in general?  What do we
   want to happen with the Standard Canonical Form?

##########
File path: lang/java/avro/src/main/java/org/apache/avro/SchemaNormalization.java
##########
@@ -17,17 +17,22 @@
  */
 package org.apache.avro;
 
+import org.apache.avro.util.internal.JacksonUtils;

Review comment:
       This class is currently Jackson-independent... I think the JSON is being 
written manually here to ensure that it's isolated from any specific JSON 
implementations, and the trend of the last few versions has been to reduce the 
dependency on Jackson in particular.
   
   Is it possible to avoid introducing this class?
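   
   For illustration only, a string property can be appended by hand without any JSON library. This is a sketch with hypothetical helper names, not the existing code in this class:

```java
import java.io.IOException;

final class ManualJsonWriter {
  // Appends s as a quoted JSON string, escaping only what JSON requires:
  // the quote, the backslash, and control characters below U+0020.
  static Appendable appendString(Appendable o, String s) throws IOException {
    o.append('"');
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      switch (c) {
        case '"':
          o.append("\\\"");
          break;
        case '\\':
          o.append("\\\\");
          break;
        case '\n':
          o.append("\\n");
          break;
        case '\r':
          o.append("\\r");
          break;
        case '\t':
          o.append("\\t");
          break;
        default:
          if (c < 0x20) {
            o.append(String.format("\\u%04x", (int) c));
          } else {
            o.append(c);
          }
      }
    }
    return o.append('"');
  }

  // Appends ,"key":"value" to an object that already has at least one member.
  static Appendable appendStringProp(Appendable o, String key, String value) throws IOException {
    o.append(',');
    appendString(o, key).append(':');
    return appendString(o, value);
  }
}
```

   Anything that takes an `Appendable` (a `StringBuilder`, a `Writer`) can be the target, which keeps the class isolated from any particular JSON implementation.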

##########
File path: doc/src/content/xdocs/spec.xml
##########
@@ -1310,6 +1310,92 @@
         </ul>
       </section>
 
+      <section>
+        <title>Standard Canonical Form for Schemas</title>
+
+        <p>One of defined way to normalize the avro schema using
+          <em>Standard Canonical Form Transformation</em>. This involves
+          stripping unwanted properties and maintain same canonical
+          ordering. The canonical ordering involves ordering avro
+          reserved properties followed by custom properties if mentioned while
+          transforming. Normalization schema which helps to reduce the
+          total memory size of schema (removed unwanted properties and whitespace)
+          while transfer avro schema between two system and also reduce the parsing
+          time for compatibility check and schema evolution.
+        </p>
+
+        <p><em>Standard Canonical Form</em> is a transformation of a schema
+          into standard canonical ordered. It contains only avro reserved
+          properties <code>"name", "type", "fields", "symbols", "items", "values",
+            "logicalType", "size", "order", "doc", "aliases", "default"</code>

Review comment:
       Also, if we're keeping default, we should keep all of the attributes 
that might appear in a JSON object in the default.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
