RyanSkraba commented on a change in pull request #805:
URL: https://github.com/apache/avro/pull/805#discussion_r442039489
##########
File path: doc/src/content/xdocs/spec.xml
##########
@@ -1310,6 +1310,92 @@
</ul>
</section>
+ <section>
+ <title>Standard Canonical Form for Schemas</title>
+
+ <p>One of defined way to normalize the avro schema using
+ <em>Standard Canonical Form Transformation</em>. This involves
+ stripping unwanted properties and maintain same canonical
+ ordering. The canonical ordering involves ordering avro
+ reserved properties followed by custom properties if mentioned while
+ transforming. Normalization schema which helps to reduce the
+ total memory size of schema (removed unwanted properties and whitespace)
+ while transfer avro schema between two system and also reduce the parsing
+ time for compatibility check and schema evolution.
+ </p>
+
+ <p><em>Standard Canonical Form</em> is a transformation of a schema
+ into standard canonical ordered. It contains only avro reserved
+ properties <code>"name", "type", "fields", "symbols", "items", "values",
+ "logicalType", "size", "order", "doc", "aliases", "default"</code>
+ and <em>other (custom properties)</em> schema properties.
+ </p>
+
+ <section>
+ <title>Transforming into Standard Canonical Form</title>
+
+ <p>Assuming an input schema (in JSON form) that's already
+ UTF-8 text for a <em>valid</em> Avro schema (including all
+ quotes as required by JSON), the following transformations
+ will produce its Standard Canonical Form:</p>
+ <ul>
+ <li> [PRIMITIVES] Convert primitive schemas to their simple
+ form (e.g., <code>int</code> instead of
+ <code>{"type":"int"}</code>).</li>
+
+ <li> [FULLNAMES] Replace short names with fullnames, using
+ applicable namespaces to do so. Then eliminate
+ <code>namespace</code> attributes, which are now redundant.</li>
+
+ <li> [STRIP] Keep only attributes that are relevant to
+ reserved properties, which are:
+ <code>type</code>, <code>name</code>,
+ <code>fields</code>, <code>symbols</code>,
+ <code>items</code>, <code>values</code>,
+ <code>logicalType</code>, <code>size</code>,
+ <code>order</code>, <code>doc</code>
+ <code>aliases</code> and <code>default</code>.
+ Strip all others user defined properties (e.g., <code>format</code>).</li>
+
+ <li> [ORDER] Order the appearance of fields of JSON objects
+ as follows: <code>name</code>, <code>type</code>,
+ <code>fields</code>, <code>symbols</code>,
+ <code>items</code>, <code>values</code>,
+ <code>logicalType</code>, <code>size</code>,
Review comment:
Do we need to define the order for user/custom properties? I'm tempted
to say keep the 7 attributes from Parsing Canonical Form in their order,
followed by all other kept attributes in alphabetical order...
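A minimal sketch of that ordering rule, assuming plain `String` comparison is an acceptable definition of "alphabetical" (it compares UTF-16 code units, so uppercase sorts before lowercase). `AttributeOrder` is a hypothetical helper for illustration, not an existing Avro class:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration; not part of the Avro API.
final class AttributeOrder {
  // The seven attributes kept by Parsing Canonical Form, in their [ORDER] sequence.
  static final List<String> PCF_ORDER =
      Arrays.asList("name", "type", "fields", "symbols", "items", "values", "size");

  // PCF attributes sort first, by fixed position; every other attribute
  // shares the last rank and is then ordered by natural String order.
  static final Comparator<String> COMPARATOR = Comparator
      .comparingInt((String key) -> {
        int i = PCF_ORDER.indexOf(key);
        return i >= 0 ? i : PCF_ORDER.size();
      })
      .thenComparing(Comparator.<String>naturalOrder());

  // Returns a sorted copy of the given attribute names.
  static List<String> sorted(Collection<String> keys) {
    List<String> out = new ArrayList<>(keys);
    out.sort(COMPARATOR);
    return out;
  }
}
```

With this rule, an object carrying `doc`, `type`, `size`, `aliases`, `name` and `logicalType` would be written as `name`, `type`, `size`, `aliases`, `doc`, `logicalType`.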
##########
File path: doc/src/content/xdocs/spec.xml
##########
@@ -1310,6 +1310,92 @@
</ul>
</section>
+ <section>
+ <title>Standard Canonical Form for Schemas</title>
Review comment:
Note: I proposed in the JIRA that we **don't** create a new
specification for this form, and just consider getting the "plain" schema as an
SDK tool issue.
If we go that way, you can ignore the comments on this file :D
##########
File path: doc/src/content/xdocs/spec.xml
##########
@@ -1310,6 +1310,92 @@
</ul>
</section>
+ <section>
+ <title>Standard Canonical Form for Schemas</title>
+
+ <p>One of defined way to normalize the avro schema using
+ <em>Standard Canonical Form Transformation</em>. This involves
+ stripping unwanted properties and maintain same canonical
+ ordering. The canonical ordering involves ordering avro
+ reserved properties followed by custom properties if mentioned while
+ transforming. Normalization schema which helps to reduce the
+ total memory size of schema (removed unwanted properties and whitespace)
+ while transfer avro schema between two system and also reduce the parsing
+ time for compatibility check and schema evolution.
+ </p>
+
+ <p><em>Standard Canonical Form</em> is a transformation of a schema
+ into standard canonical ordered. It contains only avro reserved
+ properties <code>"name", "type", "fields", "symbols", "items", "values",
+ "logicalType", "size", "order", "doc", "aliases", "default"</code>
+ and <em>other (custom properties)</em> schema properties.
+ </p>
+
+ <section>
+ <title>Transforming into Standard Canonical Form</title>
+
+ <p>Assuming an input schema (in JSON form) that's already
+ UTF-8 text for a <em>valid</em> Avro schema (including all
+ quotes as required by JSON), the following transformations
+ will produce its Standard Canonical Form:</p>
+ <ul>
+ <li> [PRIMITIVES] Convert primitive schemas to their simple
+ form (e.g., <code>int</code> instead of
+ <code>{"type":"int"}</code>).</li>
+
+ <li> [FULLNAMES] Replace short names with fullnames, using
+ applicable namespaces to do so. Then eliminate
+ <code>namespace</code> attributes, which are now redundant.</li>
+
+ <li> [STRIP] Keep only attributes that are relevant to
+ reserved properties, which are:
+ <code>type</code>, <code>name</code>,
+ <code>fields</code>, <code>symbols</code>,
+ <code>items</code>, <code>values</code>,
+ <code>logicalType</code>, <code>size</code>,
+ <code>order</code>, <code>doc</code>
+ <code>aliases</code> and <code>default</code>.
+ Strip all others user defined properties (e.g., <code>format</code>).</li>
+
+ <li> [ORDER] Order the appearance of fields of JSON objects
+ as follows: <code>name</code>, <code>type</code>,
+ <code>fields</code>, <code>symbols</code>,
+ <code>items</code>, <code>values</code>,
+ <code>logicalType</code>, <code>size</code>,
Review comment:
Can you switch these to keep the initial attributes the same as for
Parsing Canonical Form: name, type, fields, symbols, items, values, size?
##########
File path: doc/src/content/xdocs/spec.xml
##########
@@ -1310,6 +1310,92 @@
</ul>
</section>
+ <section>
+ <title>Standard Canonical Form for Schemas</title>
+
+ <p>One of defined way to normalize the avro schema using
+ <em>Standard Canonical Form Transformation</em>. This involves
+ stripping unwanted properties and maintain same canonical
+ ordering. The canonical ordering involves ordering avro
+ reserved properties followed by custom properties if mentioned while
+ transforming. Normalization schema which helps to reduce the
+ total memory size of schema (removed unwanted properties and whitespace)
+ while transfer avro schema between two system and also reduce the parsing
+ time for compatibility check and schema evolution.
+ </p>
+
+ <p><em>Standard Canonical Form</em> is a transformation of a schema
+ into standard canonical ordered. It contains only avro reserved
+ properties <code>"name", "type", "fields", "symbols", "items", "values",
+ "logicalType", "size", "order", "doc", "aliases", "default"</code>
Review comment:
It's probably worth mentioning that any properties that are used to
configure a logical type should also be kept ("scale", "precision", and
the attributes of user-defined logical types).
As a consequence, when generating a canonical form that includes a
user-defined LogicalType, all languages should have the same defined attributes
for that logical type.
##########
File path: lang/java/avro/src/main/java/org/apache/avro/SchemaNormalization.java
##########
@@ -160,14 +200,73 @@ private static Appendable build(Map<String, String> env, Schema s, Appendable o)
else
firstTime = false;
o.append("{\"name\":\"").append(f.name()).append("\"");
- build(env, f.schema(), o.append(",\"type\":")).append("}");
+ build(env, f.schema(), o.append(",\"type\":"), ps, aps);
+ if (!ps)
+ setFieldProps(o, f, aps); // if standard canonical form then add reserved properties
+ o.append("}");
}
o.append("]");
}
+ if (!ps) {
+ setComplexProps(o, s);
+ setSimpleProps(o, s.getObjectProps(), aps);
+ } // adding the reserved property if not parser canonical schema
return o.append("}");
}
}
+ private static Appendable writeLogicalType(Schema s, LogicalType lt, Appendable o, LinkedHashSet<String> aps)
+ throws IOException {
+ o.append("{\"type\":\"").append(s.getType().getName()).append("\"");
+ // adding the logical property
+ setLogicalProps(o, lt);
+ // adding the reserved property
+ setSimpleProps(o, s.getObjectProps(), aps);
+ return o.append("}");
+ }
+
+ private static void setLogicalProps(Appendable o, LogicalType lt) throws IOException {
+ o.append(",\"").append(LogicalType.LOGICAL_TYPE_PROP).append("\":\"").append(lt.getName()).append("\"");
Review comment:
I would expect user-defined logical type properties to work here! (But
that could be left as a known issue as well...)
##########
File path: doc/src/content/xdocs/spec.xml
##########
@@ -1310,6 +1310,92 @@
</ul>
</section>
+ <section>
+ <title>Standard Canonical Form for Schemas</title>
+
+ <p>One of defined way to normalize the avro schema using
+ <em>Standard Canonical Form Transformation</em>. This involves
+ stripping unwanted properties and maintain same canonical
+ ordering. The canonical ordering involves ordering avro
+ reserved properties followed by custom properties if mentioned while
+ transforming. Normalization schema which helps to reduce the
+ total memory size of schema (removed unwanted properties and whitespace)
+ while transfer avro schema between two system and also reduce the parsing
+ time for compatibility check and schema evolution.
+ </p>
+
+ <p><em>Standard Canonical Form</em> is a transformation of a schema
+ into standard canonical ordered. It contains only avro reserved
+ properties <code>"name", "type", "fields", "symbols", "items", "values",
+ "logicalType", "size", "order", "doc", "aliases", "default"</code>
+ and <em>other (custom properties)</em> schema properties.
+ </p>
+
+ <section>
+ <title>Transforming into Standard Canonical Form</title>
+
+ <p>Assuming an input schema (in JSON form) that's already
+ UTF-8 text for a <em>valid</em> Avro schema (including all
+ quotes as required by JSON), the following transformations
+ will produce its Standard Canonical Form:</p>
+ <ul>
+ <li> [PRIMITIVES] Convert primitive schemas to their simple
+ form (e.g., <code>int</code> instead of
+ <code>{"type":"int"}</code>).</li>
+
+ <li> [FULLNAMES] Replace short names with fullnames, using
+ applicable namespaces to do so. Then eliminate
+ <code>namespace</code> attributes, which are now redundant.</li>
+
+ <li> [STRIP] Keep only attributes that are relevant to
+ reserved properties, which are:
+ <code>type</code>, <code>name</code>,
+ <code>fields</code>, <code>symbols</code>,
+ <code>items</code>, <code>values</code>,
+ <code>logicalType</code>, <code>size</code>,
+ <code>order</code>, <code>doc</code>
+ <code>aliases</code> and <code>default</code>.
+ Strip all others user defined properties (e.g., <code>format</code>).</li>
+
+ <li> [ORDER] Order the appearance of fields of JSON objects
+ as follows: <code>name</code>, <code>type</code>,
+ <code>fields</code>, <code>symbols</code>,
+ <code>items</code>, <code>values</code>,
+ <code>logicalType</code>, <code>size</code>,
+ <code>order</code>, <code>doc</code>,
+ <code>aliases</code>, <code>default</code>.
+ For example, if an object has <code>type</code>,
+ <code>name</code>, and <code>size</code> fields, then the
+ <code>name</code> field should appear first, followed by the
+ <code>type</code> and then the <code>size</code> fields.</li>
+
+ <li> [STRINGS] For all JSON string literals in the schema
+ text, replace any escaped characters (e.g., \uXXXX escapes)
+ with their UTF-8 equivalents.</li>
+
+ <li> [INTEGERS] Eliminate quotes around and any leading
+ zeros in front of JSON integer literals (which appear in the
+ <code>size</code> attributes of <code>fixed</code> schemas).</li>
Review comment:
Integer literals also appear in default attributes now. We also have
floating point numbers that should be normalized.
JSON arrays should be fine, but JSON objects in defaults should probably
have their fields ordered alphabetically.
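A rough sketch of what that normalization could look like for a default value that has already been parsed into plain Java objects. It uses only the JDK, so `SchemaNormalization` would not need a JSON library for this step. `DefaultValueWriter` is a hypothetical name, and the number rule (plain decimal notation with trailing zeros stripped) is only one possible choice that the spec would have to pin down:

```java
import java.math.BigDecimal;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of the Avro API.
final class DefaultValueWriter {
  // Writes a parsed JSON value (Map, List, String, Number, Boolean or null)
  // as compact JSON text with object fields in alphabetical order.
  static String write(Object value) {
    StringBuilder sb = new StringBuilder();
    append(sb, value);
    return sb.toString();
  }

  private static void append(StringBuilder sb, Object v) {
    if (v instanceof Map) {
      // Copy into a TreeMap so keys iterate in natural String order.
      TreeMap<String, Object> sorted = new TreeMap<>();
      for (Map.Entry<?, ?> e : ((Map<?, ?>) v).entrySet()) {
        sorted.put(String.valueOf(e.getKey()), e.getValue());
      }
      sb.append('{');
      boolean first = true;
      for (Map.Entry<String, Object> e : sorted.entrySet()) {
        if (!first) sb.append(',');
        first = false;
        appendString(sb, e.getKey());
        sb.append(':');
        append(sb, e.getValue());
      }
      sb.append('}');
    } else if (v instanceof List) {
      // Array element order is significant, so it is preserved.
      sb.append('[');
      boolean first = true;
      for (Object item : (List<?>) v) {
        if (!first) sb.append(',');
        first = false;
        append(sb, item);
      }
      sb.append(']');
    } else if (v instanceof Number) {
      // Assumed rule: plain decimal notation, no exponent, no trailing zeros.
      sb.append(new BigDecimal(v.toString()).stripTrailingZeros().toPlainString());
    } else if (v instanceof String) {
      appendString(sb, (String) v);
    } else {
      sb.append(String.valueOf(v)); // Boolean values and null
    }
  }

  // Escapes only what JSON requires: quotes, backslashes and control characters.
  private static void appendString(StringBuilder sb, String s) {
    sb.append('"');
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c == '"' || c == '\\') {
        sb.append('\\').append(c);
      } else if (c < 0x20) {
        sb.append(String.format("\\u%04x", (int) c));
      } else {
        sb.append(c);
      }
    }
    sb.append('"');
  }
}
```

Note that this rule writes `1.0` as `1`, which is fine for JSON but may not be what we want if the written form should hint at the field's Avro type.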
##########
File path: doc/src/content/xdocs/spec.xml
##########
@@ -1310,6 +1310,92 @@
</ul>
</section>
+ <section>
+ <title>Standard Canonical Form for Schemas</title>
+
+ <p>One of defined way to normalize the avro schema using
+ <em>Standard Canonical Form Transformation</em>. This involves
+ stripping unwanted properties and maintain same canonical
+ ordering. The canonical ordering involves ordering avro
+ reserved properties followed by custom properties if mentioned while
+ transforming. Normalization schema which helps to reduce the
+ total memory size of schema (removed unwanted properties and whitespace)
+ while transfer avro schema between two system and also reduce the parsing
+ time for compatibility check and schema evolution.
+ </p>
+
+ <p><em>Standard Canonical Form</em> is a transformation of a schema
+ into standard canonical ordered. It contains only avro reserved
+ properties <code>"name", "type", "fields", "symbols", "items", "values",
+ "logicalType", "size", "order", "doc", "aliases", "default"</code>
+ and <em>other (custom properties)</em> schema properties.
+ </p>
+
+ <section>
+ <title>Transforming into Standard Canonical Form</title>
+
+ <p>Assuming an input schema (in JSON form) that's already
+ UTF-8 text for a <em>valid</em> Avro schema (including all
+ quotes as required by JSON), the following transformations
+ will produce its Standard Canonical Form:</p>
+ <ul>
+ <li> [PRIMITIVES] Convert primitive schemas to their simple
+ form (e.g., <code>int</code> instead of
+ <code>{"type":"int"}</code>).</li>
+
+ <li> [FULLNAMES] Replace short names with fullnames, using
+ applicable namespaces to do so. Then eliminate
+ <code>namespace</code> attributes, which are now redundant.</li>
+
+ <li> [STRIP] Keep only attributes that are relevant to
+ reserved properties, which are:
+ <code>type</code>, <code>name</code>,
Review comment:
It's odd that the order specified here is different than below. It's
also the case for Parsing Canonical Form documentation though...
A good question -- size is "sometimes" relevant, and _sometimes_ ignored,
for example. In the Java SDK we can add it as an attribute to a field but not
an enum, and it's stripped from the field despite being a "kept" attribute. Is
this a bug with canonical form in general? What do we want to happen with the
Standard Canonical Form?
##########
File path: lang/java/avro/src/main/java/org/apache/avro/SchemaNormalization.java
##########
@@ -17,17 +17,22 @@
*/
package org.apache.avro;
+import org.apache.avro.util.internal.JacksonUtils;
Review comment:
This class is currently Jackson-independent... I think the JSON is being
written manually here to ensure that it's isolated from any specific JSON
implementations, and the trend of the last few versions has been to reduce the
dependency on Jackson in particular.
Is it possible to avoid introducing this class?
##########
File path: doc/src/content/xdocs/spec.xml
##########
@@ -1310,6 +1310,92 @@
</ul>
</section>
+ <section>
+ <title>Standard Canonical Form for Schemas</title>
+
+ <p>One of defined way to normalize the avro schema using
+ <em>Standard Canonical Form Transformation</em>. This involves
+ stripping unwanted properties and maintain same canonical
+ ordering. The canonical ordering involves ordering avro
+ reserved properties followed by custom properties if mentioned while
+ transforming. Normalization schema which helps to reduce the
+ total memory size of schema (removed unwanted properties and whitespace)
+ while transfer avro schema between two system and also reduce the parsing
+ time for compatibility check and schema evolution.
+ </p>
+
+ <p><em>Standard Canonical Form</em> is a transformation of a schema
+ into standard canonical ordered. It contains only avro reserved
+ properties <code>"name", "type", "fields", "symbols", "items", "values",
+ "logicalType", "size", "order", "doc", "aliases", "default"</code>
Review comment:
Also, if we're keeping default, we should keep all of the attributes
that might appear in a JSON object in the default.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]