chenhao-db commented on code in PR #45479:
URL: https://github.com/apache/spark/pull/45479#discussion_r1525473094
##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/VariantVal.java:
##########
@@ -107,4 +107,17 @@ public String toString() {
// NOTE: the encoding is not yet implemented, this is not the final implementation.
return new String(value);
}
+
+ /**
+ * Compare two variants in bytes. The variant equality is more complex than it, and we haven't
+ * supported it in the user surface yet. This method is only intended for tests.
+ */
+ @Override
+ public boolean equals(Object other) {
Review Comment:
Done.
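For readers following the thread, here is a minimal sketch of what a byte-wise `equals` of this kind could look like. `BytewiseVariant` is a hypothetical stand-in, not the actual `VariantVal`; the `value` and `metadata` field names are assumed from the README's description of a variant as two binary values. As the Javadoc above says, this compares raw bytes only and is not semantic variant equality.

```java
import java.util.Arrays;

// Hypothetical stand-in for VariantVal, for illustration only.
// The two fields mirror the "value" and "metadata" binaries from the README.
final class BytewiseVariant {
  private final byte[] value;
  private final byte[] metadata;

  BytewiseVariant(byte[] value, byte[] metadata) {
    this.value = value;
    this.metadata = metadata;
  }

  // Byte-for-byte comparison. Two variants that are logically equal but
  // encoded differently would compare as unequal here.
  @Override
  public boolean equals(Object other) {
    if (this == other) return true;
    if (!(other instanceof BytewiseVariant)) return false;
    BytewiseVariant o = (BytewiseVariant) other;
    return Arrays.equals(value, o.value) && Arrays.equals(metadata, o.metadata);
  }

  // Kept consistent with equals so instances behave correctly in hash-based collections.
  @Override
  public int hashCode() {
    return 31 * Arrays.hashCode(value) + Arrays.hashCode(metadata);
  }
}
```

Overriding `hashCode` alongside `equals` is the usual Java contract; whether the PR does so is not shown in the quoted hunk.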
##########
common/variant/src/main/java/org/apache/spark/variant/Variant.java:
##########
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.variant;
+
+/**
+ * This class is structurally equivalent to {@link org.apache.spark.unsafe.types.VariantVal}. We
+ * define a new class to avoid depending on or modifying Spark.
Review Comment:
I think it is better to have separate classes. It gives Spark full control
over its internal in-memory format and gives the variant library more
flexibility. I will later introduce a starting offset in `Variant`, but not in
`VariantVal`.
##########
common/variant/README.md:
##########
@@ -0,0 +1,127 @@
+# Overview
+
+A Variant represents a type that contain one of:
+- Primitive: A type and corresponding value (e.g. INT, STRING)
+- Array: An ordered list of Variant values
+- Object: An unordered collection of string/Variant pairs (i.e. key/value pairs). An object may not contain duplicate keys.
+
+A variant is encoded with 2 binary values, the value and the metadata.
+
+There are a fixed number of allowed primitive types, provided in the table below. These represent a commonly supported subset of the [logical types](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md) allowed by the Parquet.
+
+The Variant spec allows representation of semi-structured data (e.g. JSON) in a form that can be efficiently queried by path. The design is intended to allow efficient access to nested data even in the presence of very wide or deep structures.
+
+Another motivation for the representation is that (aside from metadata) each inner Variant value is contiguous and self-contained. For example, in a Variant containing an Array of Variant values, the representation of an inner Variant value, when paired with the metadata of the full variant, is itself a valid Variant.
+
+# Metadata encoding
+
+The grammar for encoded metadata is as follows
+
+```
+metadata: <header> <dictionary_size> <dictionary>
+header: 1 byte (<version> | <sorted_strings> << 4 | (<offset_size_minus_one> << 6))
+version: a 4-bit version ID. Currently, must always contain the value 1
Review Comment:
The version is expected to change very rarely. Even if we ever need more than
the 16 values that 4 bits can hold, we can reserve a special version value to
indicate that another byte follows and further describes the version.
This works because different versions are allowed to have different binary
formats.
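To make the header layout under discussion concrete, here is a minimal sketch that packs and unpacks the single header byte according to the grammar quoted above (`<version> | <sorted_strings> << 4 | (<offset_size_minus_one> << 6)`). The class and method names are hypothetical and not part of the PR. It assumes `sorted_strings` is a single bit and `offset_size_minus_one` occupies the top two bits; the quoted excerpt only gives the shift amounts, so those widths are an assumption.

```java
// Hypothetical helper, not part of the PR: illustrates the quoted header grammar.
final class MetadataHeader {
  // The low 4 bits hold the version (the README says it must currently be 1).
  static final int VERSION_MASK = 0x0F;

  static int version(byte header) {
    return header & VERSION_MASK;
  }

  // Assumption: sorted_strings is a 1-bit flag at bit 4.
  static boolean sortedStrings(byte header) {
    return ((header >> 4) & 0x1) != 0;
  }

  // Assumption: offset_size_minus_one is a 2-bit field at bits 6-7,
  // so the offset size ranges from 1 to 4 bytes.
  static int offsetSize(byte header) {
    return ((header >> 6) & 0x3) + 1;
  }

  static byte encode(int version, boolean sortedStrings, int offsetSize) {
    if (version < 0 || version > VERSION_MASK) {
      throw new IllegalArgumentException("version must fit in 4 bits");
    }
    if (offsetSize < 1 || offsetSize > 4) {
      throw new IllegalArgumentException("offsetSize must be between 1 and 4");
    }
    int sortedBit = sortedStrings ? 1 : 0;
    return (byte) (version | (sortedBit << 4) | ((offsetSize - 1) << 6));
  }
}
```

With this layout, the escape mechanism described in the comment would amount to reserving one of the 16 version values and reading an extra byte when it appears; that extension is not sketched here because its format has not been defined.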
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]