beliefer commented on code in PR #43707:
URL: https://github.com/apache/spark/pull/43707#discussion_r1389100026


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala:
##########
@@ -808,6 +809,9 @@ object FunctionRegistry {
     expression[LengthOfJsonArray]("json_array_length"),
     expression[JsonObjectKeys]("json_object_keys"),
 
+    // Variant
+    expression[ParseJson]("parse_json"),

Review Comment:
   Could we implement parse_json in another PR?



##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/VariantVal.java:
##########
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.unsafe.types;
+
+import org.apache.spark.unsafe.Platform;
+
+import java.io.Serializable;
+import java.util.Arrays;
+
+/**
+ * The physical data representation of {@link org.apache.spark.sql.types.VariantType} that
+ * represents a semi-structured value. It consists of two binary values: {@link VariantVal#value}
+ * and {@link VariantVal#metadata}. The value encodes types and values, but not field names. The
+ * metadata currently contains a version flag and a list of field names. We can extend/modify the
+ * detailed binary format given the version flag.
+ * <p>
+ * A {@link VariantVal} can be produced by casting another value into the Variant type or parsing a
+ * JSON string in the {@link org.apache.spark.sql.catalyst.expressions.variant.ParseJson}
+ * expression. We can extract a path consisting of field names and array indices from it, cast it
+ * into a concrete data type, or rebuild a JSON string from it.
+ * <p>
+ * The storage layout of this class in {@link org.apache.spark.sql.catalyst.expressions.UnsafeRow}
+ * and {@link org.apache.spark.sql.catalyst.expressions.UnsafeArrayData} is: the fixed-size part is
+ * a long value "offsetAndSize". The upper 32 bits is the offset that points to the start position
+ * of the actual binary content. The lower 32 bits is the total length of the binary content. The
+ * binary content contains: 4 bytes representing the length of {@link VariantVal#value}, content of
+ * {@link VariantVal#value}, content of {@link VariantVal#metadata}. This is an internal and
+ * transient format and can be modified at any time.
+ */
+public class VariantVal implements Serializable {
+  protected final byte[] value;
+  protected final byte[] metadata;
+
+  public VariantVal(byte[] value, byte[] metadata) {
+    this.value = value;
+    this.metadata = metadata;
+  }
+
+  public byte[] getValue() {
+    return value;
+  }
+
+  public byte[] getMetadata() {
+    return metadata;
+  }
+
+  /**
+   * This function writes the binary content into {@code buffer} starting from {@code cursor}, as
+   * described in the class comment. The caller should guarantee there is enough space in `buffer`.
+   */
+   */
+  public void writeIntoUnsafeRow(byte[] buffer, long cursor) {
+    Platform.putInt(buffer, cursor, value.length);
+    Platform.copyMemory(value, Platform.BYTE_ARRAY_OFFSET, buffer, cursor + 4, value.length);
+    Platform.copyMemory(
+        metadata,
+        Platform.BYTE_ARRAY_OFFSET,
+        buffer,
+        cursor + 4 + value.length,
+        metadata.length
+    );
+  }
+
+  /**
+   * This function reads the binary content described in `writeIntoUnsafeRow` from `baseObject`. The
+   * offset is computed by adding the offset in {@code offsetAndSize} and {@code baseOffset}.
+   */
+  public static VariantVal readFromUnsafeRow(long offsetAndSize, Object baseObject,
+                                             long baseOffset) {

Review Comment:
   ```suggestion
     public static VariantVal readFromUnsafeRow(
         long offsetAndSize,
         Object baseObject,
        long baseOffset) {
   ```
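
The offsetAndSize layout described in the class comment can be sketched outside of Spark as a minimal, self-contained illustration. The class name `VariantLayoutSketch` and the little-endian byte order are assumptions for this sketch (Spark's `Platform.putInt` uses the platform's native byte order); this is not the PR's actual implementation:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class VariantLayoutSketch {
  // Binary content: [4-byte value length][value bytes][metadata bytes].
  // Little-endian is assumed here for illustration.
  public static byte[] encode(byte[] value, byte[] metadata) {
    ByteBuffer buf = ByteBuffer.allocate(4 + value.length + metadata.length)
        .order(ByteOrder.LITTLE_ENDIAN);
    buf.putInt(value.length);
    buf.put(value);
    buf.put(metadata);
    return buf.array();
  }

  // Split the binary content back into {value, metadata}.
  public static byte[][] decode(byte[] content) {
    ByteBuffer buf = ByteBuffer.wrap(content).order(ByteOrder.LITTLE_ENDIAN);
    int valueLen = buf.getInt();
    byte[] value = Arrays.copyOfRange(content, 4, 4 + valueLen);
    byte[] metadata = Arrays.copyOfRange(content, 4 + valueLen, content.length);
    return new byte[][] {value, metadata};
  }

  // The fixed-size slot: offset in the upper 32 bits, total length in the lower 32.
  public static long packOffsetAndSize(int offset, int size) {
    return ((long) offset << 32) | (size & 0xFFFFFFFFL);
  }

  public static int offsetOf(long offsetAndSize) {
    return (int) (offsetAndSize >>> 32);
  }

  public static int sizeOf(long offsetAndSize) {
    return (int) offsetAndSize;
  }
}
```

Because the value-length header is stored explicitly, `decode` needs no metadata length: everything after the value bytes is metadata.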



##########
sql/api/src/main/scala/org/apache/spark/sql/types/VariantType.scala:
##########
@@ -0,0 +1,39 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.types
+
+import org.apache.spark.annotation.Stable
+
+/**
+ * The data type representing semi-structured values with arbitrary hierarchical data structures.
+ * At this moment, it is intended to store parsed JSON values and almost any other data types in
+ * the system (e.g., we don't plan to let it store a map with a non-string key type). In the
+ * future, we may also extend it to store other semi-structured data representations like XML.
+ */
+@Stable
+class VariantType private () extends AtomicType {
+  // The default size is used in query planning to drive optimization decisions. 2048 is
+  // arbitrarily picked and we currently don't have any data to support it. This may need
+  // revisiting later.
+  override def defaultSize: Int = 2048

Review Comment:
   How do we get the actual length cheaply?
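
One cheap answer suggested by the VariantVal layout earlier in this PR: once a variant is materialized, its footprint is just the 4-byte length header plus the two byte arrays, so per-value size is O(1) to compute. The helper below is hypothetical (not part of the PR) and only contrasts the actual size with the planning-time `defaultSize` estimate:

```java
public class VariantSizeSketch {
  // Hypothetical helper: under the [4-byte value length][value][metadata] layout,
  // the actual per-value footprint is the header plus both byte arrays.
  // defaultSize (2048) is only a planning-time estimate used before real
  // statistics are available.
  public static int actualSizeInBytes(byte[] value, byte[] metadata) {
    return 4 + value.length + metadata.length;
  }
}
```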



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
