[GitHub] [hudi] vinothchandar commented on a change in pull request #2645: [HUDI-1659] Basic Implementation Of Spark Sql Support

GitBox Fri, 04 Jun 2021 05:30:32 -0700


vinothchandar commented on a change in pull request #2645:
URL: https://github.com/apache/hudi/pull/2645#discussion_r645514178




##########
File path: hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java
##########
@@ -60,6 +61,8 @@
   public static final String HOODIE_TABLE_VERSION_PROP_NAME = "hoodie.table.version";
   public static final String HOODIE_TABLE_PRECOMBINE_FIELD = "hoodie.table.precombine.field";
   public static final String HOODIE_TABLE_PARTITION_COLUMNS = "hoodie.table.partition.columns";
+  public static final String HOODIE_TABLE_RECORDKEY_FIELDS = "hoodie.table.recordkey.fields";

Review comment:
       Let's use columns or fields consistently? Maybe we should rename 
`hoodie.table.partition.columns` (and the related variables and methods) to 
`hoodie.table.partition.fields`? This change is recent and unreleased, correct? 
If so, we don't have to worry about backwards compatibility. The rename can 
happen in a separate PR. 
   

##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java
##########
@@ -189,8 +189,12 @@ protected boolean isUpdateRecord(HoodieRecord<T> hoodieRecord) {
   private Option<IndexedRecord> getIndexedRecord(HoodieRecord<T> hoodieRecord) {
     Option<Map<String, String>> recordMetadata = hoodieRecord.getData().getMetadata();
     try {
-      Option<IndexedRecord> avroRecord = hoodieRecord.getData().getInsertValue(writerSchema);
+      Option<IndexedRecord> avroRecord = hoodieRecord.getData().getInsertValue(inputSchema,
+          config.getProps());
       if (avroRecord.isPresent()) {
+        if (avroRecord.get().equals(IGNORE_RECORD)) {

Review comment:
       I am still curious how this equals() works with real shuffles, i.e. data 
transferred across machines. IIUC the comparison is delegated to 
Object.equals(), which compares references. Could the result differ across 
JVMs? That is, we get an IGNORE_RECORD off the network after a shuffle, and it 
is a different object from the local JVM's `HoodieWriteHandle.IGNORE_RECORD`. 
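
       A minimal sketch of the concern, assuming the sentinel is a plain 
object that does not override equals(). `Marker`, `IGNORE` and `roundTrip` are 
hypothetical names used only for illustration, and Java serialization stands in 
for whatever serializer the shuffle actually uses:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class SentinelIdentityDemo {

  // Hypothetical stand-in for a sentinel record. It does not override equals(),
  // so comparisons fall back to Object.equals(), i.e. reference identity.
  static final class Marker implements Serializable {
    private static final long serialVersionUID = 1L;
  }

  static final Marker IGNORE = new Marker();

  // Serialize and deserialize a value, roughly what happens to an object
  // that is written on one JVM and read back on another.
  static Object roundTrip(Object value) throws IOException, ClassNotFoundException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(value);
    }
    try (ObjectInputStream in =
             new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
      return in.readObject();
    }
  }

  public static void main(String[] args) throws Exception {
    Object copy = roundTrip(IGNORE);
    System.out.println("same instance: " + (copy == IGNORE));
    System.out.println("equals(): " + copy.equals(IGNORE));
  }
}
```

       Under these assumptions the deserialized copy is a new instance, so 
both checks report false. Whether this applies to IGNORE_RECORD depends on 
whether the record ever crosses a serialization boundary before the comparison.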

##########
File path: hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java
##########
@@ -592,6 +592,8 @@ public static PropertyBuilder withPropertyBuilder() {
 
     private HoodieTableType tableType;
     private String tableName;
+    private String tableSchema;

Review comment:
       Should this be called `tableCreateSchema`?

##########
File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/SerDeUtils.scala
##########
@@ -0,0 +1,55 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi
+
+import java.io.ByteArrayOutputStream
+
+import com.esotericsoftware.kryo.Kryo
+import com.esotericsoftware.kryo.io.{Input, Output}
+import org.apache.spark.SparkConf
+import org.apache.spark.serializer.KryoSerializer
+
+
+object SerDeUtils {

Review comment:
       Are there any unit tests for these?

##########
File path: hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java
##########
@@ -60,6 +61,8 @@
   public static final String HOODIE_TABLE_VERSION_PROP_NAME = "hoodie.table.version";
   public static final String HOODIE_TABLE_PRECOMBINE_FIELD = "hoodie.table.precombine.field";
   public static final String HOODIE_TABLE_PARTITION_COLUMNS = "hoodie.table.partition.columns";
+  public static final String HOODIE_TABLE_RECORDKEY_FIELDS = "hoodie.table.recordkey.fields";
+  public static final String HOODIE_TABLE_CREATE_SCHEMA = "hoodie.table.create.schema";

Review comment:
       Something to double check: we should properly encode the schema string 
for the table here, with escaping if needed.
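
       One way to check this, sketched under the assumption that the value is 
persisted through `java.util.Properties` (the class name and the sample schema 
string below are made up for illustration): store a schema containing quotes, 
newlines, `=`, `#` and `:` and confirm it loads back unchanged.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Properties;

final class SchemaPropertyRoundTrip {

  static final String KEY = "hoodie.table.create.schema";

  // Serialize a single property in the java.util.Properties text format.
  // store() escapes backslashes, newlines and the characters = : # ! itself.
  static String store(String schema) throws IOException {
    Properties props = new Properties();
    props.setProperty(KEY, schema);
    StringWriter out = new StringWriter();
    props.store(out, null);
    return out.toString();
  }

  // Parse the text back and return the schema value.
  static String load(String text) throws IOException {
    Properties props = new Properties();
    props.load(new StringReader(text));
    return props.getProperty(KEY);
  }

  public static void main(String[] args) throws IOException {
    // Illustrative schema string with characters that need escaping.
    String schema = "{\"type\": \"record\",\n \"name\": \"t\", \"doc\": \"a=b # c:d\"}";
    String stored = store(schema);
    System.out.println(stored);
    System.out.println("round trip ok: " + schema.equals(load(stored)));
  }
}
```

       If the properties file is written or read by any other path (for 
example hand-built strings), the escaping would have to be done explicitly.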

##########
File path: hudi-spark-datasource/hudi-spark2/pom.xml
##########
@@ -29,6 +29,7 @@
 
   <properties>
     <main.basedir>${project.parent.parent.basedir}</main.basedir>
+    <scala.version>${scala11.version}</scala.version>

Review comment:
       Why is this necessary? hudi-spark2 can be built with Scala 2.12 as 
well, right?

##########
File path: packaging/hudi-spark-bundle/pom.xml
##########
@@ -66,10 +66,9 @@
                   <include>org.apache.hudi:hudi-common</include>
                   <include>org.apache.hudi:hudi-client-common</include>
                   <include>org.apache.hudi:hudi-spark-client</include>
-                  <include>org.apache.hudi:hudi-spark-common</include>
+                  <include>org.apache.hudi:hudi-spark-common_${scala.binary.version}</include>
                   <include>org.apache.hudi:hudi-spark_${scala.binary.version}</include>
-                  <include>org.apache.hudi:hudi-spark2_${scala.binary.version}</include>
-                  <include>org.apache.hudi:hudi-spark3_2.12</include>
+                  <include>org.apache.hudi:${hudi.spark.module}_${scala.binary.version}</include>

Review comment:
       We had designed the bundle such that hudi-spark2 and hudi-spark3 just 
need to be compiled differently but are both included in a single bundle. That 
works because we load one of them dynamically based on the Spark version. Is 
that not working somehow?
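
       The dynamic loading described above can be sketched as follows. This 
is an illustration only: `Adapter`, `Spark2Adapter`, `Spark3Adapter` and the 
version check are hypothetical stand-ins, not the classes Hudi ships.

```java
final class AdapterLoaderDemo {

  // Hypothetical common interface that both version-specific modules implement.
  interface Adapter {
    String name();
  }

  // Stand-ins for classes compiled in separate modules but bundled together.
  public static final class Spark2Adapter implements Adapter {
    public String name() {
      return "spark2";
    }
  }

  public static final class Spark3Adapter implements Adapter {
    public String name() {
      return "spark3";
    }
  }

  // Pick the implementation class name from the runtime Spark version string.
  static String adapterClassFor(String sparkVersion) {
    return sparkVersion.startsWith("2.")
        ? Spark2Adapter.class.getName()
        : Spark3Adapter.class.getName();
  }

  // Load the chosen class reflectively, so only one implementation is
  // initialized even though both are present on the classpath.
  static Adapter load(String sparkVersion) throws ReflectiveOperationException {
    Class<?> clazz = Class.forName(adapterClassFor(sparkVersion));
    return (Adapter) clazz.getDeclaredConstructor().newInstance();
  }

  public static void main(String[] args) throws Exception {
    System.out.println(load("2.4.4").name());
    System.out.println(load("3.0.1").name());
  }
}
```

       With this pattern the bundle can carry both modules, and the choice is 
deferred to runtime rather than to a build-time property.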

##########
File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/HoodieSqlUtils.scala
##########
@@ -0,0 +1,184 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi
+
+import scala.collection.JavaConverters._
+import java.net.URI
+import java.util.Locale
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hudi.SparkAdapterSupport
+import org.apache.hudi.common.model.HoodieRecord
+import org.apache.spark.SPARK_VERSION
+import org.apache.spark.sql.{Column, DataFrame, SparkSession}
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
+import org.apache.spark.sql.catalyst.catalog.{CatalogTable, CatalogTableType}
+import org.apache.spark.sql.catalyst.expressions.{And, Cast, Expression, Literal}
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, SubqueryAlias}
+import org.apache.spark.sql.execution.datasources.LogicalRelation
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{DataType, NullType, StringType, StructField, StructType}
+
+import scala.collection.immutable.Map
+
+object HoodieSqlUtils extends SparkAdapterSupport {

Review comment:
       Are there any unit tests for these?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
