umehrot2 commented on a change in pull request #2208:
URL: https://github.com/apache/hudi/pull/2208#discussion_r514596118
##########
File path: packaging/hudi-utilities-bundle/pom.xml
##########
@@ -105,6 +106,7 @@
<include>io.prometheus:simpleclient_common</include>
<include>com.yammer.metrics:metrics-core</include>
<include>org.apache.spark:spark-streaming-kafka-0-10_${scala.binary.version}</include>
+
<include>org.apache.spark:spark-token-provider-kafka-0-10_${scala.binary.version}</include>
Review comment:
For my understanding, why is this needed?
##########
File path: LICENSE
##########
@@ -246,6 +246,8 @@ This product includes code from Apache Spark
* org.apache.hudi.AvroConversionHelper copied from classes in
org/apache/spark/sql/avro package
+* org.apache.hudi.HoodieSparkUtils.scala copied from
org.apache.spark.deploy.SparkHadoopUtil.scala
Review comment:
Perhaps we can be more specific and say that we `copied some methods`?
##########
File path: hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
##########
@@ -121,6 +122,9 @@ private[hudi] object HoodieSparkSqlWriter {
// short-circuit if bulk_insert via row is enabled.
// scalastyle:off
if (parameters(ENABLE_ROW_WRITER_OPT_KEY).toBoolean) {
+ if (SPARK_VERSION.startsWith("3.")) {
+ throw new HoodieException("Bulk insert via row is not compatible with Spark 3, it is only compatible with Spark 2!")
+ }
Review comment:
Is this not possible through the delta streamer? It seems not.
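The Spark-version gate in the hunk above could be factored into a small, testable helper. A minimal sketch in plain Java; the class and method names here are hypothetical, not Hudi's actual API:

```java
// Hypothetical helper mirroring the version gate in HoodieSparkSqlWriter.
// Names are illustrative only.
public class RowWriterSupport {

    // Bulk insert via row relies on Spark 2 internal row-writing APIs,
    // so it is gated off when running on Spark 3.
    public static boolean rowWriterSupported(String sparkVersion) {
        return !sparkVersion.startsWith("3.");
    }

    // Fail fast with the same message as the hunk above when the row
    // writer is enabled on an unsupported Spark version.
    public static void validateRowWriter(String sparkVersion, boolean rowWriterEnabled) {
        if (rowWriterEnabled && !rowWriterSupported(sparkVersion)) {
            throw new IllegalArgumentException(
                "Bulk insert via row is not compatible with Spark 3, it is only compatible with Spark 2!");
        }
    }
}
```

Keeping the check in one helper also makes it easy to delete in one place once Spark 3 support for the row writer lands.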
##########
File path: pom.xml
##########
@@ -100,6 +104,7 @@
<prometheus.version>0.8.0</prometheus.version>
<http.version>4.4.1</http.version>
<spark.version>2.4.4</spark.version>
+ <spark2.version>2.4.4</spark2.version>
Review comment:
I would suggest just keeping `spark.version` here, and overriding it in the
`hudi-spark2` and `hudi-spark3` modules respectively.
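The suggested property layout could look like the following (a sketch; the Spark 3 version number is illustrative):

```xml
<!-- parent pom.xml: keep a single spark.version, defaulting to Spark 2 -->
<properties>
  <spark.version>2.4.4</spark.version>
</properties>

<!-- hudi-spark3/pom.xml: override the same property for Spark 3 builds -->
<properties>
  <spark.version>3.0.0</spark.version>
</properties>
```

Maven property inheritance means any dependency declared against `${spark.version}` in shared parents resolves to the right Spark line inside each module, without introducing a second property name.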
##########
File path: hudi-spark2/src/main/java/org/apache/hudi/internal/DefaultSource.java
##########
@@ -18,7 +18,7 @@
package org.apache.hudi.internal;
-import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.DataSourceUtilsForSpark2;
Review comment:
What is the need to move the `hudi datasource` itself to `hudi-spark2`?
I think we should leave it under `hudi-spark`; later, if we want separate
datasource implementations, we can create them under the `hudi-spark2` and
`hudi-spark3` modules. Thoughts?
##########
File path:
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieMergeOnReadTestUtils.java
##########
@@ -85,7 +85,7 @@
.collect(Collectors.toList()));
return inputPaths.stream().map(path -> {
- setInputPath(jobConf, path);
+ FileInputFormat.setInputPaths(jobConf, path);
Review comment:
As discussed internally during the code review, can you confirm whether this
actually converts the paths to point to the local file system rather than
HDFS? It would also be good to explain the reason for this change in the
description, for reference.
##########
File path: hudi-spark/src/main/scala/org/apache/hudi/AvroConversionUtils.scala
##########
@@ -96,4 +98,16 @@ object AvroConversionUtils {
val name = HoodieAvroUtils.sanitizeName(tableName)
(s"${name}_record", s"hoodie.${name}")
}
+
+ private def deserializeRow(encoder: ExpressionEncoder[Row], internalRow: InternalRow): Row = {
+ // TODO remove reflection if Spark 2.x support is dropped
+ if (SPARK_VERSION.startsWith("2.")) {
Review comment:
+1. Let's have two separate implementations of the Row deserializer for
Spark 2 and Spark 3, as was done in
https://github.com/apache/hudi/pull/1760/files
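The "one implementation per Spark version" layout suggested above could be sketched as follows. `FakeInternalRow` and `FakeRow` are stand-ins for Spark's `InternalRow` and `Row` so that only the structure, not the real Spark API, is shown:

```java
import java.util.List;

// Stand-in for Spark's InternalRow (illustrative only).
class FakeInternalRow {
    final List<Object> values;
    FakeInternalRow(List<Object> values) { this.values = values; }
}

// Stand-in for Spark's external Row (illustrative only).
class FakeRow {
    final List<Object> values;
    FakeRow(List<Object> values) { this.values = values; }
}

// Common interface, implemented once in hudi-spark2 and once in hudi-spark3.
interface SparkRowDeserializer {
    FakeRow deserializeRow(FakeInternalRow internalRow);
}

// hudi-spark2 flavor: would call encoder.fromRow(internalRow) on Spark 2.
class Spark2RowDeserializer implements SparkRowDeserializer {
    public FakeRow deserializeRow(FakeInternalRow internalRow) {
        return new FakeRow(internalRow.values);
    }
}

// hudi-spark3 flavor: would call encoder.createDeserializer().apply(internalRow)
// on Spark 3. The correct class is bound at build time by whichever module is
// on the classpath, so no runtime reflection on SPARK_VERSION is needed.
class Spark3RowDeserializer implements SparkRowDeserializer {
    public FakeRow deserializeRow(FakeInternalRow internalRow) {
        return new FakeRow(internalRow.values);
    }
}
```

The build-time binding is the point: each bundle ships exactly one implementation, and the shared code depends only on the interface.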
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]