umehrot2 commented on a change in pull request #2208:
URL: https://github.com/apache/hudi/pull/2208#discussion_r514596118
##########
File path: packaging/hudi-utilities-bundle/pom.xml
##########
@@ -105,6 +106,7 @@
<include>io.prometheus:simpleclient_common</include>
<include>com.yammer.metrics:metrics-core</include>
<include>org.apache.spark:spark-streaming-kafka-0-10_${scala.binary.version}</include>
+
<include>org.apache.spark:spark-token-provider-kafka-0-10_${scala.binary.version}</include>
Review comment:
For my understanding, why is this needed?
##########
File path: LICENSE
##########
@@ -246,6 +246,8 @@ This product includes code from Apache Spark
* org.apache.hudi.AvroConversionHelper copied from classes in
org/apache/spark/sql/avro package
+* org.apache.hudi.HoodieSparkUtils.scala copied from
org.apache.spark.deploy.SparkHadoopUtil.scala
Review comment:
Perhaps we can be more specific and say that we `copied some methods`?
##########
File path: hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
##########
@@ -121,6 +122,9 @@ private[hudi] object HoodieSparkSqlWriter {
// short-circuit if bulk_insert via row is enabled.
// scalastyle:off
if (parameters(ENABLE_ROW_WRITER_OPT_KEY).toBoolean) {
+ if (SPARK_VERSION.startsWith("3.")) {
+ throw new HoodieException("Bulk insert via row is not compatible with Spark 3, it is only compatible with Spark 2!")
+ }
Review comment:
Is this not possible through the delta streamer? It seems not.
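The Spark-version gate in the hunk above could be factored into a small, testable helper. A minimal sketch in plain Java; the class and method names here are hypothetical, not Hudi's actual API:

```java
// Hypothetical helper mirroring the version gate in HoodieSparkSqlWriter.
// Names are illustrative only.
public class RowWriterSupport {

    // Bulk insert via row relies on Spark 2 internal row-writing APIs,
    // so it is gated off when running on Spark 3.
    public static boolean rowWriterSupported(String sparkVersion) {
        return !sparkVersion.startsWith("3.");
    }

    // Fail fast with the same message as the hunk above when the row
    // writer is enabled on an unsupported Spark version.
    public static void validateRowWriter(String sparkVersion, boolean rowWriterEnabled) {
        if (rowWriterEnabled && !rowWriterSupported(sparkVersion)) {
            throw new IllegalArgumentException(
                "Bulk insert via row is not compatible with Spark 3, it is only compatible with Spark 2!");
        }
    }
}
```

Keeping the check in one helper also makes it easy to delete in one place once Spark 3 support for the row writer lands.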
##########
File path: pom.xml
##########
@@ -100,6 +104,7 @@
<prometheus.version>0.8.0</prometheus.version>
<http.version>4.4.1</http.version>
<spark.version>2.4.4</spark.version>
+ <spark2.version>2.4.4</spark2.version>
Review comment:
I would suggest just keeping `spark.version` here, and overriding it in the
`hudi-spark2` and `hudi-spark3` modules respectively.
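The suggested property layout could look like the following (a sketch; the Spark 3 version number is illustrative):

```xml
<!-- parent pom.xml: keep a single spark.version, defaulting to Spark 2 -->
<properties>
  <spark.version>2.4.4</spark.version>
</properties>

<!-- hudi-spark3/pom.xml: override the same property for Spark 3 builds -->
<properties>
  <spark.version>3.0.0</spark.version>
</properties>
```

Maven property inheritance means any dependency declared against `${spark.version}` in shared parents resolves to the right Spark line inside each module, without introducing a second property name.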
##########
File path: hudi-spark2/src/main/java/org/apache/hudi/internal/DefaultSource.java
##########
@@ -18,7 +18,7 @@
package org.apache.hudi.internal;
-import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.DataSourceUtilsForSpark2;
Review comment:
What is the need to move the `hudi datasource` itself to `hudi-spark2`?
I think we should leave it under `hudi-spark`; later, if we want separate
datasource implementations, we can create them under the `hudi-spark2` and
`hudi-spark3` modules. Thoughts?
##########
File path:
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieMergeOnReadTestUtils.java
##########
@@ -85,7 +85,7 @@
.collect(Collectors.toList()));
return inputPaths.stream().map(path -> {
- setInputPath(jobConf, path);
+ FileInputFormat.setInputPaths(jobConf, path);
Review comment:
As discussed internally during the code review, can you confirm whether this
actually converts the paths to point to the local file system rather than
HDFS? It would also be good to explain the reason for this change in the
description, for reference.
##########
File path: hudi-spark/src/main/scala/org/apache/hudi/AvroConversionUtils.scala
##########
@@ -96,4 +98,16 @@ object AvroConversionUtils {
val name = HoodieAvroUtils.sanitizeName(tableName)
(s"${name}_record", s"hoodie.${name}")
}
+
+ private def deserializeRow(encoder: ExpressionEncoder[Row], internalRow: InternalRow): Row = {
+ // TODO remove reflection if Spark 2.x support is dropped
+ if (SPARK_VERSION.startsWith("2.")) {
Review comment:
+1. Let's have two separate implementations of the Row deserializer for
Spark 2 and Spark 3, as was done in
https://github.com/apache/hudi/pull/1760/files
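The "one implementation per Spark version" layout suggested above could be sketched as follows. `FakeInternalRow` and `FakeRow` are stand-ins for Spark's `InternalRow` and `Row` so that only the structure, not the real Spark API, is shown:

```java
import java.util.List;

// Stand-in for Spark's InternalRow (illustrative only).
class FakeInternalRow {
    final List<Object> values;
    FakeInternalRow(List<Object> values) { this.values = values; }
}

// Stand-in for Spark's external Row (illustrative only).
class FakeRow {
    final List<Object> values;
    FakeRow(List<Object> values) { this.values = values; }
}

// Common interface, implemented once in hudi-spark2 and once in hudi-spark3.
interface SparkRowDeserializer {
    FakeRow deserializeRow(FakeInternalRow internalRow);
}

// hudi-spark2 flavor: would call encoder.fromRow(internalRow) on Spark 2.
class Spark2RowDeserializer implements SparkRowDeserializer {
    public FakeRow deserializeRow(FakeInternalRow internalRow) {
        return new FakeRow(internalRow.values);
    }
}

// hudi-spark3 flavor: would call encoder.createDeserializer().apply(internalRow)
// on Spark 3. The correct class is bound at build time by whichever module is
// on the classpath, so no runtime reflection on SPARK_VERSION is needed.
class Spark3RowDeserializer implements SparkRowDeserializer {
    public FakeRow deserializeRow(FakeInternalRow internalRow) {
        return new FakeRow(internalRow.values);
    }
}
```

The build-time binding is the point: each bundle ships exactly one implementation, and the shared code depends only on the interface.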
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]