[GitHub] [hudi] umehrot2 commented on a diff in pull request #5786: [HUDI-2955] Support Hadoop 3.x Hive 3.x and Spark 3.2.x default (rebase)

GitBox Thu, 07 Jul 2022 13:59:43 -0700


umehrot2 commented on code in PR #5786:
URL: https://github.com/apache/hudi/pull/5786#discussion_r916237629



##########
pom.xml:
##########
@@ -1659,6 +1751,7 @@
         <hudi.spark.common.module>hudi-spark3-common</hudi.spark.common.module>
         <scalatest.version>${scalatest.spark3.version}</scalatest.version>
         <kafka.version>${kafka.spark3.version}</kafka.version>
+        <scalatest.version>3.1.0</scalatest.version>

Review Comment:
   This seems duplicate as it already defined above.



##########
pom.xml:
##########
@@ -1464,9 +1508,19 @@
           <plugin>
             <groupId>org.apache.maven.plugins</groupId>
             <artifactId>maven-compiler-plugin</artifactId>
+            <version>${maven-compiler-plugin.version}</version>
             <configuration>
+
               <source>${java.version}</source>
               <target>${java.version}</target>
+              <compilerArgs>
+                <arg>-verbose</arg>

Review Comment:
   I don't think we should check in with `-verbose` to further increases the 
output logs.



##########
pom.xml:
##########
@@ -380,17 +391,24 @@
           <artifactId>maven-jar-plugin</artifactId>
           <version>${maven-jar-plugin.version}</version>
         </plugin>
+        <plugin>
+          <groupId>org.apache.maven.plugins</groupId>
+          <artifactId>maven-dependency-plugin</artifactId>
+          <version>${maven-dependency-plugin.version}</version>
+        </plugin>
         <plugin>
           <groupId>net.alchim31.maven</groupId>
           <artifactId>scala-maven-plugin</artifactId>
           <version>${scala-maven-plugin.version}</version>
           <configuration>
+            <recompileMode>all</recompileMode>

Review Comment:
   Curious to understand the reason for this ?



##########
pom.xml:
##########
@@ -1464,9 +1508,19 @@
           <plugin>
             <groupId>org.apache.maven.plugins</groupId>
             <artifactId>maven-compiler-plugin</artifactId>
+            <version>${maven-compiler-plugin.version}</version>
             <configuration>
+

Review Comment:
   nit: unnecessary line



##########
packaging/hudi-timeline-server-bundle/pom.xml:
##########
@@ -102,6 +102,10 @@
                     <groupId>javax.servlet</groupId>
                     <artifactId>*</artifactId>
                 </exclusion>
+                <exclusion>
+                    <groupId>org.eclipse.jetty</groupId>
+                    <artifactId>*</artifactId>
+                </exclusion>

Review Comment:
   Same here. Why not just exclude this while defining the dependency in the 
root POM instead of individually excluding it at all places ? This will not be 
manageable in the long run.



##########
hudi-utilities/pom.xml:
##########
@@ -215,6 +216,14 @@
           <groupId>javax.servlet</groupId>
           <artifactId>*</artifactId>
         </exclusion>
+        <exclusion>
+          <groupId>org.apache.hadoop</groupId>
+          <artifactId>hadoop-client-api</artifactId>
+        </exclusion>

Review Comment:
   Any dependency that we are excluding in sub modules, please review whether 
we can simply exclude it in the root POM itself. It will avoid these random 
code changes throughout the project.



##########
hudi-utilities/pom.xml:
##########
@@ -39,9 +39,10 @@
       <plugin>
         <groupId>org.apache.maven.plugins</groupId>
         <artifactId>maven-compiler-plugin</artifactId>
+        <version>${maven-compiler-plugin.version}</version>
         <configuration>
-          <source>1.8</source>
-          <target>1.8</target>
+          <source>${java.version}</source>
+          <target>${java.version}</target>
         </configuration>

Review Comment:
   Does this even need to be redefined here ? Its already defined in root pom.



##########
hudi-utilities/pom.xml:
##########
@@ -233,6 +242,17 @@
       </exclusions>
     </dependency>
 
+    <dependency>
+      <groupId>org.apache.spark</groupId>
+      <artifactId>spark-hive_${scala.binary.version}</artifactId>
+      <exclusions>
+        <exclusion>
+          <groupId>*</groupId>
+          <artifactId>*</artifactId>
+        </exclusion>
+      </exclusions>
+    </dependency>

Review Comment:
   Why is this dependency needed here ?



##########
hudi-spark-datasource/hudi-spark3/pom.xml:
##########
@@ -262,7 +261,7 @@
 
     <dependency>
       <groupId>org.apache.hudi</groupId>
-      <artifactId>hudi-spark3-common</artifactId>
+      <artifactId>${hudi.spark.common.module}</artifactId>

Review Comment:
   Should this not be `hudi.spark3.common.module` ? Otherwise it may use the 
wrong module when you compile with `spark2` profile.



##########
pom.xml:
##########
@@ -1669,15 +1762,17 @@
         <fasterxml.jackson.dataformat.yaml.version>${fasterxml.spark3.version}
         </fasterxml.jackson.dataformat.yaml.version>
         <skip.hudi-spark2.unit.tests>true</skip.hudi-spark2.unit.tests>
-        <skipITs>true</skipITs>
+        <skipITs>false</skipITs>
       </properties>
       <modules>
         <module>hudi-spark-datasource/hudi-spark3</module>
         <module>hudi-spark-datasource/hudi-spark3-common</module>
       </modules>
       <activation>
+        <activeByDefault>true</activeByDefault>
         <property>
           <name>spark3</name>
+          <value>!disabled</value>

Review Comment:
   Not sure I understand this. Is this still a valid problem or can this 
comment be marked resolved ?



##########
packaging/hudi-utilities-bundle/pom.xml:
##########
@@ -409,6 +409,12 @@
       <artifactId>hive-service</artifactId>
       <version>${hive.version}</version>
       <scope>${utilities.bundle.hive.scope}</scope>
+      <exclusions>
+        <exclusion>
+          <groupId>org.eclipse.jetty</groupId>
+          <artifactId>*</artifactId>
+        </exclusion>
+      </exclusions>

Review Comment:
   You have already excluded jetty while defining the dependencies in root POM. 
You should not have to do that individually in each and every POM file now.



##########
pom.xml:
##########
@@ -110,19 +113,21 @@
     <log4j.test.version>2.17.2</log4j.test.version>
     <slf4j.version>1.7.30</slf4j.version>
     <joda.version>2.9.9</joda.version>
-    <hadoop.version>2.10.1</hadoop.version>
+    <hadoop.version>3.1.0</hadoop.version>

Review Comment:
   I can't remember if its a conscious decision that we already discussed, but 
whats the reason for not upgrading hadoop to 3.2.1 instead ?



##########
pom.xml:
##########
@@ -78,28 +78,31 @@
 
   <properties>
     <maven-jar-plugin.version>3.2.0</maven-jar-plugin.version>
+    <maven-dependency-plugin.version>3.3.0</maven-dependency-plugin.version>
     <maven-surefire-plugin.version>2.22.2</maven-surefire-plugin.version>
     <maven-failsafe-plugin.version>2.22.2</maven-failsafe-plugin.version>
     <maven-shade-plugin.version>3.2.4</maven-shade-plugin.version>
     <maven-javadoc-plugin.version>3.1.1</maven-javadoc-plugin.version>
-    <maven-compiler-plugin.version>3.8.0</maven-compiler-plugin.version>
+    <maven-compiler-plugin.version>3.8.1</maven-compiler-plugin.version>
     <maven-deploy-plugin.version>2.4</maven-deploy-plugin.version>
     <genjavadoc-plugin.version>0.15</genjavadoc-plugin.version>
     <build-helper-maven-plugin.version>1.7</build-helper-maven-plugin.version>
     <maven-enforcer-plugin.version>3.0.0-M1</maven-enforcer-plugin.version>
     <maven-docker-plugin.version>0.37.0</maven-docker-plugin.version>
 
     <java.version>1.8</java.version>
-    <fasterxml.version>2.6.7</fasterxml.version>
-    
<fasterxml.jackson.databind.version>2.6.7.3</fasterxml.jackson.databind.version>
-    
<fasterxml.jackson.module.scala.version>2.6.7.1</fasterxml.jackson.module.scala.version>
-    
<fasterxml.jackson.dataformat.yaml.version>2.7.4</fasterxml.jackson.dataformat.yaml.version>
+    <fasterxml.version>${fasterxml.spark3.version}</fasterxml.version>
+    
<fasterxml.jackson.databind.version>${fasterxml.spark3.version}</fasterxml.jackson.databind.version>
+    
<fasterxml.jackson.module.scala.version>${fasterxml.spark3.version}</fasterxml.jackson.module.scala.version>
+    
<fasterxml.jackson.dataformat.yaml.version>${fasterxml.spark3.version}</fasterxml.jackson.dataformat.yaml.version>
     <fasterxml.spark3.version>2.10.0</fasterxml.spark3.version>
-    <kafka.version>2.0.0</kafka.version>
-    <kafka.spark3.version>2.4.1</kafka.spark3.version>
+    <kafka.version>${kafka.spark3.version}</kafka.version>
+    <kafka.spark2.version>2.0.0</kafka.spark2.version>
+    <kafka.spark3.version>2.8.0</kafka.spark3.version>
     <pulsar.version>2.8.1</pulsar.version>
     <confluent.version>5.3.4</confluent.version>
     <glassfish.version>2.17</glassfish.version>
+    <parquet.version>1.12.1</parquet.version>

Review Comment:
   This should be 1.12.2 right ? Atleast thats what I see being used on the 
master already for Spark 3.



##########
packaging/hudi-integ-test-bundle/pom.xml:
##########
@@ -482,6 +482,12 @@
       <artifactId>hadoop-hdfs</artifactId>
       <classifier>tests</classifier>
       <scope>test</scope>
+      <exclusions>
+        <exclusion>
+          <groupId>org.eclipse.jetty</groupId>
+          <artifactId>*</artifactId>
+        </exclusion>
+      </exclusions>

Review Comment:
   Okay, so a general high level comment here. Throughout the PR there are 
jetty dependency exclusions added which is creating a lot of unnecessary diff. 
Ideally this should all just be managed through the root POMs 
`dependencyManagement` instead of excluding it everywhere. We should clean up 
these diffs from the PR.



##########
packaging/hudi-hadoop-mr-bundle/pom.xml:
##########
@@ -29,6 +29,8 @@
   <properties>
     <checkstyle.skip>true</checkstyle.skip>
     <main.basedir>${project.parent.basedir}</main.basedir>
+    <!-- override parquet version to be same as Hive 3.1.2 -->
+    <parquet.version>1.10.1</parquet.version>

Review Comment:
   Can we instead add like `hive.parquet.version` in the root POM and use that 
for the parquet dependencies here ? Changing the version in the internal POMs 
like this becomes hard to track. If we atleast have a property defined around 
this, it will be easier to remember in future that we are dealing with multiple 
parquet versions in the project.



##########
hudi-spark-datasource/hudi-spark/pom.xml:
##########
@@ -202,6 +202,12 @@
       <groupId>org.apache.hudi</groupId>
       <artifactId>hudi-common</artifactId>
       <version>${project.version}</version>
+      <exclusions>
+        <exclusion>
+          <groupId>org.apache.hive</groupId>
+          <artifactId>hive-storage-api</artifactId>
+        </exclusion>
+      </exclusions>

Review Comment:
   I don't think you need to do this exclusion. You are already explicitly 
defining this dependency in this POM fixed to a specific version.



##########
hudi-flink-datasource/hudi-flink/pom.xml:
##########
@@ -45,8 +45,8 @@
                 <groupId>org.apache.maven.plugins</groupId>
                 <artifactId>maven-compiler-plugin</artifactId>
                 <configuration>
-                    <source>1.8</source>
-                    <target>1.8</target>
+                    <source>${java.version}</source>
+                    <target>${java.version}</target>

Review Comment:
   Can we avoid these changes ? Just having it in root pom should be sufficient.



##########
hudi-timeline-service/pom.xml:
##########
@@ -117,6 +123,28 @@
       <artifactId>rocksdbjni</artifactId>
     </dependency>
 
+    <!-- Jetty -->
+    <dependency>
+      <groupId>org.eclipse.jetty</groupId>
+      <artifactId>jetty-server</artifactId>
+      <version>${jetty.version}</version>
+    </dependency>

Review Comment:
   How was this module getting the jetty dependencies earlier ? Was it 
happening transitively ?
   
   Also no need to mention `jetty.version` while defining again.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [hudi] umehrot2 commented on a diff in pull request #5786: [HUDI-2955] Support Hadoop 3.x Hive 3.x and Spark 3.2.x default (rebase)

Reply via email to