yihua commented on code in PR #6151:
URL: https://github.com/apache/hudi/pull/6151#discussion_r927012448
##########
azure-pipelines.yml:
##########
@@ -89,10 +90,12 @@ stages:
jobs:
- job: UT_FT_1
displayName: UT FT common & flink & UT client/spark-client
- timeoutInMinutes: '120'
+ timeoutInMinutes: '150'
Review Comment:
Could you revert the unnecessary timeout change?
##########
azure-pipelines.yml:
##########
@@ -89,10 +90,12 @@ stages:
jobs:
- job: UT_FT_1
displayName: UT FT common & flink & UT client/spark-client
- timeoutInMinutes: '120'
+ timeoutInMinutes: '150'
steps:
- task: Maven@3
displayName: maven install
+ continueOnError: true
+ retryCountOnTaskFailure: 1
Review Comment:
Remove this and all similar changes?
##########
docker/compose/docker-compose_hadoop284_hive233_spark313.yml:
##########
@@ -0,0 +1,309 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+version: "3.3"
+
+services:
+
+ namenode:
+ image: rchertara/hudi-hadoop_2.8.4-namenode:image
+ hostname: namenode
+ container_name: namenode
+ environment:
+ - CLUSTER_NAME=hudi_hadoop284_hive232_spark313
+ ports:
+ - "50070:50070"
+ - "8020:8020"
+ env_file:
+ - ./hadoop.env
+ healthcheck:
+ test: ["CMD", "curl", "-f", "http://namenode:50070"]
+ interval: 30s
+ timeout: 10s
+ retries: 3
+
+ datanode1:
+ image: rchertara/hudi-hadoop_2.8.4-datanode:image
+ container_name: datanode1
+ hostname: datanode1
+ environment:
+ - CLUSTER_NAME=hudi_hadoop284_hive232_spark313
+ env_file:
+ - ./hadoop.env
+ ports:
+ - "50075:50075"
+ - "50010:50010"
+ links:
+ - "namenode"
+ - "historyserver"
+ healthcheck:
+ test: ["CMD", "curl", "-f", "http://datanode1:50075"]
+ interval: 30s
+ timeout: 10s
+ retries: 3
+ depends_on:
+ - namenode
+
+ historyserver:
+ image: rchertara/hudi-hadoop_2.8.4-history:image
+ hostname: historyserver
+ container_name: historyserver
+ environment:
+ - CLUSTER_NAME=hudi_hadoop284_hive232_spark313
+ depends_on:
+ - "namenode"
+ links:
+ - "namenode"
+ ports:
+ - "58188:8188"
+ healthcheck:
+ test: ["CMD", "curl", "-f", "http://historyserver:8188"]
+ interval: 30s
+ timeout: 10s
+ retries: 3
+ env_file:
+ - ./hadoop.env
+ volumes:
+ - historyserver:/hadoop/yarn/timeline
+
+ hive-metastore-postgresql:
+ image: bde2020/hive-metastore-postgresql:2.3.0
+ volumes:
+ - hive-metastore-postgresql:/var/lib/postgresql
+ hostname: hive-metastore-postgresql
+ container_name: hive-metastore-postgresql
+
+ hivemetastore:
+ image: rchertara/hudi-hadoop_2.8.4-hive_2.3.3:image
+ hostname: hivemetastore
+ container_name: hivemetastore
+ links:
+ - "hive-metastore-postgresql"
+ - "namenode"
+ env_file:
+ - ./hadoop.env
+ command: /opt/hive/bin/hive --service metastore
+ environment:
+ SERVICE_PRECONDITION: "namenode:50070 hive-metastore-postgresql:5432"
+ ports:
+ - "9083:9083"
+ healthcheck:
+ test: ["CMD", "nc", "-z", "hivemetastore", "9083"]
+ interval: 30s
+ timeout: 10s
+ retries: 3
+ depends_on:
+ - "hive-metastore-postgresql"
+ - "namenode"
+
+ hiveserver:
+ image: rchertara/hudi-hadoop_2.8.4-hive_2.3.3:image
+ hostname: hiveserver
+ container_name: hiveserver
+ env_file:
+ - ./hadoop.env
+ environment:
+ SERVICE_PRECONDITION: "hivemetastore:9083"
+ ports:
+ - "10000:10000"
+ depends_on:
+ - "hivemetastore"
+ links:
+ - "hivemetastore"
+ - "hive-metastore-postgresql"
+ - "namenode"
+ volumes:
+ - ${HUDI_WS}:/var/hoodie/ws
+
+ sparkmaster:
+ image: rchertara/hudi-hadoop_2.8.4-hive_2.3.3-sparkmaster_3.1.3:image
+ hostname: sparkmaster
+ container_name: sparkmaster
+ env_file:
+ - ./hadoop.env
+ ports:
+ - "8080:8080"
+ - "7077:7077"
+ environment:
+ - INIT_DAEMON_STEP=setup_spark
+ links:
+ - "hivemetastore"
+ - "hiveserver"
+ - "hive-metastore-postgresql"
+ - "namenode"
+
+ spark-worker-1:
+ image: rchertara/hudi-hadoop_2.8.4-hive_2.3.3-sparkworker_3.1.3:image
+ hostname: spark-worker-1
+ container_name: spark-worker-1
+ env_file:
+ - ./hadoop.env
+ depends_on:
+ - sparkmaster
+ ports:
+ - "8081:8081"
+ environment:
+ - "SPARK_MASTER=spark://sparkmaster:7077"
+ links:
+ - "hivemetastore"
+ - "hiveserver"
+ - "hive-metastore-postgresql"
+ - "namenode"
+
+ zookeeper:
+ image: 'bitnami/zookeeper:3.4.12-r68'
+ hostname: zookeeper
+ container_name: zookeeper
+ ports:
+ - "2181:2181"
+ environment:
+ - ALLOW_ANONYMOUS_LOGIN=yes
+
+ kafka:
+ image: 'bitnami/kafka:2.0.0'
+ hostname: kafkabroker
+ container_name: kafkabroker
+ ports:
+ - "9092:9092"
+ environment:
+ - KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181
+ - ALLOW_PLAINTEXT_LISTENER=yes
+
+ presto-coordinator-1:
+ container_name: presto-coordinator-1
+ hostname: presto-coordinator-1
+ image: rchertara/hudi-hadoop_2.8.4-prestobase_0.271:image
+ ports:
+ - "8090:8090"
+ environment:
+ - PRESTO_JVM_MAX_HEAP=512M
+ - PRESTO_QUERY_MAX_MEMORY=1GB
+ - PRESTO_QUERY_MAX_MEMORY_PER_NODE=256MB
+ - PRESTO_QUERY_MAX_TOTAL_MEMORY_PER_NODE=384MB
+ - PRESTO_MEMORY_HEAP_HEADROOM_PER_NODE=100MB
+ - TERM=xterm
+ links:
+ - "hivemetastore"
+ volumes:
+ - ${HUDI_WS}:/var/hoodie/ws
+ command: coordinator
+
+ presto-worker-1:
+ container_name: presto-worker-1
+ hostname: presto-worker-1
+ image: rchertara/hudi-hadoop_2.8.4-prestobase_0.271:image
+ depends_on: [ "presto-coordinator-1" ]
+ environment:
+ - PRESTO_JVM_MAX_HEAP=512M
+ - PRESTO_QUERY_MAX_MEMORY=1GB
+ - PRESTO_QUERY_MAX_MEMORY_PER_NODE=256MB
+ - PRESTO_QUERY_MAX_TOTAL_MEMORY_PER_NODE=384MB
+ - PRESTO_MEMORY_HEAP_HEADROOM_PER_NODE=100MB
+ - TERM=xterm
+ links:
+ - "hivemetastore"
+ - "hiveserver"
+ - "hive-metastore-postgresql"
+ - "namenode"
+ volumes:
+ - ${HUDI_WS}:/var/hoodie/ws
+ command: worker
+
+ trino-coordinator-1:
+ container_name: trino-coordinator-1
+ hostname: trino-coordinator-1
+ image: rchertara/hudi-hadoop_2.8.4-trinocoordinator_368:image
+ ports:
+ - "8091:8091"
+ links:
+ - "hivemetastore"
+ volumes:
+ - ${HUDI_WS}:/var/hoodie/ws
+ command: http://trino-coordinator-1:8091 trino-coordinator-1
+
+ trino-worker-1:
+ container_name: trino-worker-1
+ hostname: trino-worker-1
+ image: rchertara/hudi-hadoop_2.8.4-trinoworker_368:image
+ depends_on: [ "trino-coordinator-1" ]
+ ports:
+ - "8092:8092"
+ links:
+ - "hivemetastore"
+ - "hiveserver"
+ - "hive-metastore-postgresql"
+ - "namenode"
+ volumes:
+ - ${HUDI_WS}:/var/hoodie/ws
+ command: http://trino-coordinator-1:8091 trino-worker-1
+
+ graphite:
+ container_name: graphite
+ hostname: graphite
+ image: graphiteapp/graphite-statsd
+ ports:
+ - 80:80
+ - 2003-2004:2003-2004
+ - 8126:8126
+
+ adhoc-1:
+ image: rchertara/hudi-hadoop_2.8.4-hive_2.3.3-sparkadhoc_3.1.3:image
+ hostname: adhoc-1
+ container_name: adhoc-1
+ env_file:
+ - ./hadoop.env
+ depends_on:
+ - sparkmaster
+ ports:
+ - '4040:4040'
+ environment:
+ - "SPARK_MASTER=spark://sparkmaster:7077"
+ links:
+ - "hivemetastore"
+ - "hiveserver"
+ - "hive-metastore-postgresql"
+ - "namenode"
+ - "presto-coordinator-1"
+ - "trino-coordinator-1"
+ volumes:
+ - ${HUDI_WS}:/var/hoodie/ws
+
+ adhoc-2:
+ image: rchertara/hudi-hadoop_2.8.4-hive_2.3.3-sparkadhoc_3.1.3:image
Review Comment:
If the images are finalized, let's upload the images to the apachehudi Docker
account and change the reference here.
##########
docker/demo/config/log4j.properties:
##########
@@ -25,6 +25,8 @@ log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd
HH:mm:ss} %p %c{1}:
# log level for this class is used to overwrite the root logger's log level,
so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=WARN
+# Adjust Hudi internal logging levels
+log4j.logger.org.apache.hudi=DEBUG
Review Comment:
nit: remove this?
##########
hudi-examples/hudi-examples-flink/src/test/java/org/apache/hudi/examples/quickstart/TestHoodieFlinkQuickstart.java:
##########
@@ -34,6 +34,7 @@
/**
* IT cases for Hoodie table source and sink.
*/
+
Review Comment:
nit: revert empty line?
##########
hudi-client/hudi-spark-client/pom.xml:
##########
@@ -174,6 +194,12 @@
<artifactId>awaitility</artifactId>
<scope>test</scope>
</dependency>
+ <dependency>
+ <groupId>com.thoughtworks.paranamer</groupId>
+ <artifactId>paranamer</artifactId>
+ <version>2.8</version>
+ <scope>test</scope>
+ </dependency>
Review Comment:
How is this introduced? Does it have a compatible OSS license?
##########
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieCopyOnWriteTableInputFormat.java:
##########
@@ -76,6 +78,7 @@
* NOTE: This class is invariant of the underlying file-format of the files
being read
*/
public class HoodieCopyOnWriteTableInputFormat extends HoodieTableInputFormat {
+ private static final Logger LOG =
LogManager.getLogger(HoodieCopyOnWriteTableInputFormat.class);
Review Comment:
Is this still needed?
##########
packaging/hudi-spark-bundle/pom.xml:
##########
@@ -95,6 +95,12 @@
<include>org.antlr:stringtemplate</include>
<include>org.apache.parquet:parquet-avro</include>
+
<include>com.fasterxml.jackson.core:jackson-annotations</include>
+ <include>com.fasterxml.jackson.core:jackson-core</include>
+
<include>com.fasterxml.jackson.core:jackson-databind</include>
+
<include>com.fasterxml.jackson.dataformat:jackson-dataformat-yaml</include>
+
<include>com.fasterxml.jackson.module:jackson-module-scala_${scala.binary.version}</include>
Review Comment:
wondering why we add this?
##########
azure-pipelines.yml:
##########
@@ -200,27 +223,22 @@ stages:
mavenOptions: '-Xmx4g'
- job: IT
displayName: IT modules
- timeoutInMinutes: '120'
+ timeoutInMinutes: '180'
steps:
- task: Maven@3
displayName: maven install
+ continueOnError: true
+ retryCountOnTaskFailure: 2
inputs:
mavenPomFile: 'pom.xml'
goals: 'clean install'
options: $(MVN_OPTS_INSTALL) -Pintegration-tests
publishJUnitResults: false
jdkVersionOption: '1.8'
- - task: Maven@3
Review Comment:
Instead of deleting this, could you add a property to disable this task? cc
@xushiyan for help.
##########
azure-pipelines.yml:
##########
@@ -119,10 +126,12 @@ stages:
mavenOptions: '-Xmx4g'
- job: UT_FT_2
displayName: FT client/spark-client
- timeoutInMinutes: '120'
+ timeoutInMinutes: '150'
Review Comment:
similar here and below.
##########
hudi-integ-test/prepare_integration_suite.sh:
##########
@@ -42,7 +42,7 @@ get_spark_command() {
else
scala=$scala
fi
- echo "spark-submit --packages org.apache.spark:spark-avro_${scala}:2.4.4 \
+ echo "spark-submit --packages org.apache.spark:spark-avro_${scala}:3.1.3 \
Review Comment:
`--packages org.apache.spark:spark-avro_${scala}:3.1.3 \` is no longer
needed. We should delete that.
##########
hudi-client/hudi-spark-client/pom.xml:
##########
@@ -48,10 +48,22 @@
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.binary.version}</artifactId>
+ <exclusions>
+ <exclusion>
+ <groupId>org.apache.hadoop</groupId>
+ <artifactId>hadoop-client</artifactId>
+ </exclusion>
+ </exclusions>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.binary.version}</artifactId>
+ <exclusions>
+ <exclusion>
+ <groupId>org.apache.orc</groupId>
+ <artifactId>orc-core</artifactId>
+ </exclusion>
+ </exclusions>
Review Comment:
Will this break ORC support in Spark and Hudi?
##########
hudi-utilities/pom.xml:
##########
@@ -241,6 +245,17 @@
</exclusions>
</dependency>
+ <dependency>
+ <groupId>org.apache.spark</groupId>
+ <artifactId>spark-hive_${scala.binary.version}</artifactId>
+ <exclusions>
+ <exclusion>
+ <groupId>*</groupId>
+ <artifactId>*</artifactId>
+ </exclusion>
+ </exclusions>
+ </dependency>
Review Comment:
No point adding this since all artifacts are excluded?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]