ghrahul opened a new issue, #11126: URL: https://github.com/apache/hudi/issues/11126
**Problem**

We were running `Spark 3.2.1` with `HUDI 0.11.1` (jar: https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.2-bundle_2.12/0.11.1/hudi-spark3.2-bundle_2.12-0.11.1.jar). We want to upgrade to `Spark 3.4.1` and `HUDI 0.14.0` (jar: https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.4-bundle_2.12/0.14.0/hudi-spark3.4-bundle_2.12-0.14.0.jar).

The Hudi configuration we use in both cases, for a sample table, is:

```python
hudi_config = {
    'className': 'org.apache.hudi',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.write.precombine.field': 'transact_id',
    'hoodie.datasource.write.recordkey.field': 'id,op',
    'hoodie.table.name': 'users_masteruser_spark_test_v9',
    'hoodie.consistency.check.enabled': 'false',
    'hoodie.datasource.hive_sync.table': 'users_masteruser_spark_test_v9',
    'hoodie.datasource.hive_sync.database': 'lake_luna_for_payments',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.mode': 'hms',
    'hoodie.datasource.hive_sync.support_timestamp': 'true',
    'hoodie.datasource.write.reconcile.schema': 'true',
    'path': 's3a://trusted-luna-sbox/luna_for_payments/public/users_masteruser_spark_test_v9',
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
    'hoodie.datasource.write.partitionpath.field': '',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.hive_sync.partition_fields': '',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.upsert.shuffle.parallelism': 40,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS',
    'hoodie.cleaner.commits.retained': 1
}
```

Step 1: Load the table using `Spark 3.2.1` with `HUDI 0.11.1`. This is how all of our pipelines work today.

Step 2: Append data to the same table using `Spark 3.4.1` with `HUDI 0.14.0`.
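Since the config pairs `ComplexKeyGenerator` with `hoodie.datasource.write.recordkey.field = 'id,op'`, each record key is a composite of both fields. A minimal pure-Python sketch of the `field:value` key format that `ComplexKeyGenerator` builds (the helper function and sample row below are illustrative, not Hudi code):

```python
def composite_record_key(row, key_fields):
    """Illustrative only: mimic the 'field1:value1,field2:value2' record-key
    format that Hudi's ComplexKeyGenerator derives from multiple key fields."""
    return ",".join(f"{f}:{row[f]}" for f in key_fields)

# The table's record-key fields, from hoodie.datasource.write.recordkey.field
key_fields = "id,op".split(",")

print(composite_record_key({"id": 42, "op": "U", "transact_id": 7}, key_fields))
# id:42,op:U
```

Records that produce the same composite key are deduplicated on upsert using the precombine field (`transact_id` here).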
Here is where we are getting an **error**:

```
ArrayIndexOutOfBoundsException            Traceback (most recent call last)
Cell In[26], line 1
----> 1 df.write.format("org.apache.hudi").mode('append').options(**hudi_config).save('s3a://trusted-luna-sbox/luna_for_payments/public/users_masteruser_spark_test_v9')

File /opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py:1398, in DataFrameWriter.save(self, path, format, mode, partitionBy, **options)
   1396     self._jwrite.save()
   1397 else:
-> 1398     self._jwrite.save(path)

File /opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1322, in JavaMember.__call__(self, *args)
   1316 command = proto.CALL_COMMAND_NAME +\
   1317     self.command_header +\
   1318     args_command +\
   1319     proto.END_COMMAND_PART
   1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
   1323     answer, self.gateway_client, self.target_id, self.name)
   1325 for temp_arg in temp_args:
   1326     if hasattr(temp_arg, "_detach"):

File /opt/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py:175, in capture_sql_exception.<locals>.deco(*a, **kw)
    171 converted = convert_exception(e.java_exception)
    172 if not isinstance(converted, UnknownException):
    173     # Hide where the exception came from that shows a non-Pythonic
    174     # JVM exception message.
--> 175     raise converted from None
    176 else:
    177     raise

ArrayIndexOutOfBoundsException: Index 0 out of bounds for length 0
```

**Other Info:**

**For Spark 3.4.1**

Base image: `public.ecr.aws/ocean-spark/spark:platform-3.4.1-hadoop-3.3.4-java-11-scala-2.12-python-3.10-gen21`

Jars:

```
RUN wget https://repo1.maven.org/maven2/org/postgresql/postgresql/42.6.0/postgresql-42.6.0.jar && \
    wget https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.4-bundle_2.12/0.14.0/hudi-spark3.4-bundle_2.12-0.14.0.jar && \
    wget https://repo1.maven.org/maven2/com/mysql/mysql-connector-j/8.3.0/mysql-connector-j-8.3.0.jar && \
    mv postgresql-42.6.0.jar /opt/spark/jars && \
    mv hudi-spark3.4-bundle_2.12-0.14.0.jar /opt/spark/jars && \
    mv mysql-connector-j-8.3.0.jar /opt/spark/jars
```

**For Spark 3.2.1**

Base image: `linux/amd64 gcr.io/datamechanics/spark:platform-3.2.1-hadoop-3.3.1-java-11-scala-2.12-python-3.8-dm18`

Jars:

```
RUN wget https://repo1.maven.org/maven2/org/postgresql/postgresql/42.3.7/postgresql-42.3.7.jar && \
    wget https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.2-bundle_2.12/0.11.1/hudi-spark3.2-bundle_2.12-0.11.1.jar && \
    wget https://repo1.maven.org/maven2/org/apache/hive/hcatalog/hive-hcatalog-core/3.1.3/hive-hcatalog-core-3.1.3.jar && \
    wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.30/mysql-connector-java-8.0.30.jar && \
    mv postgresql-42.3.7.jar /opt/spark/jars && \
    mv hudi-spark3.2-bundle_2.12-0.11.1.jar /opt/spark/jars && \
    mv hive-hcatalog-core-3.1.3.jar /opt/spark/jars && \
    mv mysql-connector-java-8.0.30.jar /opt/spark/jars
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
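Because the table was created by Hudi 0.11.1 and is now being appended to with 0.14.0, the table version recorded in `.hoodie/hoodie.properties` at the table base path is worth checking before and after the failed write. A minimal sketch of parsing that file (the parser and the sample content below are illustrative; `hoodie.table.version` is the property Hudi records there):

```python
def read_table_properties(text):
    """Parse the simple key=value lines of a hoodie.properties file,
    skipping blank lines and '#' comments, into a dict."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

# Illustrative hoodie.properties content (values are assumptions, not
# taken from the reporter's table)
sample = """\
#Properties saved on ...
hoodie.table.name=users_masteruser_spark_test_v9
hoodie.table.version=4
hoodie.table.type=COPY_ON_WRITE
"""

props = read_table_properties(sample)
print(props["hoodie.table.version"])
# 4
```

In practice one would fetch the file from the S3 base path given in `hudi_config['path']` rather than a literal string.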
