ghrahul opened a new issue, #11126: URL: https://github.com/apache/hudi/issues/11126
**Problem**

We were running `Spark 3.2.1` with `HUDI 0.11.1` (jar: https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.2-bundle_2.12/0.11.1/hudi-spark3.2-bundle_2.12-0.11.1.jar). We want to upgrade to `Spark 3.4.1` and `HUDI 0.14.0` (jar: https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.4-bundle_2.12/0.14.0/hudi-spark3.4-bundle_2.12-0.14.0.jar).

The Hudi configuration we use in both cases, for a sample table, is:

```python
hudi_config = {
    'className': 'org.apache.hudi',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.write.precombine.field': 'transact_id',
    'hoodie.datasource.write.recordkey.field': 'id,op',
    'hoodie.table.name': 'users_masteruser_spark_test_v9',
    'hoodie.consistency.check.enabled': 'false',
    'hoodie.datasource.hive_sync.table': 'users_masteruser_spark_test_v9',
    'hoodie.datasource.hive_sync.database': 'lake_luna_for_payments',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.mode': 'hms',
    'hoodie.datasource.hive_sync.support_timestamp': 'true',
    'hoodie.datasource.write.reconcile.schema': 'true',
    'path': 's3a://trusted-luna-sbox/luna_for_payments/public/users_masteruser_spark_test_v9',
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
    'hoodie.datasource.write.partitionpath.field': '',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.hive_sync.partition_fields': '',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.upsert.shuffle.parallelism': 40,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS',
    'hoodie.cleaner.commits.retained': 1
}
```

Step 1: Load the table using `Spark 3.2.1` with `HUDI 0.11.1`. This is how all of our pipelines work today.

Step 2: Append data to the same table using `Spark 3.4.1` with `HUDI 0.14.0`.
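Since the config pairs `ComplexKeyGenerator` with `hoodie.datasource.write.recordkey.field = 'id,op'`, each record key is a composite of both fields. A minimal pure-Python sketch of the `field:value` key format that `ComplexKeyGenerator` builds (the helper function and sample row below are illustrative, not Hudi code):

```python
def composite_record_key(row, key_fields):
    """Illustrative only: mimic the 'field1:value1,field2:value2' record-key
    format that Hudi's ComplexKeyGenerator derives from multiple key fields."""
    return ",".join(f"{f}:{row[f]}" for f in key_fields)

# The table's record-key fields, from hoodie.datasource.write.recordkey.field
key_fields = "id,op".split(",")

print(composite_record_key({"id": 42, "op": "U", "transact_id": 7}, key_fields))
# id:42,op:U
```

Records that produce the same composite key are deduplicated on upsert using the precombine field (`transact_id` here).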
Here is where we are getting an **error**:

```
ArrayIndexOutOfBoundsException            Traceback (most recent call last)
Cell In[26], line 1
----> 1 df.write.format("org.apache.hudi").mode('append').options(**hudi_config).save('s3a://trusted-luna-sbox/luna_for_payments/public/users_masteruser_spark_test_v9')

File /opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py:1398, in DataFrameWriter.save(self, path, format, mode, partitionBy, **options)
   1396     self._jwrite.save()
   1397 else:
-> 1398     self._jwrite.save(path)

File /opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1322, in JavaMember.__call__(self, *args)
   1316 command = proto.CALL_COMMAND_NAME +\
   1317     self.command_header +\
   1318     args_command +\
   1319     proto.END_COMMAND_PART
   1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
   1323     answer, self.gateway_client, self.target_id, self.name)
   1325 for temp_arg in temp_args:
   1326     if hasattr(temp_arg, "_detach"):

File /opt/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py:175, in capture_sql_exception.<locals>.deco(*a, **kw)
    171 converted = convert_exception(e.java_exception)
    172 if not isinstance(converted, UnknownException):
    173     # Hide where the exception came from that shows a non-Pythonic
    174     # JVM exception message.
--> 175     raise converted from None
    176 else:
    177     raise

ArrayIndexOutOfBoundsException: Index 0 out of bounds for length 0
```

**Other Info:**

**For Spark 3.4.1**

Base image: `public.ecr.aws/ocean-spark/spark:platform-3.4.1-hadoop-3.3.4-java-11-scala-2.12-python-3.10-gen21`

Jars:

```
RUN wget https://repo1.maven.org/maven2/org/postgresql/postgresql/42.6.0/postgresql-42.6.0.jar && \
    wget https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.4-bundle_2.12/0.14.0/hudi-spark3.4-bundle_2.12-0.14.0.jar && \
    wget https://repo1.maven.org/maven2/com/mysql/mysql-connector-j/8.3.0/mysql-connector-j-8.3.0.jar && \
    mv postgresql-42.6.0.jar /opt/spark/jars && \
    mv hudi-spark3.4-bundle_2.12-0.14.0.jar /opt/spark/jars && \
    mv mysql-connector-j-8.3.0.jar /opt/spark/jars
```

**For Spark 3.2.1**

Base image: `linux/amd64 gcr.io/datamechanics/spark:platform-3.2.1-hadoop-3.3.1-java-11-scala-2.12-python-3.8-dm18`

Jars:

```
RUN wget https://repo1.maven.org/maven2/org/postgresql/postgresql/42.3.7/postgresql-42.3.7.jar && \
    wget https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.2-bundle_2.12/0.11.1/hudi-spark3.2-bundle_2.12-0.11.1.jar && \
    wget https://repo1.maven.org/maven2/org/apache/hive/hcatalog/hive-hcatalog-core/3.1.3/hive-hcatalog-core-3.1.3.jar && \
    wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.30/mysql-connector-java-8.0.30.jar && \
    mv postgresql-42.3.7.jar /opt/spark/jars && \
    mv hudi-spark3.2-bundle_2.12-0.11.1.jar /opt/spark/jars && \
    mv hive-hcatalog-core-3.1.3.jar /opt/spark/jars && \
    mv mysql-connector-java-8.0.30.jar /opt/spark/jars
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
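Because the table was created by Hudi 0.11.1 and is now being appended to with 0.14.0, the table version recorded in `.hoodie/hoodie.properties` at the table base path is worth checking before and after the failed write. A minimal sketch of parsing that file (the parser and the sample content below are illustrative; `hoodie.table.version` is the property Hudi records there):

```python
def read_table_properties(text):
    """Parse the simple key=value lines of a hoodie.properties file,
    skipping blank lines and '#' comments, into a dict."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

# Illustrative hoodie.properties content (values are assumptions, not
# taken from the reporter's table)
sample = """\
#Properties saved on ...
hoodie.table.name=users_masteruser_spark_test_v9
hoodie.table.version=4
hoodie.table.type=COPY_ON_WRITE
"""

props = read_table_properties(sample)
print(props["hoodie.table.version"])
# 4
```

In practice one would fetch the file from the S3 base path given in `hudi_config['path']` rather than a literal string.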
