codope opened a new pull request, #6327:
URL: https://github.com/apache/hudi/pull/6327
### Change Logs
- Shade `metrics-core` in `hudi-aws-bundle`
- Remove duplicate includes in other bundles
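For reference, "shading" here means relocating the dependency's packages with the Maven shade plugin so that all bundles agree on the class names they ship. A minimal sketch of the kind of `<relocation>` entry involved (the exact pattern and shaded prefix in the bundle poms may differ):

```xml
<!-- Illustrative only: relocate metrics-core classes under the bundle's
     shaded namespace (the prefix shown is an assumption; see the bundle pom). -->
<relocation>
  <pattern>com.codahale.metrics.</pattern>
  <shadedPattern>org.apache.hudi.com.codahale.metrics.</shadedPattern>
</relocation>
```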
### Impact
Without this change, if Hudi metrics are enabled with the CloudWatch reporter type, write client initialization fails (stacktrace in HUDI-4568). The root cause is that `metrics-core` is shaded (relocated) in `hudi-spark-bundle` but not in `hudi-aws-bundle`, while `hudi-aws` depends on it, so the two bundles disagree on the class names at runtime and the write fails with `NoSuchMethodError`.
**Risk level: high**
Verified by running the script below (with metrics and Hive sync turned on). Without this fix, the write fails with `NoSuchMethodError`:
```shell
./bin/pyspark \
  --jars /home/hadoop/hudi-spark3.2-bundle_2.12-0.13.0-SNAPSHOT.jar,/home/hadoop/hudi-aws-bundle-0.13.0-SNAPSHOT.jar \
  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog" \
  --conf "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension"
```

Then, in the PySpark shell:

```python
sc.setLogLevel("WARN")
dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()
inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(
    dataGen.generateInserts(10)
)
from pyspark.sql.functions import expr
df = spark.read.json(spark.sparkContext.parallelize(inserts, 10)).withColumn(
    "part", expr("'foo'")
)
tableName = "test_hudi_pyspark2"
basePath = f"/tmp/{tableName}"
hudi_options = {
    "hoodie.table.name": tableName,
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.partitionpath.field": "part",
    "hoodie.datasource.write.table.name": tableName,
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.upsert.shuffle.parallelism": 2,
    "hoodie.insert.shuffle.parallelism": 2,
    "hoodie.datasource.hive_sync.database": "default",
    "hoodie.datasource.hive_sync.table": tableName,
    "hoodie.datasource.hive_sync.mode": "hms",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.partition_fields": "part",
    "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    "hoodie.metrics.on": "true",
    "hoodie.metrics.reporter.type": "CLOUDWATCH",
}
df.write.format("hudi").options(**hudi_options).mode("overwrite").save(basePath)
```
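Independently of Spark, one quick way to check whether a bundle jar relocates `metrics-core` is to scan its entries (a jar is a zip) for the original versus relocated class paths. A minimal sketch; the relocated prefix is an assumption, and the toy jar built here only mimics what the shade plugin produces:

```python
import io
import zipfile


def relocation_status(jar_bytes, original_prefix, relocated_prefix):
    """Return (has_original, has_relocated) for class entries in a jar."""
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as jar:
        names = jar.namelist()
    has_original = any(n.startswith(original_prefix) for n in names)
    has_relocated = any(n.startswith(relocated_prefix) for n in names)
    return has_original, has_relocated


# Build a toy "bundle" jar containing only a relocated copy of a class,
# standing in for a real hudi-aws-bundle jar on disk.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as jar:
    jar.writestr("org/apache/hudi/com/codahale/metrics/MetricRegistry.class", b"")

print(relocation_status(
    buf.getvalue(),
    "com/codahale/metrics/",
    "org/apache/hudi/com/codahale/metrics/",
))
# → (False, True): the class exists only under the relocated package
```

Running the same check over a real bundle (read the jar file's bytes instead of the toy zip) shows whether both bundles agree on where `metrics-core` classes live.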
### Contributor's checklist
- [ ] Read through [contributor's
guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
--
This is an automated message from the Apache Git Service.