crutis opened a new issue, #6281:
URL: https://github.com/apache/hudi/issues/6281
My job is just a wrapper around `HoodieDeltaStreamer` (yes, there are
probably better ways to do this).
```
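import org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer;

public class SparkHudiPoc {
    // Thin wrapper: forwards all CLI arguments unchanged to HoodieDeltaStreamer.
    public static void main(String[] args) throws Exception {
        HoodieDeltaStreamer.main(args);
    }
}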
```
From pom.xml:
```
<properties>
  <!-- DEPENDENCY VERSIONS -->
  <hudi.version>0.11.1</hudi.version>
  <scala.version>2.12.10</scala.version>
  <spark.version>3.1.2</spark.version>
  <aws-java-sdk.version>1.12.257</aws-java-sdk.version>
  <hadoop.version>3.2.1</hadoop.version>
  <parquet.version>1.10.0</parquet.version>
</properties>
<dependencies>
  <dependency>
    <groupId>org.apache.hudi</groupId>
    <artifactId>hudi-utilities-bundle_2.12</artifactId>
    <version>${hudi.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-avro</artifactId>
    <version>${parquet.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-aws</artifactId>
    <version>${hadoop.version}</version>
    <exclusions>
      <exclusion>
        <groupId>com.amazonaws</groupId>
        <artifactId>aws-java-sdk-bundle</artifactId>
      </exclusion>
    </exclusions>
  </dependency>
  <dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-java-sdk-s3</artifactId>
    <version>${aws-java-sdk.version}</version>
  </dependency>
  <dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-java-sdk-sts</artifactId>
    <version>${aws-java-sdk.version}</version>
  </dependency>
</dependencies>
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```
```
spark-submit \
  --master yarn \
  --deploy-mode client \
  s3://path-to/my-fat-jar.jar \
  --enable-sync \
  --disable-compaction \
  --sync-tool-classes org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool \
  --min-sync-interval-seconds 60 \
  --op UPSERT \
  --payload-class org.apache.hudi.common.model.debezium.MySqlDebeziumAvroPayload \
  --source-class org.apache.hudi.utilities.sources.debezium.MysqlDebeziumSource \
  --source-ordering-field _event_origin_ts_ms \
  --table-type MERGE_ON_READ \
  --target-base-path s3://my-bucket/path/table_name \
  --target-table table_name \
  --continuous \
  --hoodie-conf auto.offset.reset=earliest \
  --hoodie-conf bootstrap.servers=kafka-server:9092 \
  --hoodie-conf group.id=spark-hudi-poc \
  --hoodie-conf schema.registry.url=http://registry:8081 \
  --hoodie-conf hoodie.deltastreamer.schemaprovider.registry.url=http://registry:8081/subjects/CDC-value/versions/latest \
  --hoodie-conf hoodie.deltastreamer.source.kafka.topic=CDC \
  --hoodie-conf hoodie.datasource.hive_sync.database=spark-hudi-poc \
  --hoodie-conf hoodie.datasource.hive_sync.skip_ro_suffix=true \
  --hoodie-conf hoodie.datasource.hive_sync.table=table_name \
  --hoodie-conf hoodie.datasource.write.recordkey.field=id \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=createdDate \
  --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator \
  --hoodie-conf hoodie.deltastreamer.keygen.timebased.timezone=GMT \
  --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd \
  --hoodie-conf hoodie.deltastreamer.keygen.timebased.timestamp.type=EPOCHMILLISECONDS
```
**Environment Description**
* Hudi version : 0.11.1 (fat jar)
* EMR 6.5.0
* Spark version : 3.1.2
* Hive version : 3.1.2
* Hadoop version : Amazon 3.2.1
* Storage : S3
* Running on Docker? (yes/no) : no
**Stacktrace**
```
22/08/02 16:46:49 ERROR HoodieDeltaStreamer: Shutting down delta-sync due to exception
org.apache.hudi.exception.HoodieException: Could not sync using the meta sync class org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
    at org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:61)
    at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncMeta(DeltaSync.java:715)
    at org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:634)
    at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:333)
    at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:679)
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.hudi.exception.HoodieException: Got runtime exception when hive syncing mtrees_usertrees
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:143)
    at org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:59)
    ... 8 more
Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync partitions for table mtrees_usertrees
    at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:414)
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:232)
    at org.apache.hudi.hive.HiveSyncTool.doSync(HiveSyncTool.java:156)
    at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:140)
    ... 9 more
Caused by: org.apache.hudi.aws.sync.HoodieGlueSyncException: Fail to add partitions to spark-hudi-poc.mtrees_usertrees
    at org.apache.hudi.aws.sync.AWSGlueCatalogSyncClient.addPartitionsToTable(AWSGlueCatalogSyncClient.java:147)
    at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:397)
    ... 12 more
Caused by: com.amazonaws.services.glue.model.InvalidInputException: The number of partition keys do not match the number of partition values (Service: AWSGlue; Status Code: 400; Error Code: InvalidInputException; Request ID: 00f4d354-50a0-4b98-bce4-bab5569339c8; Proxy: null)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1862)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1415)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1384)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1154)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:811)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:779)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:753)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:713)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:695)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:559)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:539)
    at com.amazonaws.services.glue.AWSGlueClient.doInvoke(AWSGlueClient.java:10640)
    at com.amazonaws.services.glue.AWSGlueClient.invoke(AWSGlueClient.java:10607)
    at com.amazonaws.services.glue.AWSGlueClient.invoke(AWSGlueClient.java:10596)
    at com.amazonaws.services.glue.AWSGlueClient.executeBatchCreatePartition(AWSGlueClient.java:259)
    at com.amazonaws.services.glue.AWSGlueClient.batchCreatePartition(AWSGlueClient.java:228)
    at org.apache.hudi.aws.sync.AWSGlueCatalogSyncClient.addPartitionsToTable(AWSGlueCatalogSyncClient.java:139)
    ... 13 more
```
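For reference, here is a small standalone sketch (plain JDK, not Hudi code) of what the keygen settings in the command imply. It assumes the `yyyy/MM/dd` output format is applied in GMT to an epoch-millis value, and it only compares the number of path segments with the single configured partition field, `createdDate`. The class name `PartitionCountCheck` and the sample timestamp are made up for illustration. Whether this mismatch is what Glue is rejecting is an assumption, not a confirmed diagnosis; how Hudi actually derives partition values during sync is not shown in this report.

```java
import java.text.SimpleDateFormat;
import java.util.Arrays;
import java.util.Date;
import java.util.List;
import java.util.TimeZone;

public class PartitionCountCheck {
    // Mimics the keygen settings from the command (GMT, yyyy/MM/dd).
    // This is an illustration only; it does not call any Hudi class.
    static String partitionPath(long epochMillis) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy/MM/dd");
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        return fmt.format(new Date(epochMillis));
    }

    public static void main(String[] args) {
        // The single partition field passed via
        // hoodie.datasource.write.partitionpath.field
        List<String> partitionKeys = Arrays.asList("createdDate");

        // Hypothetical sample timestamp in epoch milliseconds.
        String path = partitionPath(1659458809000L);
        List<String> segments = Arrays.asList(path.split("/"));

        System.out.println("partition path:  " + path);
        System.out.println("path segments:   " + segments.size());
        System.out.println("partition keys:  " + partitionKeys.size());
    }
}
```

Note that the command does not set `hoodie.datasource.hive_sync.partition_fields` or `hoodie.datasource.hive_sync.partition_extractor_class`, so the sync runs with whatever the defaults are in 0.11.1.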