GitSpree23 opened a new issue #3346: URL: https://github.com/apache/hudi/issues/3346
**Describe the problem you faced**

I'm getting a `Bad Request` response when running a Spark job that pushes data from a remote Kafka source to an S3 bucket using DeltaStreamer. I'm mostly following the latest Docker Setup documentation & S3 Filesystem doc on https://hudi.apache.org/, with the few changes mentioned below.

**To Reproduce**

Steps to reproduce the behaviour (after following the initial steps on https://hudi.apache.org/docs/docker_demo.html):

1. Comment out the Kafka & Zookeeper containers in `docker-compose_hadoop284_hive233_spark244.yml`.
2. Change `schema.avsc` to match the incoming JSON.
3. Modify `kafka-source.properties` to:
   - add `group.id=hudi_test_group`
   - change the bootstrap server to the remote IP, and update the topic, schemas, and key & partition fields.
4. Replace `stock_ticks` with `test1` in most files (except the ones containing the Hive queries), such as `compaction*`, `hive-table-check.commands`, `sparksql-bootstrap-prep-source.commands`, `get_min_commit_time_cow.sh`.
5. Run `./setup_demo.sh`.
6. Create an S3 bucket with the same name as `target-table` and verify `aws s3 ls s3a://test` inside the adhoc-2 container.
7. Run `docker exec -it adhoc-2 /bin/bash` and submit the following spark-submit command inside the container:

```
spark-submit \
  --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.hadoop.fs.s3a.endpoint=s3.ap-south-1.amazonaws.com \
  --conf spark.hadoop.fs.s3a.access.key='A.A' \
  --conf spark.hadoop.fs.s3a.secret.key='W.5' \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE \
  --table-type COPY_ON_WRITE \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --source-ordering-field cloud.account.id \
  --target-base-path s3a://test/test1_cow \
  --target-table test1_cow \
  --props /var/demo/config/kafka-source.properties \
  --hoodie-conf hoodie.datasource.write.recordkey.field=cloud.account.id \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=cloud.account.id \
  --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
```

**Expected behavior**

A clear and concise description of what you expected to happen.

**Environment Description**

* Hudi version : 0.8.0
* Spark version : 2.4.4
* Hive version : 2.3.3
* Hadoop version : 2.8.4
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker?
(yes/no) : yes

**Stacktrace**

```
21/07/26 12:58:53 INFO SparkContext: Added JAR file:///root/.ivy2/jars/org.hamcrest_hamcrest-core-1.3.jar at spark://adhoc-2:38511/jars/org.hamcrest_hamcrest-core-1.3.jar with timestamp 1627304333052
21/07/26 12:58:53 INFO SparkContext: Added JAR file:///root/.ivy2/jars/com.fasterxml.jackson.core_jackson-core-2.2.3.jar at spark://adhoc-2:38511/jars/com.fasterxml.jackson.core_jackson-core-2.2.3.jar with timestamp 1627304333052
21/07/26 12:58:53 INFO SparkContext: Added JAR file:/var/hoodie/ws/docker/hoodie/hadoop/hive_base/target/hoodie-utilities.jar at spark://adhoc-2:38511/jars/hoodie-utilities.jar with timestamp 1627304333052
21/07/26 12:58:53 INFO Executor: Starting executor ID driver on host localhost
21/07/26 12:58:53 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 34463.
21/07/26 12:58:53 INFO NettyBlockTransferService: Server created on adhoc-2:34463
21/07/26 12:58:53 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
21/07/26 12:58:53 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, adhoc-2, 34463, None)
21/07/26 12:58:53 INFO BlockManagerMasterEndpoint: Registering block manager adhoc-2:34463 with 366.3 MB RAM, BlockManagerId(driver, adhoc-2, 34463, None)
21/07/26 12:58:53 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, adhoc-2, 34463, None)
21/07/26 12:58:53 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, adhoc-2, 34463, None)
21/07/26 12:58:55 INFO SparkUI: Stopped Spark web UI at http://adhoc-2:4040
21/07/26 12:58:55 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
21/07/26 12:58:55 INFO MemoryStore: MemoryStore cleared
21/07/26 12:58:55 INFO BlockManager: BlockManager stopped
21/07/26 12:58:55 INFO BlockManagerMaster: BlockManagerMaster stopped
21/07/26 12:58:55 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
21/07/26 12:58:55 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: ZVR44DYE, AWS Error Code: null, AWS Error Message: Bad Request, S3 Extended Request ID: GLMmbnyWTXS8=
	at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
	at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
	at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
	at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
	at org.apache.hudi.common.fs.FSUtils.getFs(FSUtils.java:97)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:106)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:501)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
21/07/26 12:58:55 INFO ShutdownHookManager: Shutdown hook called
21/07/26 12:58:55 INFO ShutdownHookManager: Deleting directory /tmp/spark-15bb32a0-6e2a-41d9-ad10-c90d84a92e4b
21/07/26 12:58:55 INFO ShutdownHookManager: Deleting directory /tmp/spark-c2357491-ac77-4394-8ed4-d6bce0c7cabd
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
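For anyone reproducing step 3 of the report above, a sketch of what the edited `kafka-source.properties` might look like. The broker host, topic name, and schema-file paths below are illustrative placeholders, not values taken from this issue; the `hoodie.deltastreamer.*` keys are the ones `JsonKafkaSource` and `FilebasedSchemaProvider` read in the Docker demo's config:

```properties
# Kafka consumer settings (step 3: group.id added, bootstrap server
# pointed at the remote broker -- host is a placeholder)
bootstrap.servers=<remote-kafka-host>:9092
group.id=hudi_test_group

# Topic consumed by JsonKafkaSource (placeholder name)
hoodie.deltastreamer.source.kafka.topic=test1

# Avro schemas for FilebasedSchemaProvider (paths are placeholders)
hoodie.deltastreamer.schemaprovider.source.schema.file=/var/demo/config/schema.avsc
hoodie.deltastreamer.schemaprovider.target.schema.file=/var/demo/config/schema.avsc
```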
