GitSpree23 opened a new issue #3346:
URL: https://github.com/apache/hudi/issues/3346


   **Describe the problem you faced**
   
   I'm getting a `Bad Request` (HTTP 400) response when running a Spark job that 
pushes data from a remote Kafka source to an S3 bucket using DeltaStreamer. I'm 
mostly following the latest Docker Setup & S3 Filesystem documentation on 
https://hudi.apache.org/, with the few changes listed below...
   
   **To Reproduce**
   
   Steps to reproduce the behaviour (after following the initial steps on 
https://hudi.apache.org/docs/docker_demo.html):
   
   1. Comment out the Kafka & Zookeeper containers in 
docker-compose_hadoop284_hive233_spark244.yml.
   2. Change schema.avsc to match the incoming JSON.
   3. Modify kafka-source.properties to:
       - add `group.id=hudi_test_group`
       - change the bootstrap server to the remote IP, and update the topic, 
schemas, key & partition fields.
   4. Replace `stock_ticks` with `test1` in most files (except the ones 
containing the Hive queries), such as compaction*, hive-table-check.commands, 
sparksql-bootstrap-prep-source.commands, and get_min_commit_time_cow.sh.
   5. Run `./setup_demo.sh`.
   6. Create an S3 bucket with the same name as `target-table` & verify `aws s3 ls 
s3a://test` inside the adhoc-2 container.
   7. Run `docker exec -it adhoc-2 /bin/bash` & execute the following spark-submit 
command in the container:
   ```
   spark-submit \
       --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 \
       --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
       --conf spark.hadoop.fs.s3a.endpoint=s3.ap-south-1.amazonaws.com \
       --conf spark.hadoop.fs.s3a.access.key='A.A' \
       --conf spark.hadoop.fs.s3a.secret.key='W.5' \
       --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE \
       --table-type COPY_ON_WRITE \
       --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
       --source-ordering-field cloud.account.id \
       --target-base-path s3a://test/test1_cow \
       --target-table test1_cow \
       --props /var/demo/config/kafka-source.properties \
       --hoodie-conf hoodie.datasource.write.recordkey.field=cloud.account.id \
       --hoodie-conf hoodie.datasource.write.partitionpath.field=cloud.account.id \
       --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
   ```
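   
   One possible cause (an assumption on my part, not something the stacktrace 
confirms): the `ap-south-1` region accepts only AWS Signature Version 4, which 
`aws-java-sdk` 1.7.4 does not use for S3 by default, and that mismatch is a 
known source of 400 Bad Request from S3A. If that is the issue, enabling V4 
signing via JVM system properties on both driver and executors might avoid the 
error. A minimal sketch of the extra flags:
   
   ```
   # Sketch only: enable SigV4 in aws-java-sdk 1.7.4, which otherwise signs
   # S3 requests with the older scheme that V4-only regions reject with 400.
   spark-submit \
       --conf spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4 \
       --conf spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4 \
       --conf spark.hadoop.fs.s3a.endpoint=s3.ap-south-1.amazonaws.com \
       ...   # remaining options unchanged from the command above
   ```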
   
   **Expected behavior**
   
   DeltaStreamer consumes the JSON records from the remote Kafka topic and 
writes them as a Hudi COPY_ON_WRITE table under `s3a://test/test1_cow`.
   
   **Environment Description**
   
   * Hudi version : 0.8.0
   
   * Spark version : 2.4.4
   
   * Hive version : 2.3.3
   
   * Hadoop version : 2.8.4
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : yes
   
   
   **Stacktrace**
   
   ```
   21/07/26 12:58:53 INFO SparkContext: Added JAR 
file:///root/.ivy2/jars/org.hamcrest_hamcrest-core-1.3.jar at 
spark://adhoc-2:38511/jars/org.hamcrest_hamcrest-core-1.3.jar with timestamp 
1627304333052
   21/07/26 12:58:53 INFO SparkContext: Added JAR 
file:///root/.ivy2/jars/com.fasterxml.jackson.core_jackson-core-2.2.3.jar at 
spark://adhoc-2:38511/jars/com.fasterxml.jackson.core_jackson-core-2.2.3.jar 
with timestamp 1627304333052
   21/07/26 12:58:53 INFO SparkContext: Added JAR 
file:/var/hoodie/ws/docker/hoodie/hadoop/hive_base/target/hoodie-utilities.jar 
at spark://adhoc-2:38511/jars/hoodie-utilities.jar with timestamp 1627304333052
   21/07/26 12:58:53 INFO Executor: Starting executor ID driver on host 
localhost
   21/07/26 12:58:53 INFO Utils: Successfully started service 
'org.apache.spark.network.netty.NettyBlockTransferService' on port 34463.
   21/07/26 12:58:53 INFO NettyBlockTransferService: Server created on 
adhoc-2:34463
   21/07/26 12:58:53 INFO BlockManager: Using 
org.apache.spark.storage.RandomBlockReplicationPolicy for block replication 
policy
   21/07/26 12:58:53 INFO BlockManagerMaster: Registering BlockManager 
BlockManagerId(driver, adhoc-2, 34463, None)
   21/07/26 12:58:53 INFO BlockManagerMasterEndpoint: Registering block manager 
adhoc-2:34463 with 366.3 MB RAM, BlockManagerId(driver, adhoc-2, 34463, None)
   21/07/26 12:58:53 INFO BlockManagerMaster: Registered BlockManager 
BlockManagerId(driver, adhoc-2, 34463, None)
   21/07/26 12:58:53 INFO BlockManager: Initialized BlockManager: 
BlockManagerId(driver, adhoc-2, 34463, None)
   21/07/26 12:58:55 INFO SparkUI: Stopped Spark web UI at http://adhoc-2:4040
   21/07/26 12:58:55 INFO MapOutputTrackerMasterEndpoint: 
MapOutputTrackerMasterEndpoint stopped!
   21/07/26 12:58:55 INFO MemoryStore: MemoryStore cleared
   21/07/26 12:58:55 INFO BlockManager: BlockManager stopped
   21/07/26 12:58:55 INFO BlockManagerMaster: BlockManagerMaster stopped
   21/07/26 12:58:55 INFO 
OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
OutputCommitCoordinator stopped!
   21/07/26 12:58:55 INFO SparkContext: Successfully stopped SparkContext
   Exception in thread "main" 
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS 
Service: Amazon S3, AWS Request ID: ZVR44DYE, AWS Error Code: null, AWS Error 
Message: Bad Request, S3 Extended Request ID: GLMmbnyWTXS8=
        at 
com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
        at 
com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
        at 
com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
        at 
com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
        at 
com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
        at 
com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
        at 
org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
        at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
        at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
        at org.apache.hudi.common.fs.FSUtils.getFs(FSUtils.java:97)
        at 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:106)
        at 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:501)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at 
org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
        at 
org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
        at 
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
   21/07/26 12:58:55 INFO ShutdownHookManager: Shutdown hook called
   21/07/26 12:58:55 INFO ShutdownHookManager: Deleting directory 
/tmp/spark-15bb32a0-6e2a-41d9-ad10-c90d84a92e4b
   21/07/26 12:58:55 INFO ShutdownHookManager: Deleting directory 
/tmp/spark-c2357491-ac77-4394-8ed4-d6bce0c7cabd
   ```
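   
   To isolate whether the 400 comes from the S3A/SDK layer rather than Hudi (the 
failing frames are `S3AFileSystem.initialize` -> `doesBucketExist` -> 
`headBucket`, before any Hudi code runs), a simple check is to hit the bucket 
with plain `hadoop fs` inside the same container using the same settings. A 
sketch, with placeholder credentials:
   
   ```
   # If this also returns 400 Bad Request, the problem is in the
   # hadoop-aws/aws-java-sdk configuration, not in DeltaStreamer itself.
   hadoop fs \
       -Dfs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
       -Dfs.s3a.endpoint=s3.ap-south-1.amazonaws.com \
       -Dfs.s3a.access.key='<access-key>' \
       -Dfs.s3a.secret.key='<secret-key>' \
       -ls s3a://test/
   ```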
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
