jenu9417 opened a new issue, #7991:
URL: https://github.com/apache/hudi/issues/7991
**Problem**
We are ingesting data from Kafka into an S3 bucket via the
HoodieDeltaStreamer tool, running on EMR v6.9.0 (Hudi v0.12.1). We have
enabled Hive sync of partitions to Glue. Writing data from Kafka,
partitioning, and syncing metadata to Glue all work well.
However, when we analyze the S3 request counts, we see an abnormally high
number of HEAD requests to S3, and we are not sure what triggers them.
When we inspected the S3 access logs, some of the HEAD requests were:
````
HEAD /data/testfolder/.hoodie/20230216054705316.deltacommit HTTP/1.1" 404
NoSuchKey
HEAD
/data/testfolder/.hoodie/metadata/.hoodie/.aux/.bootstrap/.partitions/00000000-0000-0000-0000-000000000000-0_1-0-1_00000000000001.hfile
HTTP/1.1" 404 NoSuchKey
HEAD /data/testfolder/.hoodie/metadata/.hoodie/hoodie.properties HTTP/1.1"
200
````
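The HEAD-to-total ratio mentioned below was obtained by tallying request types from the S3 access logs. A minimal sketch of that tally (the sample lines mirror the excerpts above; real S3 server access log entries carry more fields):

````python
import re
from collections import Counter

# Sample entries in the shape of the excerpts above; in practice these
# would be read from the downloaded S3 server access log files.
log_lines = [
    'HEAD /data/testfolder/.hoodie/20230216054705316.deltacommit HTTP/1.1" 404',
    'HEAD /data/testfolder/.hoodie/metadata/.hoodie/hoodie.properties HTTP/1.1" 200',
    'GET /data/testfolder/.hoodie/hoodie.properties HTTP/1.1" 200',
]

# Pull the HTTP verb off the front of each entry and count per method.
verb = re.compile(r'^(GET|HEAD|PUT|POST|DELETE)\b')
counts = Counter(m.group(1) for line in log_lines if (m := verb.match(line)))
total = sum(counts.values())
for method, n in counts.most_common():
    print(f"{method}: {n} ({100 * n / total:.0f}%)")
````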
Can someone please help us understand what these requests are and why they
are made? Is there a way to optimize/reduce them? For context, roughly two
out of every three API requests we make are HEAD requests.
1) Please help us understand the HEAD requests and how we can reduce them.
We also have a second question:
2) We use `CustomKeyGenerator` to format the partition value via a
timestamp-based conversion. This works when we write directly to S3, but
when we enable Hive sync with the same partition format
(`'datecreated:TIMESTAMP,tenant:SIMPLE'`), it throws an error saying that
such partitioning is not supported in the Hive CREATE TABLE command (due to
the ':' in the field names, I guess). Is there a workaround for this? Is it
possible to use the CustomKeyGenerator partition values for Hive as well?
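For reference, the timestamp conversion the key generator is configured to perform (per the `hoodie.deltastreamer.keygen.timebased.*` settings in the command below) can be sketched in Python; the sample value is hypothetical:

````python
from datetime import datetime

# Input format configured below: yyyy-MM-dd'T'HH:mm:ss.SSSZ (Java syntax).
# The Python equivalent uses %f for the fractional seconds and %z for the
# zone offset.
raw = "2023-02-16T05:47:05.316+0000"  # hypothetical sample value
parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%f%z")

# Output format configured below: yyyyMMddHH
partition_value = parsed.strftime("%Y%m%d%H")
print(partition_value)  # → 2023021605
````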
**To Reproduce**
Use the HoodieDeltaStreamer tool with Kafka as the source and S3 as the
target sink. The command we run:
````
spark-submit --jars
/usr/lib/spark/jars/spark-avro.jar,/usr/lib/hudi/hudi-utilities-bundle.jar
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem --class
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer
/usr/lib/hudi/hudi-utilities-bundle.jar --source-class
org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field
datecreated --table-type MERGE_ON_READ --target-table testfolder
--target-base-path s3a://bucket/data/testfolder/ --source-limit 1000
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
--hoodie-conf
hoodie.deltastreamer.schemaprovider.source.schema.file=s3a://bucket/config/schema.avsc
--hoodie-conf auto.offset.reset=earliest --hoodie-conf group.id=test-group
--hoodie-conf bootstrap.servers=127.0.0.1:9092 --hoodie-conf
hoodie.deltastreamer.source.kafka.topic=test --hoodie-conf
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
--hoodie-conf hoodie.datasource.write.recordkey.field=sid --hoodie-conf
hoodie.datasource.write.hive_style_partitioning=true --hoodie-conf
hoodie.datasource.write.partitionpath.field='datecreated:TIMESTAMP,tenant:SIMPLE'
--hoodie-conf hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING
--hoodie-conf
hoodie.deltastreamer.keygen.timebased.input.dateformat="yyyy-MM-dd'T'HH:mm:ss.SSSZ"
--hoodie-conf
hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyyMMddHH
--hoodie-conf hoodie.datasource.hive_sync.assume_date_partitioning=false
--hoodie-conf hoodie.datasource.hive_sync.database=testinghudi --hoodie-conf
hoodie.datasource.hive_sync.table=testfolder --hoodie-conf
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
--hoodie-conf
hoodie.datasource.hive_sync.partition_fields='datecreated,tenant' --enable-sync
````
**Environment Description**
* Hudi version : 0.12.1
* Spark version : 3.3.0
* Hive version : 3.1.3
* Hadoop version : 3.3.3
* EMR version : 6.9.0
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : No. Running on EMR using HoodieDeltaStreamer
tool
**Additional context**
We found another issue along similar lines, but it was closed without a
conclusive solution:
https://github.com/apache/hudi/issues/2252