jenu9417 opened a new issue, #7991:
URL: https://github.com/apache/hudi/issues/7991
**Problem**
We are ingesting data from Kafka into an S3 bucket via the
HoodieDeltaStreamer tool, running on EMR v6.9.0 (Hudi v0.12.1). We have
enabled Hive sync of partitions to Glue. Writing data from Kafka,
partitioning, and syncing metadata to Glue all work well.
However, when we analyze the S3 request counts, we see an abnormally high
number of HEAD requests to S3, and we are not sure what triggers them.
When we inspected the S3 access logs, some of the HEAD requests were:
````
HEAD /data/testfolder/.hoodie/20230216054705316.deltacommit HTTP/1.1" 404
NoSuchKey
HEAD
/data/testfolder/.hoodie/metadata/.hoodie/.aux/.bootstrap/.partitions/00000000-0000-0000-0000-000000000000-0_1-0-1_00000000000001.hfile
HTTP/1.1" 404 NoSuchKey
HEAD /data/testfolder/.hoodie/metadata/.hoodie/hoodie.properties HTTP/1.1"
200
````
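The HEAD-to-total ratio mentioned below was obtained by tallying request types from the S3 access logs. A minimal sketch of that tally (the sample lines mirror the excerpts above; real S3 server access log entries carry more fields):

````python
import re
from collections import Counter

# Sample entries in the shape of the excerpts above; in practice these
# would be read from the downloaded S3 server access log files.
log_lines = [
    'HEAD /data/testfolder/.hoodie/20230216054705316.deltacommit HTTP/1.1" 404',
    'HEAD /data/testfolder/.hoodie/metadata/.hoodie/hoodie.properties HTTP/1.1" 200',
    'GET /data/testfolder/.hoodie/hoodie.properties HTTP/1.1" 200',
]

# Pull the HTTP verb off the front of each entry and count per method.
verb = re.compile(r'^(GET|HEAD|PUT|POST|DELETE)\b')
counts = Counter(m.group(1) for line in log_lines if (m := verb.match(line)))
total = sum(counts.values())
for method, n in counts.most_common():
    print(f"{method}: {n} ({100 * n / total:.0f}%)")
````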
Can someone please help us understand what these requests are and why they
are made? Is there a way to optimize/reduce them? For context, roughly two
out of every three API requests we make are HEAD requests.
1) Please help us understand the HEAD requests and how we can reduce them.
We also have a second question:
2) We use `CustomKeyGenerator` to format the partition value via a
timestamp-based conversion. This works when we write directly to S3, but
when we enable Hive sync with the same partition format
(`'datecreated:TIMESTAMP,tenant:SIMPLE'`), it throws an error saying that
such partitioning is not supported in the Hive CREATE TABLE command (due to
the ':' in the field names, I guess). Is there a workaround for this? Is it
possible to use the CustomKeyGenerator partition values for Hive as well?
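For reference, the timestamp conversion the key generator is configured to perform (per the `hoodie.deltastreamer.keygen.timebased.*` settings in the command below) can be sketched in Python; the sample value is hypothetical:

````python
from datetime import datetime

# Input format configured below: yyyy-MM-dd'T'HH:mm:ss.SSSZ (Java syntax).
# The Python equivalent uses %f for the fractional seconds and %z for the
# zone offset.
raw = "2023-02-16T05:47:05.316+0000"  # hypothetical sample value
parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%f%z")

# Output format configured below: yyyyMMddHH
partition_value = parsed.strftime("%Y%m%d%H")
print(partition_value)  # → 2023021605
````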
**To Reproduce**
Use the HoodieDeltaStreamer tool with Kafka as the source and S3 as the
target sink. The command we run:
````
spark-submit --jars
/usr/lib/spark/jars/spark-avro.jar,/usr/lib/hudi/hudi-utilities-bundle.jar
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem --class
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer
/usr/lib/hudi/hudi-utilities-bundle.jar --source-class
org.apache.hudi.utilities.sources.JsonKafkaSource --source-ordering-field
datecreated --table-type MERGE_ON_READ --target-table testfolder
--target-base-path s3a://bucket/data/testfolder/ --source-limit 1000
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
--hoodie-conf
hoodie.deltastreamer.schemaprovider.source.schema.file=s3a://bucket/config/schema.avsc
--hoodie-conf auto.offset.reset=earliest --hoodie-conf group.id=test-group
--hoodie-conf bootstrap.servers=127.0.0.1:9092 --hoodie-conf
hoodie.deltastreamer.source.kafka.topic=test --hoodie-conf
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
--hoodie-conf hoodie.datasource.write.recordkey.field=sid --hoodie-conf
hoodie.datasource.write.hive_style_partitioning=true --hoodie-conf
hoodie.datasource.write.partitionpath.field='datecreated:TIMESTAMP,tenant:SIMPLE'
--hoodie-conf hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING
--hoodie-conf
hoodie.deltastreamer.keygen.timebased.input.dateformat="yyyy-MM-dd'T'HH:mm:ss.SSSZ"
--hoodie-conf
hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyyMMddHH
--hoodie-conf hoodie.datasource.hive_sync.assume_date_partitioning=false
--hoodie-conf hoodie.datasource.hive_sync.database=testinghudi --hoodie-conf
hoodie.datasource.hive_sync.table=testfolder --hoodie-conf
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
--hoodie-conf
hoodie.datasource.hive_sync.partition_fields='datecreated,tenant' --enable-sync
````
**Environment Description**
* Hudi version : 0.12.1
* Spark version : 3.3.0
* Hive version : 3.1.3
* Hadoop version : 3.3.3
* EMR version : 6.9.0
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : No. Running on EMR using HoodieDeltaStreamer
tool
**Additional context**
We found another issue along similar lines, but it was closed without a
conclusive solution:
https://github.com/apache/hudi/issues/2252