alexone95 opened a new issue, #8715:
URL: https://github.com/apache/hudi/issues/8715
Hello, we were facing a problem where Hudi spent a lot of time requesting
files in the /archived directory. To reduce this, we built a solution that
deletes the files in the archive daily.
The solution works well for reducing commit latency, but since we deployed
it the number of REST.GET.BUCKET requests has increased a lot. In particular,
a single table generates about 1 million requests per day, of which 900k are
GET requests for this path:
/hudiTable/.hoodie/.aux/.bootstrap/.partitions/00000000-0000-0000-0000-000000000000-0_1-0-1_00000000000001.hfile/.
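The share quoted above (900k of 1 million) can be checked with a quick tally over request records. This is a minimal sketch, assuming the S3 access logs have already been parsed into (operation, path) pairs. The `count_requests` helper and the `records` sample are hypothetical and only illustrate the counting; they are not real log data.

```python
from collections import Counter

# Path taken from the report above; requests under it are counted separately.
BOOTSTRAP_PATH = (
    "/hudiTable/.hoodie/.aux/.bootstrap/.partitions/"
    "00000000-0000-0000-0000-000000000000-0_1-0-1_00000000000001.hfile/"
)


def count_requests(records, path_prefix=BOOTSTRAP_PATH):
    """Return (per-operation counts, number of records under path_prefix)."""
    by_operation = Counter()
    under_prefix = 0
    for operation, path in records:
        by_operation[operation] += 1
        if path.startswith(path_prefix):
            under_prefix += 1
    return by_operation, under_prefix


# Hypothetical sample of already-parsed log records.
records = [
    ("REST.GET.BUCKET", BOOTSTRAP_PATH),
    ("REST.GET.BUCKET", BOOTSTRAP_PATH),
    ("REST.GET.BUCKET", "/hudiTable/.hoodie/"),
    ("REST.PUT.OBJECT", "/hudiTable/part=1/file.parquet"),
]
by_operation, under_prefix = count_requests(records)
share = under_prefix / len(records)  # fraction of requests hitting the bootstrap path
```

Running the same tally per day, before and after the archive cleanup was deployed, would show whether the increase is concentrated on this one path.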
We read INSERT, UPDATE and DELETE operations from a Kafka topic and
replicate them in a target Hudi table stored on Hive via a PySpark job running
24/7.
Why do I get this behavior? Is there something I can do to reduce the
number of requests?
**Environment Description**
Hudi version : 0.12.1-amzn-0
Spark version : 3.3.0
Hive version : 3.1.3
Hadoop version : 3.3.3 amz
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no (EMR 6.9.0)
**Additional context**
HOODIE TABLE PROPERTIES:
'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
'hoodie.datasource.write.hive_style_partitioning': 'true',
'hoodie.index.type': 'GLOBAL_BLOOM',
'hoodie.simple.index.update.partition.path': 'true',
'hoodie.datasource.hive_sync.enable': 'true',
'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
'hoodie.datasource.hive_sync.use_jdbc': 'false',
'hoodie.datasource.hive_sync.mode': 'hms',
'hoodie.copyonwrite.record.size.estimate': 285,
'hoodie.parquet.small.file.limit': 104857600,
'hoodie.parquet.max.file.size': 120000000,
'hoodie.cleaner.commits.retained': 1
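
For context, here is a minimal sketch of how properties like these are typically passed to a Hudi write from PySpark. The `write_batch` helper and its `table_name` and `base_path` arguments are hypothetical, and `df` is assumed to be a Spark DataFrame built from the Kafka records. The actual job may be structured differently.

```python
# Table properties copied from the list above.
HUDI_OPTIONS = {
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.index.type": "GLOBAL_BLOOM",
    "hoodie.simple.index.update.partition.path": "true",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    "hoodie.datasource.hive_sync.use_jdbc": "false",
    "hoodie.datasource.hive_sync.mode": "hms",
    "hoodie.copyonwrite.record.size.estimate": 285,
    "hoodie.parquet.small.file.limit": 104857600,
    "hoodie.parquet.max.file.size": 120000000,
    "hoodie.cleaner.commits.retained": 1,
}


def write_batch(df, table_name, base_path, options=HUDI_OPTIONS):
    """Append one batch to the Hudi table at base_path (hypothetical helper)."""
    (
        df.write.format("hudi")
        .options(**options)
        .option("hoodie.table.name", table_name)
        .mode("append")
        .save(base_path)
    )
```

Keeping the options in one dictionary makes it easier to change a single property between runs and compare the resulting request counts.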