alexone95 opened a new issue, #8715:
URL: https://github.com/apache/hudi/issues/8715

   Hello, we found that Hudi was spending a lot of time requesting files in the 
/archived directory, so to reduce this we built a solution that deletes the 
files in the archive daily. The solution works well for reducing commit 
latency, but since we deployed it the number of REST.GET.BUCKET requests has 
increased a lot. In particular, from a single table we get 1 million requests 
per day, of which 900k are GET requests for this path: 
/hudiTable/.hoodie/.aux/.bootstrap/.partitions/00000000-0000-0000-0000-000000000000-0_1-0-1_00000000000001.hfile/.
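
For anyone who wants to reproduce a per-path breakdown like the one above, here is a minimal sketch that tallies requests by operation and target. It assumes the numbers come from S3 server access logs in the standard space-delimited layout, and that listing requests carry their path in the `prefix` query parameter. The names `LOG_LINE`, `request_target` and `tally_requests` are made up for this illustration and are not part of Hudi or any AWS SDK.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Leading fields of an S3 server access log record (assumed layout):
# owner bucket [time] remote_ip requester request_id operation key "request_uri" ...
LOG_LINE = re.compile(
    r'^\S+ \S+ \[[^\]]+\] \S+ \S+ \S+ '
    r'(?P<operation>\S+) (?P<key>\S+) "(?P<request>[^"]*)"'
)


def request_target(key, request):
    """Return the object key, or the `prefix` query parameter when the key is '-'."""
    if key != "-":
        return key
    parts = request.split(" ")
    if len(parts) < 2:
        return "-"
    # parts[1] is the request URI, e.g. /bucket?list-type=2&prefix=...
    query = parse_qs(urlsplit(parts[1]).query)
    return query.get("prefix", ["-"])[0]


def tally_requests(lines):
    """Count log records per (operation, target); lines that do not parse are skipped."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue
        target = request_target(match["key"], match["request"])
        counts[(match["operation"], target)] += 1
    return counts
```

Calling `tally_requests(open(path))` on a downloaded log file and then `.most_common(10)` on the result would show which operation and path dominate, which makes it easier to compare request volume before and after a configuration change.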
   
   We read INSERT, UPDATE and DELETE operations from a Kafka topic and 
replicate them in a target Hudi table stored on Hive via a PySpark job running 
24/7.
   
   Why do I get this behavior? Is there something I can do to reduce the 
number of requests?
   
   **Environment Description**
   
       Hudi version : 0.12.1-amzn-0
       Spark version : 3.3.0
       Hive version : 3.1.3
       Hadoop version : 3.3.3 amz
       Storage (HDFS/S3/GCS..) : S3
       Running on Docker? (yes/no) : no (EMR 6.9.0)
   
   **Additional context**
   
   HOODIE TABLE PROPERTIES:
   'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
   'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
   'hoodie.datasource.write.hive_style_partitioning': 'true',
   'hoodie.index.type': 'GLOBAL_BLOOM',
   'hoodie.simple.index.update.partition.path': 'true',
   'hoodie.datasource.hive_sync.enable': 'true',
   'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
   'hoodie.datasource.hive_sync.use_jdbc': 'false',
   'hoodie.datasource.hive_sync.mode': 'hms',
   'hoodie.copyonwrite.record.size.estimate': 285,
   'hoodie.parquet.small.file.limit': 104857600,
   'hoodie.parquet.max.file.size': 120000000,
   'hoodie.cleaner.commits.retained': 1
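
For context, a sketch of how properties like these are typically passed to a PySpark Hudi write. This is not the actual job: the table name, key fields, base path and the `upsert` operation are not given in this report, so they are left as parameters, and `build_hudi_options` and `write_batch` are hypothetical helper names.

```python
# Options copied from the list above.
REPORTED_OPTIONS = {
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.index.type": "GLOBAL_BLOOM",
    "hoodie.simple.index.update.partition.path": "true",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    "hoodie.datasource.hive_sync.use_jdbc": "false",
    "hoodie.datasource.hive_sync.mode": "hms",
    "hoodie.copyonwrite.record.size.estimate": 285,
    "hoodie.parquet.small.file.limit": 104857600,
    "hoodie.parquet.max.file.size": 120000000,
    "hoodie.cleaner.commits.retained": 1,
}


def build_hudi_options(table_name, record_key, partition_field,
                       precombine_field, operation="upsert"):
    """Merge the reported options with the per-table ones a Hudi write also needs.

    Every argument is an assumption: none of these values appear in the report.
    """
    per_table = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.operation": operation,
    }
    merged = {**REPORTED_OPTIONS, **per_table}
    # Hand Spark plain strings so numeric values are passed unambiguously.
    return {key: str(value) for key, value in merged.items()}


def write_batch(df, base_path, options):
    """Append one batch (a Spark DataFrame) to the Hudi table at base_path."""
    df.write.format("hudi").options(**options).mode("append").save(base_path)
```

Keeping the options in one dictionary makes it easy to change a single setting and compare the S3 request counts between runs.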
   

