[GitHub] [hudi] VitoMakarevich opened a new issue, #7734: [SUPPORT] Increased rate of object storage calls after upgrade from 0.11.0 to 0.12.1

via GitHub Mon, 23 Jan 2023 06:24:22 -0800


VitoMakarevich opened a new issue, #7734:
URL: https://github.com/apache/hudi/issues/7734


   **Describe the problem you faced**
   
   Hello. We recently upgraded Hudi from 0.11.0 to 0.12.1 and have seen 
performance degrade since. We have no clear reproduction yet, so for now we 
want to verify what we actually observe. One observation is that our S3 
request rates grew significantly (by a few orders of magnitude). Only HEAD and 
GET counts increased; the rest (POST/LIST/DELETE) look the same. The bytes 
downloaded also look the same. I'm now checking which calls are most frequent, 
but we cannot compare against the old version because we didn't collect data 
at that granularity before. I suspect a bloom-filter issue that causes the 
same data to be loaded repeatedly, but I'm not familiar enough with the 
internals to be sure. I also suspected failed tasks, but we have relatively 
few of them (and had about as few before the upgrade).
   <img width="1345" alt="image" 
src="https://user-images.githubusercontent.com/15978165/214061278-59628cd8-9106-46c0-969c-4198fb33b877.png">
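
A minimal sketch of how the per-operation breakdown mentioned above could be collected. It assumes S3 server access logging is enabled on the table's bucket (not stated in this report) and relies on the documented access-log layout, where the operation (for example `REST.HEAD.OBJECT`) and the object key are the seventh and eighth fields. The helper names `classify_key` and `count_requests` and the key categories are hypothetical, chosen only for illustration.

```python
import re
from collections import Counter

# Leading fields of an S3 server access log record:
# owner, bucket, [time], remote IP, requester, request ID, operation, key, ...
LOG_PREFIX = re.compile(
    r'^\S+ \S+ \[[^\]]+\] \S+ \S+ \S+ (?P<operation>\S+) (?P<key>\S+) '
)


def classify_key(key):
    """Rough grouping of object keys in a Hudi table (illustrative only)."""
    if "/.hoodie/" in key or key.startswith(".hoodie/"):
        return "hoodie_dir"          # timeline, markers, and other metadata
    if key.endswith(".hoodie_partition_metadata"):
        return "partition_metadata"
    if key.endswith(".parquet"):
        return "base_file"           # data files, whose footers hold bloom filters
    return "other"


def count_requests(lines):
    """Count requests per (operation, key category); skip unparsable lines."""
    counts = Counter()
    for line in lines:
        match = LOG_PREFIX.match(line)
        if match is None:
            continue
        counts[(match.group("operation"), classify_key(match.group("key")))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_s3_ops.py < access_log_file
    for (operation, category), n in count_requests(sys.stdin).most_common(20):
        print(f"{n:10d}  {operation:24s} {category}")
```

If most of the extra HEAD/GET requests land on `base_file` keys, that would be consistent with the bloom-filter suspicion; a concentration under `hoodie_dir` would instead point at timeline or marker handling.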
   Our Spark settings are:
   `
           "hoodie.datasource.write.table.type" = "COPY_ON_WRITE"
           "hoodie.datasource.write.recordkey.field" = "hkey"
           "hoodie.datasource.write.precombine.field" = "hkey"
           "hoodie.datasource.write.partitionpath.field" = "root_account_uuid"
           "hoodie.datasource.write.drop.partition.columns" = "true"
           "hoodie.datasource.write.hive_style_partitioning" = "true"
           "hoodie.finalize.write.parallelism" = "200"
           "hoodie.upsert.shuffle.parallelism" = "200"
           "hoodie.insert.shuffle.parallelism" = "200"
           "hoodie.bulkinsert.shuffle.parallelism" = "200"
           "hoodie.compact.inline" = "false"
           "hoodie.clean.automatic" = "true"
           "hoodie.cleaner.policy" = "KEEP_LATEST_BY_HOURS"
           "hoodie.cleaner.hours.retained" = "12"
           "hoodie.cleaner.commits.retained" = "180"
           "hoodie.metadata.cleaner.commits.retained" = "180"
           "hoodie.keep.min.commits" = "200"
           "hoodie.keep.max.commits" = "240"
           "hoodie.clustering.inline" = "false"
           "hoodie.clustering.inline.max.commits" = "4"
           "hoodie.clustering.plan.strategy.target.file.max.bytes" = "1073741824"
           "hoodie.clustering.plan.strategy.small.file.limit" = "629145600"
           "hoodie.metadata.enable" = "false"
           "hoodie.metadata.keep.min.commits" = "12"
           "hoodie.metadata.keep.max.commits" = "24"
           "hoodie.datasource.compaction.async.enable" = "false"
           "hoodie.write.markers.type" = "DIRECT"
           "hoodie.embed.timeline.server" = "true"
           "hoodie.index.type" = "BLOOM"
           "hoodie.bloom.index.update.partition.path" = "true"
           "hoodie.compact.inline.max.delta.seconds" = "7200"
           "hoodie.compact.inline.trigger.strategy" = "TIME_ELAPSED"
           "hoodie.copyonwrite.insert.split.size" = "50000"
           "hoodie.bloom.index.prune.by.ranges" = "true"
           "hoodie.memory.merge.max.size" = "8589934592"
           "hoodie.datasource.write.insert.drop.duplicates" = "false"
           "hoodie.metrics.on" = "true"
           "hoodie.metrics.reporter.type" = "JMX"
           "hoodie.datasource.hive_sync.partition_fields" = "root_account_uuid"
           "hoodie.datasource.hive_sync.mode" = "hms"
           "hoodie.datasource.hive_sync.enable" = "true"
           "hoodie.datasource.hive_sync.database" = "${glue_database}"
   `
   
   Are you aware of any degradation like this?
   
   **To Reproduce**
   
   
   **Expected behavior** 
   S3 HEAD/GET request rates should stay roughly the same after the upgrade.
   
   **Environment Description** We upgraded from EMR 6.7 (Hudi 0.11.0) to EMR 
6.9 (Hudi 0.12.1).
   
   * Hudi version : 0.12.1
   
   * Spark version : 3.3.0
   
   * Hive version : - 
   
   * Hadoop version : 3.3.3
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : No
   
   
   
   


