mzheng-plaid opened a new issue, #12434:
URL: https://github.com/apache/hudi/issues/12434
**Describe the problem you faced**
We have jobs that read from a MOR table using the following pyspark
pseudo-code (`event_table_rt` is the MOR table):
```python
from pyspark.sql import functions as F

# Read the MOR table and keep only the target partitions
partitions = ["2023-11-13", "2023-11-14", "2023-11-15", "2023-11-16",
              "2023-11-17"]
event_df = spark.sql("select * from event_table_rt").filter(
    F.col("dt").isin(partitions)
)

# Broadcast-join against a small CSV of users, then write out as parquet
user_df = spark.read.format("csv").option("header", "true").load(users_path)
filtered_events_df = event_df.join(
    F.broadcast(user_df),
    on=event_df["user_id"] == user_df["id"],
    how="inner",
)
filtered_events_df.write.format("parquet").save("s3://...")
```
We're running into a bottleneck on `HoodieMergeOnReadRDD`
(https://github.com/apache/hudi/blob/release-0.14.2/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieMergeOnReadRDD.scala#L37),
where the number of tasks in the stage reading `event_df` appears to be
non-configurable and (I think) equal to the number of files being read. This is
causing massive disk/memory spill and bottlenecking performance.
Is it possible to configure a higher read parallelism, or is this a
fundamental limitation of Hudi with MOR tables? What is the recommended way to
tune resources for readers of MOR tables?
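For context, the only mitigation we have found so far is a plain Spark-side repartition after the scan. This is just standard Spark API, not a Hudi setting, and it is only a sketch: the partition count (2000 here) is a made-up placeholder, and it does not change the task count of the MOR scan stage itself, only the parallelism of the downstream join/write stages:

```python
from pyspark.sql import functions as F

# Sketch of a Spark-side mitigation. Assumes the `spark` session, `partitions`
# list, and `event_table_rt` table from the job above; 2000 is a hypothetical
# target partition count, not a recommendation.
# Note: the shuffle happens *after* the scan, so the MOR read stage still runs
# with one task per file; only later stages see the higher parallelism.
event_df = (
    spark.sql("select * from event_table_rt")
    .filter(F.col("dt").isin(partitions))
    .repartition(2000, "user_id")  # shuffle into more, evenly sized partitions
)
```

If there is a Hudi-native way to raise the scan parallelism instead of paying for this extra shuffle, that would be preferable.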
**Environment Description**
* Hudi version : 0.14.1-amzn-1 (EMR 7.2.0)
* Spark version : 3.5.1
* Hive version : 3.1.3
* Hadoop version : 3.3.6
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no