mzheng-plaid opened a new issue, #12434:
URL: https://github.com/apache/hudi/issues/12434
**Describe the problem you faced**
We have jobs that read from a MOR table using the following pyspark
pseudo-code (`event_table_rt` is the MOR table):
```python
from pyspark.sql import functions as F

# Read the MOR table and keep only the target partitions
partitions = ["2023-11-13", "2023-11-14", "2023-11-15", "2023-11-16",
              "2023-11-17"]
event_df = spark.sql("select * from event_table_rt").filter(
    F.col("dt").isin(partitions)
)

# Broadcast-join against a small CSV of users, then write out as parquet
user_df = spark.read.format("csv").option("header", "true").load(users_path)
filtered_events_df = event_df.join(
    F.broadcast(user_df),
    on=event_df["user_id"] == user_df["id"],
    how="inner",
)
filtered_events_df.write.format("parquet").save("s3://...")
```
We're running into a bottleneck on `HoodieMergeOnReadRDD`
(https://github.com/apache/hudi/blob/release-0.14.2/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieMergeOnReadRDD.scala#L37),
where the number of tasks in the stage reading `event_df` appears to be
non-configurable and (I think) equal to the number of files being read. This is
causing massive disk/memory spill and bottlenecking performance.
Is it possible to configure a higher read parallelism, or is this a
fundamental limitation of Hudi with MOR tables? What is the recommended way to
tune resources for readers of MOR tables?
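For context, the only mitigation we have found so far is a plain Spark-side repartition after the scan. This is just standard Spark API, not a Hudi setting, and it is only a sketch: the partition count (2000 here) is a made-up placeholder, and it does not change the task count of the MOR scan stage itself, only the parallelism of the downstream join/write stages:

```python
from pyspark.sql import functions as F

# Sketch of a Spark-side mitigation. Assumes the `spark` session, `partitions`
# list, and `event_table_rt` table from the job above; 2000 is a hypothetical
# target partition count, not a recommendation.
# Note: the shuffle happens *after* the scan, so the MOR read stage still runs
# with one task per file; only later stages see the higher parallelism.
event_df = (
    spark.sql("select * from event_table_rt")
    .filter(F.col("dt").isin(partitions))
    .repartition(2000, "user_id")  # shuffle into more, evenly sized partitions
)
```

If there is a Hudi-native way to raise the scan parallelism instead of paying for this extra shuffle, that would be preferable.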
**Environment Description**
* Hudi version : 0.14.1-amzn-1 (EMR 7.2.0)
* Spark version : 3.5.1
* Hive version : 3.1.3
* Hadoop version : 3.3.6
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no