hi Nicolas
This problem is caused by an older Parquet version. To fix it, you need
to make sure the Parquet version in your Spark runtime is upgraded to the
latest one. In most cases the Parquet version is determined by the Spark
version by default. Although Hudi depends on Parquet, the fix does not
touch the Parquet interfaces that Hudi uses. So you can simply upgrade
Spark to the latest version and check whether the issue is fixed, without
changing anything in Hudi.
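In case it helps, a quick way to confirm which parquet-mr build your Spark
runtime actually loads is to print it from spark-shell (a minimal sketch;
the jar location shown will depend on your deployment):

    // Paste into spark-shell: prints the bundled parquet-mr version
    // and the jar it was loaded from.
    println(org.apache.parquet.Version.FULL_VERSION)
    println(classOf[org.apache.parquet.Version].getProtectionDomain.getCodeSource.getLocation)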
From: "nicolas paris"<nicolas.pa...@riseup.net>
Date: Mon, Nov 20, 2023, 20:07
Subject: [External] Current state of parquet zstd OOM with hudi
To: "Hudi Dev List"<dev@hudi.apache.org>
Hey, some months ago someone spotted a memory leak while reading zstd files
with Hudi: https://github.com/apache/parquet-mr/pull/982#issuecomment-1376498280
Since then Spark has merged fixes for 3.2.4, 3.3.3 and 3.4.0:
https://issues.apache.org/jira/browse/SPARK-41952

We are currently on Spark 3.2.4 and Hudi 0.13.1 and are hitting a similar
issue (massive off-heap usage) while scanning very large Hudi tables backed
with zstd. What is the state of this issue? Is there any patch to apply on
the Hudi side as well, or can I consider it fixed by using Spark 3.2.4?
I attach a graph from the Uber jvm-profiler to illustrate our current
troubles. Thanks in advance.
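For illustration, a minimal sketch of the kind of snapshot read where this
zstd decompression path gets exercised (the table path is a placeholder,
not our actual job):

    // spark-shell sketch: snapshot-read a Hudi table whose parquet files
    // are zstd-compressed; a full scan decompresses every zstd page.
    val df = spark.read.format("hudi").load("s3://bucket/warehouse/my_table")  // placeholder path
    df.selectExpr("count(*)").show()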
