Hi, thanks for your answer. Do you mean upgrading to parquet 0.13.x? BTW, Spark introduced a workaround in 3.2.4. Do you mean Hudi bypasses the workaround? Thanks.
Nov 20, 2023 13:37:58 管梓越 <guanziyue....@bytedance.com.INVALID>:

> Hi Nicolas,
>
> This problem is caused by the historical Parquet version. To fix it, you need
> to ensure the Parquet version in your Spark runtime is upgraded to the latest
> one. In most cases, the Parquet version is determined by the Spark version by
> default. Although Hudi depends on Parquet, the fix does not touch the Parquet
> interfaces that Hudi uses. You can simply upgrade Spark to the latest version
> and check whether it is fixed, without changing anything in Hudi.
>
> From: "nicolas paris" <nicolas.pa...@riseup.net>
> Date: Mon, Nov 20, 2023, 20:07
> Subject: [External] Current state of parquet zstd OOM with hudi
> To: "Hudi Dev List" <dev@hudi.apache.org>
>
> Hey, a month ago someone spotted a memory leak while reading zstd files with
> Hudi: https://github.com/apache/parquet-mr/pull/982#issuecomment-1376498280
>
> Since then, Spark has merged fixes for 3.2.4, 3.3.3, and 3.4.0:
> https://issues.apache.org/jira/browse/SPARK-41952
>
> We are currently on Spark 3.2.4 and Hudi 0.13.1 and are seeing a similar issue
> (massive off-heap usage) while scanning very large Hudi tables backed with zstd.
> What is the state of this issue? Is there any patch to apply on the Hudi side
> as well, or can I consider it fixed by using Spark 3.2.4? I attach a graph from
> the Uber jvm-profiler to illustrate our current troubles.
>
> Thanks in advance
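(Not part of the original thread, just a sketch of how one might verify the advice above: since the Parquet version is usually pulled in by the Spark distribution, the snippet below checks which parquet-hadoop jar and version the Spark runtime actually loads, e.g. from a spark-shell. It only relies on standard JDK reflection and Parquet's org.apache.parquet.Version class; adapt as needed to your deployment.)

    // Run in spark-shell to see which parquet-hadoop jar Spark loaded
    import org.apache.parquet.hadoop.ParquetFileReader

    val jarLocation = classOf[ParquetFileReader]
      .getProtectionDomain
      .getCodeSource
      .getLocation
    println(s"parquet-hadoop loaded from: $jarLocation")

    // Parquet also exposes its own version string
    println(s"parquet version: ${org.apache.parquet.Version.FULL_VERSION}")

If the reported version predates the zstd fix referenced in the parquet-mr PR above, the leak would still be present regardless of the Hudi version, which matches the suggestion to upgrade the Spark runtime rather than Hudi itself.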