Hi, thanks for your answer. Do you mean upgrading to parquet 0.13.x? BTW, Spark introduced a workaround in 3.2.4. Do you mean Hudi bypasses the workaround? Thanks.
Nov 20, 2023 13:37:58 管梓越 <guanziyue....@bytedance.com.INVALID>:

> Hi Nicolas,
>
> This problem is caused by the historical Parquet version. To fix it, you need
> to ensure the Parquet version in your Spark runtime is upgraded to the latest
> one. In most cases, the Parquet version is determined by the Spark version by
> default. Although Hudi depends on Parquet, the fix does not touch the Parquet
> interfaces that Hudi uses. You can simply upgrade Spark to the latest version
> and check whether it is fixed, without changing anything in Hudi.
>
> From: "nicolas paris" <nicolas.pa...@riseup.net>
> Date: Mon, Nov 20, 2023, 20:07
> Subject: [External] Current state of parquet zstd OOM with hudi
> To: "Hudi Dev List" <dev@hudi.apache.org>
>
> Hey, a month ago someone spotted a memory leak while reading zstd files with
> Hudi: https://github.com/apache/parquet-mr/pull/982#issuecomment-1376498280
>
> Since then, Spark has merged fixes for 3.2.4, 3.3.3, and 3.4.0:
> https://issues.apache.org/jira/browse/SPARK-41952
>
> We are currently on Spark 3.2.4 and Hudi 0.13.1 and are seeing a similar issue
> (massive off-heap usage) while scanning very large Hudi tables backed with zstd.
> What is the state of this issue? Is there any patch to apply on the Hudi side
> as well, or can I consider it fixed by using Spark 3.2.4? I attach a graph from
> the Uber jvm-profiler to illustrate our current troubles.
>
> Thanks in advance
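(Not part of the original thread, just a sketch of how one might verify the advice above: since the Parquet version is usually pulled in by the Spark distribution, the snippet below checks which parquet-hadoop jar and version the Spark runtime actually loads, e.g. from a spark-shell. It only relies on standard JDK reflection and Parquet's org.apache.parquet.Version class; adapt as needed to your deployment.)

    // Run in spark-shell to see which parquet-hadoop jar Spark loaded
    import org.apache.parquet.hadoop.ParquetFileReader

    val jarLocation = classOf[ParquetFileReader]
      .getProtectionDomain
      .getCodeSource
      .getLocation
    println(s"parquet-hadoop loaded from: $jarLocation")

    // Parquet also exposes its own version string
    println(s"parquet version: ${org.apache.parquet.Version.FULL_VERSION}")

If the reported version predates the zstd fix referenced in the parquet-mr PR above, the leak would still be present regardless of the Hudi version, which matches the suggestion to upgrade the Spark runtime rather than Hudi itself.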