Re: Spark driver memory issue when scanning all columns of a iceberg table

Ryan Blue Wed, 07 Sep 2022 12:39:05 -0700

Manu,

The check that you linked to, where stats aren’t dropped, applies when
someone asks for all columns from a manifest file, not when your data query
requests all columns of the table. For your query, Spark is not asking for
the stats columns. Stats will be used for filtering, but they will be dropped
before the DataFile is passed to the scan as a matching result file.

I’ll post a more detailed reply on the issue. When we’ve seen this problem
in the past, the cause was usually that planning parallelism was high (it is
based on the environment) and the parallel planning threads were adding
files to a queue. You can avoid that by setting
iceberg.worker.num-threads=2 (or some other small number) or by disabling
parallel planning with iceberg.scan.plan-in-worker-pool=false. Both of
those are Java system properties.
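
For illustration only, here is a minimal Python sketch of one way to pass
those properties to the driver JVM. It assumes planning runs in the Spark
driver and that the properties have to be present when the JVM starts, so
they are rendered as -D options for spark.driver.extraJavaOptions. The helper
names (driver_java_options, spark_submit_conf) are hypothetical and are not
part of Spark or Iceberg.

```python
# Sketch: render Java system properties as -D options and wrap them in the
# spark-submit flag that sets JVM options for the driver.
# Helper names are hypothetical; only the property keys come from the email.

def driver_java_options(props):
    """Render a dict of system properties as a '-Dkey=value ...' string."""
    return " ".join(f"-D{key}={value}" for key, value in props.items())


def spark_submit_conf(props):
    """Return spark-submit arguments that set the driver's JVM options."""
    return ["--conf", f"spark.driver.extraJavaOptions={driver_java_options(props)}"]


# Pick one of the two options described above.
limit_threads = {"iceberg.worker.num-threads": "2"}
disable_parallel_planning = {"iceberg.scan.plan-in-worker-pool": "false"}

print(" ".join(spark_submit_conf(limit_threads)))
print(" ".join(spark_submit_conf(disable_parallel_planning)))
```

The printed arguments can be appended to a spark-submit command line. In
client mode the driver JVM is already running by the time application code
sets configuration, so the same -D string would likely need to go through
--driver-java-options or spark-defaults.conf instead.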

Ryan

On Tue, Sep 6, 2022 at 11:06 PM Manu Zhang <owenzhang1...@gmail.com> wrote:

> Hi all,
>
> It looks like scanning all columns of an Iceberg table in Spark can cause
> a memory issue in the driver because all the stats are kept.
>
> *select * from iceberg_table limit 10;*
>
> I also created https://github.com/apache/iceberg/issues/5706 with more
> details.
> Is there any reason not to drop stats
> <https://github.com/apache/iceberg/blob/apache-iceberg-0.13.1/core/src/main/java/org/apache/iceberg/ManifestReader.java#L292>
> when columns contain ALL_COLUMNS(*)?
>
> Thanks,
> Manu
>

-- 
Ryan Blue
Tabular
