Thanks Ryan for the explanation. Yes, I got it wrong: it's manifest columns rather than data columns. I'll try your suggestions and get back.
Manu

Ryan Blue <b...@tabular.io> wrote on Thu, Sep 8, 2022 at 03:39:

> Manu,
>
> The check that you linked to, where stats aren't dropped, applies when someone is asking for all columns from a manifest file, not when your data query is requesting all columns. In the case of your query, Spark is not asking for stats columns. They will be used for filtering, but will be dropped before passing the DataFile to the scan as a matching result file.
>
> I'll post a more detailed reply on the issue, but when we've seen this issue in the past, the problem is usually that your planning parallelism is high (based on the environment) and the parallel planning is adding them to a queue. You can avoid that by setting iceberg.worker.num-threads=2 (or something small) or by disabling parallel planning with iceberg.scan.plan-in-worker-pool=false. Both of those are Java system properties.
>
> Ryan
>
> On Tue, Sep 6, 2022 at 11:06 PM Manu Zhang <owenzhang1...@gmail.com> wrote:
>
>> Hi all,
>>
>> It looks like scanning all columns of an Iceberg table in Spark can cause a memory issue in the driver by keeping all the stats.
>>
>> *select * from iceberg_table limit 10;*
>>
>> I also created https://github.com/apache/iceberg/issues/5706 with more details.
>> Is there any reason not to drop stats
>> <https://github.com/apache/iceberg/blob/apache-iceberg-0.13.1/core/src/main/java/org/apache/iceberg/ManifestReader.java#L292>
>> when columns contain ALL_COLUMNS(*)?
>>
>> Thanks,
>> Manu
>
>
> --
> Ryan Blue
> Tabular
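
For anyone else following this thread, here is a minimal sketch of one way to pass those Java system properties to the Spark driver, where planning runs. It assumes spark-submit and the standard spark.driver.extraJavaOptions mechanism; adjust for your own deployment, and the trailing "..." stands for the rest of your usual submit arguments.

    # cap the Iceberg worker pool used for parallel scan planning
    spark-submit \
      --conf "spark.driver.extraJavaOptions=-Diceberg.worker.num-threads=2" \
      ...

    # or disable parallel planning entirely
    spark-submit \
      --conf "spark.driver.extraJavaOptions=-Diceberg.scan.plan-in-worker-pool=false" \
      ...

If you already set other driver JVM options, append the -D flags to the existing spark.driver.extraJavaOptions value rather than overwriting it.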