Ryan,

I found data files did a full copy (deep copy) of all stats from the
manifest file when rowFilter is true. With a large number of data files, so
much memory could be taken up by stats like valueCounts.
I also attached snapshots of the heap dump in the GitHub issue comments.
Please help confirm.

Thanks,
Manu

On Fri, Sep 9, 2022 at 7:09 AM Manu Zhang <owenzhang1...@gmail.com> wrote:

> Thanks Ryan for explanation. Yes, I got it wrong and it’s manifest columns
> rather than data columns. I’ll try your suggestions and get back.
>
> Manu
>
> Ryan Blue <b...@tabular.io>于2022年9月8日 周四03:39写道:
>
>> Manu,
>>
>> The check that you linked to where stats aren’t dropped is when someone
>> is asking for all columns from a manifest file, not when your data query is
>> requesting all columns. In the case of your query, Spark is not asking for
>> stats columns. They will be used for filtering, but will be dropped before
>> passing the DataFile to the scan as a matching result file.
>>
>> I’ll post a more detailed reply on the issue, but when we’ve seen this
>> issue in the past the problem is usually that your planning parallelism is
>> high (based on the environment) and the parallel planning is adding them to
>> a queue. You can avoid that by setting iceberg.worker.num-threads=2 (or
>> something small) or disabling parallel planning by setting
>> iceberg.scan.plan-in-worker-pool=false. Both of those are Java system
>> properties.
>>
>> Ryan
>>
>> On Tue, Sep 6, 2022 at 11:06 PM Manu Zhang <owenzhang1...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> It looks scanning all columns of an iceberg table in Spark could cause
>>> memory issue in the driver by keeping all the stats.
>>>
>>> *select * from iceberg_table limit 10;*
>>>
>>> I also created https://github.com/apache/iceberg/issues/5706 with more
>>> details.
>>> Is there any reason not to drop stats
>>> <https://github.com/apache/iceberg/blob/apache-iceberg-0.13.1/core/src/main/java/org/apache/iceberg/ManifestReader.java#L292>
>>> when columns contain ALL_COLUMNS(*)?
>>>
>>> Thanks,
>>> Manu
>>>
>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>

Reply via email to