Re: Spark driver memory issue when scanning all columns of a iceberg table

Manu Zhang Sun, 18 Sep 2022 18:33:04 -0700

Sorry for a typo in the previous reply.

The questions is why we don't drop stats when
*`rowFilter == Expressions.alwaysTrue()`*


On Mon, Sep 19, 2022 at 9:30 AM Manu Zhang <owenzhang1...@gmail.com> wrote:

> Hi all,
>
> Does anyone know why we don't drop stats when `rowFilter != Expressions.
> alwaysTrue()` at
> https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/ManifestReader.java#L326
> ?
> I tried removing it but all tests passed.
>
> Thanks,
> Manu
>
>
>
> On Sat, Sep 10, 2022 at 8:56 PM Manu Zhang <owenzhang1...@gmail.com>
> wrote:
>
>> Ryan,
>>
>> I found data files did a full copy (deep copy) of all stats from the
>> manifest file when rowFilter is true. With a large number of data files, so
>> much memory could be taken up by stats like valueCounts.
>> I also attached snapshots of the heap dump in the GitHub issue comments.
>> Please help confirm.
>>
>> Thanks,
>> Manu
>>
>> On Fri, Sep 9, 2022 at 7:09 AM Manu Zhang <owenzhang1...@gmail.com>
>> wrote:
>>
>>> Thanks Ryan for explanation. Yes, I got it wrong and it’s manifest
>>> columns rather than data columns. I’ll try your suggestions and get back.
>>>
>>> Manu
>>>
>>> Ryan Blue <b...@tabular.io>于2022年9月8日 周四03:39写道：
>>>
>>>> Manu,
>>>>
>>>> The check that you linked to where stats aren’t dropped is when someone
>>>> is asking for all columns from a manifest file, not when your data query is
>>>> requesting all columns. In the case of your query, Spark is not asking for
>>>> stats columns. They will be used for filtering, but will be dropped before
>>>> passing the DataFile to the scan as a matching result file.
>>>>
>>>> I’ll post a more detailed reply on the issue, but when we’ve seen this
>>>> issue in the past the problem is usually that your planning parallelism is
>>>> high (based on the environment) and the parallel planning is adding them to
>>>> a queue. You can avoid that by setting iceberg.worker.num-threads=2
>>>> (or something small) or disabling parallel planning by setting
>>>> iceberg.scan.plan-in-worker-pool=false. Both of those are Java system
>>>> properties.
>>>>
>>>> Ryan
>>>>
>>>> On Tue, Sep 6, 2022 at 11:06 PM Manu Zhang <owenzhang1...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> It looks scanning all columns of an iceberg table in Spark could cause
>>>>> memory issue in the driver by keeping all the stats.
>>>>>
>>>>> *select * from iceberg_table limit 10;*
>>>>>
>>>>> I also created https://github.com/apache/iceberg/issues/5706 with
>>>>> more details.
>>>>> Is there any reason not to drop stats
>>>>> <https://github.com/apache/iceberg/blob/apache-iceberg-0.13.1/core/src/main/java/org/apache/iceberg/ManifestReader.java#L292>
>>>>> when columns contain ALL_COLUMNS(*)?
>>>>>
>>>>> Thanks,
>>>>> Manu
>>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>

Re: Spark driver memory issue when scanning all columns of a iceberg table

Reply via email to