Re: Spark driver memory issue when scanning all columns of a iceberg table

Ryan Blue Mon, 19 Sep 2022 11:20:16 -0700

I just did some digging and I think that the reason for checking whether
the row filter is always true is to match the logic when we add stats
columns. We only add stats columns if the row filter is non-trivial (not
always true) so we probably carried that over when removing stats columns.
If tests are passing without this, then I think it would be safe to drop
that requirement.


Ryan

On Sun, Sep 18, 2022 at 6:33 PM Manu Zhang <owenzhang1...@gmail.com> wrote:

> Sorry for a typo in the previous reply.
>
> The questions is why we don't drop stats when
> *`rowFilter == Expressions.alwaysTrue()`*
>
> On Mon, Sep 19, 2022 at 9:30 AM Manu Zhang <owenzhang1...@gmail.com>
> wrote:
>
>> Hi all,
>>
>> Does anyone know why we don't drop stats when `rowFilter != Expressions.
>> alwaysTrue()` at
>> https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/ManifestReader.java#L326
>> ?
>> I tried removing it but all tests passed.
>>
>> Thanks,
>> Manu
>>
>>
>>
>> On Sat, Sep 10, 2022 at 8:56 PM Manu Zhang <owenzhang1...@gmail.com>
>> wrote:
>>
>>> Ryan,
>>>
>>> I found data files did a full copy (deep copy) of all stats from the
>>> manifest file when rowFilter is true. With a large number of data files, so
>>> much memory could be taken up by stats like valueCounts.
>>> I also attached snapshots of the heap dump in the GitHub issue comments.
>>> Please help confirm.
>>>
>>> Thanks,
>>> Manu
>>>
>>> On Fri, Sep 9, 2022 at 7:09 AM Manu Zhang <owenzhang1...@gmail.com>
>>> wrote:
>>>
>>>> Thanks Ryan for explanation. Yes, I got it wrong and it’s manifest
>>>> columns rather than data columns. I’ll try your suggestions and get back.
>>>>
>>>> Manu
>>>>
>>>> Ryan Blue <b...@tabular.io>于2022年9月8日 周四03:39写道：
>>>>
>>>>> Manu,
>>>>>
>>>>> The check that you linked to where stats aren’t dropped is when
>>>>> someone is asking for all columns from a manifest file, not when your data
>>>>> query is requesting all columns. In the case of your query, Spark is not
>>>>> asking for stats columns. They will be used for filtering, but will be
>>>>> dropped before passing the DataFile to the scan as a matching result
>>>>> file.
>>>>>
>>>>> I’ll post a more detailed reply on the issue, but when we’ve seen this
>>>>> issue in the past the problem is usually that your planning parallelism is
>>>>> high (based on the environment) and the parallel planning is adding them 
>>>>> to
>>>>> a queue. You can avoid that by setting iceberg.worker.num-threads=2
>>>>> (or something small) or disabling parallel planning by setting
>>>>> iceberg.scan.plan-in-worker-pool=false. Both of those are Java system
>>>>> properties.
>>>>>
>>>>> Ryan
>>>>>
>>>>> On Tue, Sep 6, 2022 at 11:06 PM Manu Zhang <owenzhang1...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> It looks scanning all columns of an iceberg table in Spark could
>>>>>> cause memory issue in the driver by keeping all the stats.
>>>>>>
>>>>>> *select * from iceberg_table limit 10;*
>>>>>>
>>>>>> I also created https://github.com/apache/iceberg/issues/5706 with
>>>>>> more details.
>>>>>> Is there any reason not to drop stats
>>>>>> <https://github.com/apache/iceberg/blob/apache-iceberg-0.13.1/core/src/main/java/org/apache/iceberg/ManifestReader.java#L292>
>>>>>> when columns contain ALL_COLUMNS(*)?
>>>>>>
>>>>>> Thanks,
>>>>>> Manu
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>

-- 
Ryan Blue
Tabular

Re: Spark driver memory issue when scanning all columns of a iceberg table

Reply via email to