Re: Spark driver memory issue when scanning all columns of a iceberg table

Manu Zhang Wed, 21 Sep 2022 18:44:24 -0700

Thanks Ryan, I've submitted a PR
<https://github.com/apache/iceberg/issues/5706> to fix it. Please help
review.


On Tue, Sep 20, 2022 at 2:20 AM Ryan Blue <b...@tabular.io> wrote:

> I just did some digging and I think that the reason for checking whether
> the row filter is always true is to match the logic when we add stats
> columns. We only add stats columns if the row filter is non-trivial (not
> always true) so we probably carried that over when removing stats columns.
> If tests are passing without this, then I think it would be safe to drop
> that requirement.
>
> Ryan
>
> On Sun, Sep 18, 2022 at 6:33 PM Manu Zhang <owenzhang1...@gmail.com>
> wrote:
>
>> Sorry for a typo in the previous reply.
>>
>> The questions is why we don't drop stats when
>> *`rowFilter == Expressions.alwaysTrue()`*
>>
>> On Mon, Sep 19, 2022 at 9:30 AM Manu Zhang <owenzhang1...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> Does anyone know why we don't drop stats when `rowFilter != Expressions.
>>> alwaysTrue()` at
>>> https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/ManifestReader.java#L326
>>> ?
>>> I tried removing it but all tests passed.
>>>
>>> Thanks,
>>> Manu
>>>
>>>
>>>
>>> On Sat, Sep 10, 2022 at 8:56 PM Manu Zhang <owenzhang1...@gmail.com>
>>> wrote:
>>>
>>>> Ryan,
>>>>
>>>> I found data files did a full copy (deep copy) of all stats from the
>>>> manifest file when rowFilter is true. With a large number of data files, so
>>>> much memory could be taken up by stats like valueCounts.
>>>> I also attached snapshots of the heap dump in the GitHub issue
>>>> comments. Please help confirm.
>>>>
>>>> Thanks,
>>>> Manu
>>>>
>>>> On Fri, Sep 9, 2022 at 7:09 AM Manu Zhang <owenzhang1...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks Ryan for explanation. Yes, I got it wrong and it’s manifest
>>>>> columns rather than data columns. I’ll try your suggestions and get back.
>>>>>
>>>>> Manu
>>>>>
>>>>> Ryan Blue <b...@tabular.io>于2022年9月8日 周四03:39写道：
>>>>>
>>>>>> Manu,
>>>>>>
>>>>>> The check that you linked to where stats aren’t dropped is when
>>>>>> someone is asking for all columns from a manifest file, not when your 
>>>>>> data
>>>>>> query is requesting all columns. In the case of your query, Spark is not
>>>>>> asking for stats columns. They will be used for filtering, but will be
>>>>>> dropped before passing the DataFile to the scan as a matching result
>>>>>> file.
>>>>>>
>>>>>> I’ll post a more detailed reply on the issue, but when we’ve seen
>>>>>> this issue in the past the problem is usually that your planning
>>>>>> parallelism is high (based on the environment) and the parallel planning 
>>>>>> is
>>>>>> adding them to a queue. You can avoid that by setting
>>>>>> iceberg.worker.num-threads=2 (or something small) or disabling
>>>>>> parallel planning by setting iceberg.scan.plan-in-worker-pool=false.
>>>>>> Both of those are Java system properties.
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>> On Tue, Sep 6, 2022 at 11:06 PM Manu Zhang <owenzhang1...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> It looks scanning all columns of an iceberg table in Spark could
>>>>>>> cause memory issue in the driver by keeping all the stats.
>>>>>>>
>>>>>>> *select * from iceberg_table limit 10;*
>>>>>>>
>>>>>>> I also created https://github.com/apache/iceberg/issues/5706 with
>>>>>>> more details.
>>>>>>> Is there any reason not to drop stats
>>>>>>> <https://github.com/apache/iceberg/blob/apache-iceberg-0.13.1/core/src/main/java/org/apache/iceberg/ManifestReader.java#L292>
>>>>>>> when columns contain ALL_COLUMNS(*)?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Manu
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>>
>>>>>
>
> --
> Ryan Blue
> Tabular
>

Re: Spark driver memory issue when scanning all columns of a iceberg table

Reply via email to