Sorry for a typo in the previous reply. The questions is why we don't drop stats when *`rowFilter == Expressions.alwaysTrue()`*
On Mon, Sep 19, 2022 at 9:30 AM Manu Zhang <owenzhang1...@gmail.com> wrote: > Hi all, > > Does anyone know why we don't drop stats when `rowFilter != Expressions. > alwaysTrue()` at > https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/ManifestReader.java#L326 > ? > I tried removing it but all tests passed. > > Thanks, > Manu > > > > On Sat, Sep 10, 2022 at 8:56 PM Manu Zhang <owenzhang1...@gmail.com> > wrote: > >> Ryan, >> >> I found data files did a full copy (deep copy) of all stats from the >> manifest file when rowFilter is true. With a large number of data files, so >> much memory could be taken up by stats like valueCounts. >> I also attached snapshots of the heap dump in the GitHub issue comments. >> Please help confirm. >> >> Thanks, >> Manu >> >> On Fri, Sep 9, 2022 at 7:09 AM Manu Zhang <owenzhang1...@gmail.com> >> wrote: >> >>> Thanks Ryan for explanation. Yes, I got it wrong and it’s manifest >>> columns rather than data columns. I’ll try your suggestions and get back. >>> >>> Manu >>> >>> Ryan Blue <b...@tabular.io>于2022年9月8日 周四03:39写道: >>> >>>> Manu, >>>> >>>> The check that you linked to where stats aren’t dropped is when someone >>>> is asking for all columns from a manifest file, not when your data query is >>>> requesting all columns. In the case of your query, Spark is not asking for >>>> stats columns. They will be used for filtering, but will be dropped before >>>> passing the DataFile to the scan as a matching result file. >>>> >>>> I’ll post a more detailed reply on the issue, but when we’ve seen this >>>> issue in the past the problem is usually that your planning parallelism is >>>> high (based on the environment) and the parallel planning is adding them to >>>> a queue. You can avoid that by setting iceberg.worker.num-threads=2 >>>> (or something small) or disabling parallel planning by setting >>>> iceberg.scan.plan-in-worker-pool=false. Both of those are Java system >>>> properties. >>>> >>>> Ryan >>>> >>>> On Tue, Sep 6, 2022 at 11:06 PM Manu Zhang <owenzhang1...@gmail.com> >>>> wrote: >>>> >>>>> Hi all, >>>>> >>>>> It looks scanning all columns of an iceberg table in Spark could cause >>>>> memory issue in the driver by keeping all the stats. >>>>> >>>>> *select * from iceberg_table limit 10;* >>>>> >>>>> I also created https://github.com/apache/iceberg/issues/5706 with >>>>> more details. >>>>> Is there any reason not to drop stats >>>>> <https://github.com/apache/iceberg/blob/apache-iceberg-0.13.1/core/src/main/java/org/apache/iceberg/ManifestReader.java#L292> >>>>> when columns contain ALL_COLUMNS(*)? >>>>> >>>>> Thanks, >>>>> Manu >>>>> >>>> >>>> >>>> -- >>>> Ryan Blue >>>> Tabular >>>> >>>