Thanks Ryan for the explanation. Yes, I got it wrong: it's manifest columns rather than data columns. I'll try your suggestions and get back.
Manu

Ryan Blue <b...@tabular.io> wrote on Thu, Sep 8, 2022 at 03:39:

> Manu,
>
> The check that you linked to, where stats aren't dropped, applies when someone is asking for all columns from a manifest file, not when your data query is requesting all columns. In the case of your query, Spark is not asking for stats columns. They will be used for filtering, but will be dropped before passing the DataFile to the scan as a matching result file.
>
> I'll post a more detailed reply on the issue, but when we've seen this issue in the past, the problem is usually that your planning parallelism is high (based on the environment) and the parallel planning is adding them to a queue. You can avoid that by setting iceberg.worker.num-threads=2 (or something small) or by disabling parallel planning with iceberg.scan.plan-in-worker-pool=false. Both of those are Java system properties.
>
> Ryan
>
> On Tue, Sep 6, 2022 at 11:06 PM Manu Zhang <owenzhang1...@gmail.com> wrote:
>
>> Hi all,
>>
>> It looks like scanning all columns of an Iceberg table in Spark can cause a memory issue in the driver by keeping all the stats.
>>
>> *select * from iceberg_table limit 10;*
>>
>> I also created https://github.com/apache/iceberg/issues/5706 with more details.
>> Is there any reason not to drop stats
>> <https://github.com/apache/iceberg/blob/apache-iceberg-0.13.1/core/src/main/java/org/apache/iceberg/ManifestReader.java#L292>
>> when columns contain ALL_COLUMNS(*)?
>>
>> Thanks,
>> Manu
>
>
> --
> Ryan Blue
> Tabular
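
For anyone else following this thread, here is a minimal sketch of one way to pass those Java system properties to the Spark driver, where planning runs. It assumes spark-submit and the standard spark.driver.extraJavaOptions mechanism; adjust for your own deployment, and the trailing "..." stands for the rest of your usual submit arguments.

    # cap the Iceberg worker pool used for parallel scan planning
    spark-submit \
      --conf "spark.driver.extraJavaOptions=-Diceberg.worker.num-threads=2" \
      ...

    # or disable parallel planning entirely
    spark-submit \
      --conf "spark.driver.extraJavaOptions=-Diceberg.scan.plan-in-worker-pool=false" \
      ...

If you already set other driver JVM options, append the -D flags to the existing spark.driver.extraJavaOptions value rather than overwriting it.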