[GitHub] [iceberg] rdblue commented on issue #5706: Not dropping stats when scanning all columns caused Spark driver exceeding GC limit


rdblue commented on issue #5706:
URL: https://github.com/apache/iceberg/issues/5706#issuecomment-1239800060

@manuzhang, are you sure that selecting a specific set of columns resolves
the issue? The logic in the manifest reader checks the set of projected
manifest columns, not data columns. The manifest columns that are projected are
controlled by the
[`ManifestGroup`](https://github.com/apache/iceberg/blob/c0f175b517edef5dd1b06a5ac7890fb35224bc25/core/src/main/java/org/apache/iceberg/ManifestGroup.java)
and are [set for a table scan to a
subset](https://github.com/apache/iceberg/blob/c0f175b517edef5dd1b06a5ac7890fb35224bc25/core/src/main/java/org/apache/iceberg/BaseScan.java#L94).
The files returned by scan planning should not contain stats because
[`includeColumnStats`](https://github.com/apache/iceberg/blob/c0f175b517edef5dd1b06a5ac7890fb35224bc25/core/src/main/java/org/apache/iceberg/BaseScan.java#L133)
is not called by Spark.

We've had a report of this before, where the table had a huge number of
small files that matched the query filter. Since the query filter is `true`
here, you're probably in a similar case: parallel job planning is producing a
very large number of data files. Because planning runs in parallel, the results
are put on a queue, where they sit until they are consumed. You can work around
this by disabling parallel planning or, hopefully, by reducing the number of
worker threads using the [Java system
properties](https://github.com/apache/iceberg/blob/c0f175b517edef5dd1b06a5ac7890fb35224bc25/core/src/main/java/org/apache/iceberg/SystemProperties.java).
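
As a rough sketch of that workaround, the snippet below sets the two JVM system properties before any planning happens. The property names `iceberg.worker.num-threads` and `iceberg.scan.plan-in-worker-pool` are my reading of `SystemProperties.java` and should be checked against the Iceberg version in use; the class `PlanningProps` and the method `limitPlanning` are hypothetical names used only for illustration. The sketch also assumes the values are read once when Iceberg's worker pool is initialized, so they would need to be set before any Iceberg class is loaded.

```java
// Hypothetical helper, not part of Iceberg: sets the JVM system properties
// that are assumed to control parallel scan planning.
final class PlanningProps {
  // Assumed property names; verify against SystemProperties.java in your version.
  static final String WORKER_THREADS = "iceberg.worker.num-threads";
  static final String PLAN_IN_WORKER_POOL = "iceberg.scan.plan-in-worker-pool";

  private PlanningProps() {
  }

  // Call early in driver startup, before any Iceberg classes are touched.
  static void limitPlanning(int workerThreads, boolean planInParallel) {
    if (workerThreads < 1) {
      throw new IllegalArgumentException("workerThreads must be >= 1");
    }
    System.setProperty(WORKER_THREADS, Integer.toString(workerThreads));
    System.setProperty(PLAN_IN_WORKER_POOL, Boolean.toString(planInParallel));
  }

  public static void main(String[] args) {
    // Example: two worker threads, parallel planning disabled.
    limitPlanning(2, false);
    System.out.println(WORKER_THREADS + "=" + System.getProperty(WORKER_THREADS));
    System.out.println(PLAN_IN_WORKER_POOL + "=" + System.getProperty(PLAN_IN_WORKER_POOL));
  }
}
```

For a Spark job, it is probably simpler to pass the same values as `-D` flags on the driver JVM (for example through `spark.driver.extraJavaOptions`) so that they are in place before the driver starts.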

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
