rdblue commented on issue #5706: URL: https://github.com/apache/iceberg/issues/5706#issuecomment-1239800060
@manuzhang, are you sure that selecting a specific set of columns resolves the issue? The logic in the manifest reader checks the set of projected manifest columns, not data columns. The manifest columns that are projected are controlled by the [`ManifestGroup`](https://github.com/apache/iceberg/blob/c0f175b517edef5dd1b06a5ac7890fb35224bc25/core/src/main/java/org/apache/iceberg/ManifestGroup.java) and are [set for a table scan to a subset](https://github.com/apache/iceberg/blob/c0f175b517edef5dd1b06a5ac7890fb35224bc25/core/src/main/java/org/apache/iceberg/BaseScan.java#L94). The files returned by scan planning should not contain stats because [`includeColumnStats`](https://github.com/apache/iceberg/blob/c0f175b517edef5dd1b06a5ac7890fb35224bc25/core/src/main/java/org/apache/iceberg/BaseScan.java#L133) is not called by Spark.

We've had a report of this before, where the table had a huge number of small files that matched the query filter. Since the query filter is `true` here, you're probably in a similar case: parallel job planning is producing a very large number of data files. Because jobs are planned in parallel, the results are put on a queue, where they sit until they are consumed.

You can work around this by disabling parallel planning or, hopefully, by reducing the number of worker threads using the [Java system properties](https://github.com/apache/iceberg/blob/c0f175b517edef5dd1b06a5ac7890fb35224bc25/core/src/main/java/org/apache/iceberg/SystemProperties.java).
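As a rough sketch of that workaround, the snippet below builds the JVM flags that would be passed to the Spark driver. It assumes the property names are `iceberg.worker.num-threads` and `iceberg.scan.plan-in-worker-pool`; please confirm them against the linked `SystemProperties` class for your Iceberg version. The `PlanningProps` class and `driverJavaOptions` helper are hypothetical names used only for illustration.

```java
public class PlanningProps {
  // Assumed property names; verify against SystemProperties in your Iceberg version.
  static final String WORKER_THREADS = "iceberg.worker.num-threads";
  static final String PLAN_IN_WORKER_POOL = "iceberg.scan.plan-in-worker-pool";

  // Builds the -D flags for the driver JVM (hypothetical helper).
  static String driverJavaOptions(int workerThreads, boolean planInWorkerPool) {
    return "-D" + WORKER_THREADS + "=" + workerThreads
        + " -D" + PLAN_IN_WORKER_POOL + "=" + planInWorkerPool;
  }

  public static void main(String[] args) {
    // Example: shrink the worker pool to 2 threads and turn off parallel planning.
    System.out.println("spark.driver.extraJavaOptions=" + driverJavaOptions(2, false));
  }
}
```

The flags need to be present when the driver JVM starts, since system properties like these are typically read once during class initialization. Setting them with `System.setProperty` after Iceberg classes have loaded may have no effect.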
