rdblue commented on issue #5706:
URL: https://github.com/apache/iceberg/issues/5706#issuecomment-1239800060

   @manuzhang, are you sure that selecting a specific set of columns resolves 
the issue? The logic in the manifest reader checks the set of projected 
manifest columns, not data columns. The manifest columns that are projected are 
controlled by the 
[`ManifestGroup`](https://github.com/apache/iceberg/blob/c0f175b517edef5dd1b06a5ac7890fb35224bc25/core/src/main/java/org/apache/iceberg/ManifestGroup.java)
 and are [set to a subset for a table 
scan](https://github.com/apache/iceberg/blob/c0f175b517edef5dd1b06a5ac7890fb35224bc25/core/src/main/java/org/apache/iceberg/BaseScan.java#L94).
 The files returned by scan planning should not contain stats because 
[`includeColumnStats`](https://github.com/apache/iceberg/blob/c0f175b517edef5dd1b06a5ac7890fb35224bc25/core/src/main/java/org/apache/iceberg/BaseScan.java#L133)
 is not called by Spark.
   
   We've had a report of this before, where the table had a huge number of 
small files that matched the query filter. Since the query filter is `true` 
here, you're probably in a similar case: parallel job planning is producing a 
ton of data files. Because jobs are planned in parallel, we put the results on 
a queue, where they sit until they are consumed. You can work around this by 
disabling parallel planning or, hopefully, by reducing the number of worker 
threads using the [Java system 
properties](https://github.com/apache/iceberg/blob/c0f175b517edef5dd1b06a5ac7890fb35224bc25/core/src/main/java/org/apache/iceberg/SystemProperties.java).
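
   As a rough sketch of that workaround, the snippet below sets the two 
planning-related system properties from code. The property names 
`iceberg.scan.plan-in-worker-pool` and `iceberg.worker.num-threads` are what I 
believe `SystemProperties` defines, so please verify them against the linked 
file for your version. The `PlanningProps` class and its `configure` method 
are hypothetical helpers for illustration, not part of Iceberg.

```java
public class PlanningProps {
    // Assumed property names; check SystemProperties.java in your Iceberg version.
    static final String PLAN_IN_POOL = "iceberg.scan.plan-in-worker-pool";
    static final String WORKER_THREADS = "iceberg.worker.num-threads";

    /**
     * Sets the JVM system properties that are assumed to control parallel
     * scan planning and the size of the shared worker pool.
     */
    static void configure(boolean parallelPlanning, int workerThreads) {
        if (workerThreads < 1) {
            throw new IllegalArgumentException("workerThreads must be >= 1");
        }
        System.setProperty(PLAN_IN_POOL, Boolean.toString(parallelPlanning));
        System.setProperty(WORKER_THREADS, Integer.toString(workerThreads));
    }

    public static void main(String[] args) {
        // Disable parallel planning and keep the worker pool small.
        configure(false, 2);
        System.out.println(PLAN_IN_POOL + "=" + System.getProperty(PLAN_IN_POOL));
        System.out.println(WORKER_THREADS + "=" + System.getProperty(WORKER_THREADS));
    }
}
```

   These values are probably read once when the relevant Iceberg classes are 
initialized, so setting them this late may have no effect. Passing them as 
`-D` flags to the JVM that plans the scan (for Spark, the driver, for example 
through `spark.driver.extraJavaOptions`) is likely the safer route.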


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

