shardulm94 opened a new pull request #2694:
URL: https://github.com/apache/iceberg/pull/2694
One of our Spark apps using Iceberg started reporting huge GC pauses and
eventually got killed by YARN due to OOM. Looking at a jmap histogram, we found
that Iceberg was creating far too many scan tasks:
```
 num     #instances         #bytes  class name
----------------------------------------------
   1:      29334684      938709888  org.apache.iceberg.BaseFileScanTask$SplitScanTask
   2:      29334673      704032152  [Lorg.apache.iceberg.FileScanTask;
   3:      29334673      469354768  org.apache.iceberg.BaseCombinedScanTask
   4:          8964      125336304  [Ljava.lang.Object;
   5:         47486        7061560  [C
   6:         15733        1723088  java.lang.Class
```
Turns out this was caused by an integer overflow when the user passed in
`split-size`. The user-provided value was `2048 * 1024 * 1024`, which overflowed
to `-2147483648`. Since it does not make sense for these split planning
parameters to be negative, I added some validation checks. One thing I am not
too sure about: should we allow the split open file cost to be negative? Since
it is a cost, it could theoretically be negative, but I don't see any practical
use case for that.
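For context, a minimal Java sketch of how the overflow happens on the caller side and
of the kind of non-negativity checks this PR adds. The `validateSplitPlanningParams`
helper below is illustrative only, not the actual Iceberg code:

```java
public class SplitSizeOverflowDemo {
  public static void main(String[] args) {
    // 2048 * 1024 * 1024 is evaluated in 32-bit int arithmetic and wraps around.
    int overflowed = 2048 * 1024 * 1024;
    System.out.println(overflowed); // prints -2147483648

    // Using a long literal avoids the overflow when computing the option value.
    long splitSize = 2048L * 1024 * 1024;
    System.out.println(splitSize); // prints 2147483648

    // Illustrative validation of split planning parameters (not the actual Iceberg code).
    validateSplitPlanningParams(splitSize, 4L * 1024 * 1024, 10);
  }

  // Hypothetical helper mirroring the intent of the new checks.
  static void validateSplitPlanningParams(long splitSize, long openFileCost, int lookback) {
    if (splitSize <= 0) {
      throw new IllegalArgumentException("Split size must be > 0: " + splitSize);
    }
    if (openFileCost < 0) {
      throw new IllegalArgumentException("File open cost must be >= 0: " + openFileCost);
    }
    if (lookback <= 0) {
      throw new IllegalArgumentException("Split planning lookback must be > 0: " + lookback);
    }
  }
}
```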
Thanks @venkata91, who initially figured out the issue in our ecosystem.