shardulm94 opened a new pull request #2694:
URL: https://github.com/apache/iceberg/pull/2694


   One of our Spark apps using Iceberg started reporting huge GC pauses and was eventually killed by YARN due to OOM. Looking at a jmap heap histogram, we found that Iceberg was creating far too many scan tasks:
   ```
   num     #instances         #bytes  class name
   ----------------------------------------------
      1:      29334684      938709888  
org.apache.iceberg.BaseFileScanTask$SplitScanTask
      2:      29334673      704032152  [Lorg.apache.iceberg.FileScanTask;
      3:      29334673      469354768  org.apache.iceberg.BaseCombinedScanTask
      4:          8964      125336304  [Ljava.lang.Object;
      5:         47486        7061560  [C
      6:         15733        1723088  java.lang.Class
   ```
   Turns out this was caused by an integer overflow when the user passed in `split-size`. The user-provided value was `2048 * 1024 * 1024`, which overflows 32-bit integer arithmetic and becomes `-2147483648`. Since it really does not make sense for these split planning parameters to be negative, I added some checks. One thing I am not too sure about: should we allow the split open file cost to be negative? Since it is a cost, it could theoretically be negative, but I don't see any practical use case for that.
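
   For context, here is a minimal standalone sketch of the overflow and of the kind of positive-value check this PR adds (the class and helper names are illustrative, not the actual Iceberg code; the long-literal fix assumes the value was computed with Java/Scala int arithmetic on the caller side):
   ```java
   public class SplitSizeOverflow {
     public static void main(String[] args) {
       // 2048 * 1024 * 1024 == 2^31, one past Integer.MAX_VALUE, so 32-bit int
       // arithmetic silently wraps around to Integer.MIN_VALUE
       int overflowed = 2048 * 1024 * 1024;
       System.out.println(overflowed);      // -2147483648

       // a long literal avoids the overflow when computing the value
       long splitSize = 2048L * 1024 * 1024;
       System.out.println(splitSize);       // 2147483648

       // reject non-positive split planning parameters instead of silently
       // planning with them; this throws because 'overflowed' is negative
       checkPositive("split-size", overflowed);
     }

     private static void checkPositive(String name, long value) {
       if (value <= 0) {
         throw new IllegalArgumentException(name + " must be > 0: " + value);
       }
     }
   }
   ```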
    
   Thanks @venkata91, who initially figured out the issue in our ecosystem.

