[GitHub] [arrow-datafusion] DDtKey opened a new issue, #5162: Ignoring of memory-pool limits & OOM on large cartesian-product join

via GitHub Tue, 14 Mar 2023 06:56:45 -0700


DDtKey opened a new issue, #5162:
URL: https://github.com/apache/arrow-datafusion/issues/5162


   **Describe the bug**
   With a memory pool configured, a query that needs more memory than the limit can end in an OOM instead of a `ResourcesExhausted` error.
   This is probably related to the use of unbounded channels (which I believe should be avoided).
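
To illustrate why unbounded channels can defeat a memory limit, here is a minimal sketch using only the Rust standard library. It is not DataFusion code: the issue does not say which channel implementation is involved, and the function names and the 1 KiB "batch" size are hypothetical. It only shows the general difference: an unbounded channel accepts every send while the consumer is idle, whereas a bounded one rejects sends once its capacity is reached.

```rust
use std::sync::mpsc::{channel, sync_channel, TrySendError};

/// Pushes `n` batches into an unbounded channel while nothing is consuming.
/// Every send succeeds, so all `n` batches are held in memory at once.
/// Returns how many batches were buffered.
fn buffered_with_unbounded(n: usize) -> usize {
    let (tx, rx) = channel::<Vec<u8>>();
    for _ in 0..n {
        // Hypothetical 1 KiB batch; an unbounded send never blocks or fails
        // while the receiver is alive.
        tx.send(vec![0u8; 1024]).expect("receiver is still alive");
    }
    drop(tx); // close the channel so the iterator below terminates
    rx.iter().count()
}

/// Tries to push `n` batches into a bounded channel with room for `capacity`
/// batches while nothing is consuming.
/// Returns how many sends were accepted before the channel reported "full".
fn accepted_with_bounded(n: usize, capacity: usize) -> usize {
    // `_rx` is kept alive so the channel is not reported as disconnected.
    let (tx, _rx) = sync_channel::<Vec<u8>>(capacity);
    let mut accepted = 0;
    for _ in 0..n {
        match tx.try_send(vec![0u8; 1024]) {
            Ok(()) => accepted += 1,
            // The producer is told to stop: this is the backpressure signal.
            Err(TrySendError::Full(_)) => break,
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    accepted
}
```

With 1000 batches and an idle consumer, `buffered_with_unbounded(1000)` holds all 1000 batches, while `accepted_with_bounded(1000, 8)` stops after 8. Memory held in an unbounded queue is also not visible to a memory pool unless the producer reserves it explicitly.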
   
   **To Reproduce**
   
   MRE in which the memory pool limit is ignored on a large Cartesian product:
   
   Example CSV file (250 MB): [GDrive link](https://drive.google.com/file/d/1q_-p8BvvO2w-0IH7SyxvDIOYK44yQIKt/view?usp=share_link)
    - it's a random file, and the column used as the join key has the same value in every record (so the join is a Cartesian product)
   
   Memory pool limit: `FairSpillPool::new(4 * 1024 * 1024 * 1024)`
   
   SQL:
   `SELECT * FROM rnd rnd1 JOIN rnd rnd2 ON rnd1."s3_drive" = rnd2."s3_drive"`
   
   **Expected behavior**
   
   It should return a `ResourcesExhausted` error when a `MemoryPool` is configured
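
The expected behavior can be sketched with a toy pool. This is a simplified, hypothetical model and not DataFusion's `MemoryPool` or `FairSpillPool` implementation: `ToyPool` and its `ResourcesExhausted` struct are made up for illustration. The point is only that a reservation which would exceed the limit is refused with an error instead of being allowed to grow until the process is killed.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical error type standing in for a "resources exhausted" error.
#[derive(Debug, PartialEq)]
struct ResourcesExhausted {
    requested: u64,
    available: u64,
}

/// Toy memory pool: tracks reserved bytes against a fixed limit.
struct ToyPool {
    limit: u64,
    used: AtomicU64,
}

impl ToyPool {
    fn new(limit: u64) -> Self {
        ToyPool { limit, used: AtomicU64::new(0) }
    }

    /// Reserves `additional` bytes, or fails without changing the pool
    /// if the reservation would push usage past the limit.
    fn try_grow(&self, additional: u64) -> Result<(), ResourcesExhausted> {
        self.used
            .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |used| {
                // `None` aborts the update, leaving `used` untouched.
                used.checked_add(additional).filter(|n| *n <= self.limit)
            })
            .map(|_| ())
            .map_err(|used| ResourcesExhausted {
                requested: additional,
                available: self.limit - used,
            })
    }
}
```

For example, with the 4 GiB limit from this report, `ToyPool::new(4 * 1024 * 1024 * 1024)` accepts a 3 GiB reservation and then rejects a further 2 GiB one, reporting 1 GiB as still available.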
   
   **Additional context**
   
   Part of this was described in the discussion here: https://github.com/apache/arrow-datafusion/issues/5108#issuecomment-1412491794, but that discussion was about a regression.
   This example isn't a regression: it is also reproducible on older versions.


