amoeba commented on issue #37495:
URL: https://github.com/apache/arrow/issues/37495#issuecomment-1739743649

   I know we've seen memory issues with PyArrow code like this, and I suspect the 
R package uses a similar code path:
   
   ```python
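   import duckdb
   import pyarrow.dataset as ds
   from pyarrow import fs
   conn = duckdb.connect()  # assumed: a default in-memory DuckDB connection
   table_ds = ds.dataset([path_to_parquet_file], filesystem=fs.LocalFileSystem())
   conn.from_arrow(table_ds).limit(some_n).arrow()  # some_n: placeholder row limit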
   ```
   
   `from_arrow` creates a Scanner on the Arrow side, and the leak-like behavior 
only happens when the `limit()` clause is used. This is because `limit()` 
triggers cancellation of the Scanner, and the Scanner isn't cancellation-aware. 
I think this is related to https://github.com/apache/arrow/issues/20338.
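   
   As a rough illustration of why a consumer-side limit can look like a leak 
when the producer ignores cancellation, here is a toy, stdlib-only model. 
`ToyScanner`, `leftover_batches`, and the readahead behavior are all 
hypothetical names and assumptions for this sketch; it is not how Arrow's 
Scanner or DuckDB is actually implemented.

```python
from collections import deque


class ToyScanner:
    """Hypothetical stand-in for a readahead scanner (not Arrow's API)."""

    def __init__(self, total_batches, cancel_aware):
        self.pending = deque(range(total_batches))  # read tasks already scheduled
        self.buffered = deque()                     # batches read but not yet consumed
        self.cancel_aware = cancel_aware
        self.cancelled = False

    def next_batch(self):
        # Run one scheduled read if nothing is buffered, then hand a batch over.
        if not self.buffered:
            self.buffered.append(self.pending.popleft())
        return self.buffered.popleft()

    def cancel(self):
        self.cancelled = True

    def run_scheduled_reads(self):
        # A cancel-aware scanner drops its remaining tasks once cancelled;
        # a cancel-unaware one runs every task and buffers the results.
        if self.cancel_aware and self.cancelled:
            self.pending.clear()
        while self.pending:
            self.buffered.append(self.pending.popleft())


def leftover_batches(total_batches, limit, cancel_aware):
    """Return how many batches stay buffered after a consumer stops at `limit`."""
    scanner = ToyScanner(total_batches, cancel_aware)
    for _ in range(limit):
        scanner.next_batch()
    scanner.cancel()               # what a limit() would trigger in this model
    scanner.run_scheduled_reads()  # outstanding work still gets a chance to run
    return len(scanner.buffered)


print(leftover_batches(100, 5, cancel_aware=False))  # 95
print(leftover_batches(100, 5, cancel_aware=True))   # 0
```

   In this toy model the cancel-unaware case leaves 95 batches buffered with no 
consumer, which is the kind of growth that can resemble a leak.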
   
   Before filing an issue with DuckDB, I can take a look to see whether the 
same thing is plausibly happening here.

