[ 
https://issues.apache.org/jira/browse/ARROW-13611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398306#comment-17398306
 ] 

Weston Pace commented on ARROW-13611:
-------------------------------------

Additional information.

The synchronous scanner properly applies back pressure when the source has 
multiple scan tasks (e.g. Parquet).  However, it does not apply back pressure 
when a source has a single scan task (e.g. CSV / IPC).  The root cause is here: 
https://github.com/apache/arrow/blob/f959141ece4d660bce5f7fa545befc0116a7db79/cpp/src/arrow/dataset/scanner.cc#L207

The asynchronous scanner is not applying back pressure because it is now 
connected to an ExecPlan and ExecPlan does not have back pressure implemented 
yet.

I'm not sure how much effort we want to put into fixing the synchronous scanner 
given that it is slated for removal.
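
For anyone unfamiliar with the term, here is a rough, stdlib-only sketch of 
what "back pressure" means in this context.  It is not Arrow code and does not 
reflect how either scanner is implemented; the names (scan, max_queued, 
consume_delay) are made up for illustration.  A reader thread pushes batches 
into a queue and a slower consumer drains it.  With a bounded queue the reader 
blocks once the queue is full, so the number of in-flight batches stays capped. 
With an unbounded queue nothing slows the reader down, which is roughly the 
behavior being reported here.

```python
import queue
import threading
import time


def scan(num_batches, max_queued, consume_delay=0.001):
    """Simulate a reader thread feeding a slower consumer.

    max_queued is the queue capacity; 0 means unbounded (no back pressure).
    Returns (batches_consumed, peak_queue_depth_seen_by_reader).
    """
    batches = queue.Queue(maxsize=max_queued)
    stats = {"peak": 0}

    def reader():
        for i in range(num_batches):
            # With a bounded queue, put() blocks while the queue is full.
            # That blocking is the back pressure.
            batches.put(i)
            stats["peak"] = max(stats["peak"], batches.qsize())
        batches.put(None)  # sentinel: no more batches

    thread = threading.Thread(target=reader)
    thread.start()

    consumed = 0
    while batches.get() is not None:
        time.sleep(consume_delay)  # pretend the consumer is slow
        consumed += 1
    thread.join()
    return consumed, stats["peak"]


if __name__ == "__main__":
    print("bounded:  ", scan(200, max_queued=4))
    print("unbounded:", scan(200, max_queued=0))
```

In the bounded case the reported peak can never exceed max_queued.  In the 
unbounded case the peak depends on thread scheduling, but it will typically 
grow toward the total number of batches because the reader never has to wait.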

> [C++][Python] Scanning datasets in pyarrow does not enforce back pressure
> -------------------------------------------------------------------------
>
>                 Key: ARROW-13611
>                 URL: https://issues.apache.org/jira/browse/ARROW-13611
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 4.0.0, 5.0.0, 4.0.1
>            Reporter: Weston Pace
>            Priority: Major
>             Fix For: 6.0.0
>
>
> At the moment I'm not sure if the issue is in the C++ layer or the Python 
> layer.  I have a simple test case where I scan the batches of a 4GB dataset 
> and print out the currently allocated memory:
> {code:python}
> import pyarrow as pa
> import pyarrow.dataset as ds
> dataset = ds.dataset('/home/pace/dev/data/dataset/csv/5_big', format='csv')
> num_rows = 0
> for batch in dataset.to_batches():
>     print(pa.total_allocated_bytes())
>     num_rows += batch.num_rows
> print(num_rows)
> {code}
> In pyarrow 3.0.0 this consumes just over 5MB.  In pyarrow 4.0.0 and 5.0.0 
> this consumes multiple GB of RAM.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
