alamb commented on code in PR #14766:
URL: https://github.com/apache/datafusion/pull/14766#discussion_r1961557419


##########
datafusion-cli/src/main.rs:
##########
@@ -121,6 +121,13 @@ struct Args {
     )]
     maxrows: MaxRows,
 
+    #[clap(
+        short,
+        long,
+        help = "Whether to stop early when max rows is reached, this will help 
reduce the memory usage when the result is large"

Review Comment:
   I think this option is basically the same as adding a `LIMIT max_rows` to 
the query and thus is somewhat redundant



##########
datafusion-cli/src/exec.rs:
##########
@@ -249,8 +255,21 @@ pub(super) async fn exec_and_print(
         } else {
             // Bounded stream; collected results are printed after all input 
consumed.
             let schema = physical_plan.schema();
-            let results = collect(physical_plan, task_ctx.clone()).await?;
+            let mut stream = execute_stream(physical_plan, task_ctx.clone())?;
+            let mut results = vec![];
+            while let Some(batch) = stream.next().await {
+                let batch = batch?;
+                reservation.try_grow(get_record_batch_memory_size(&batch))?;
+                results.push(batch);
+                if let MaxRows::Limited(max_rows) = print_options.maxrows {
+                    // Stop collecting results if the number of rows exceeds 
the limit
+                    if results.len() >= max_rows && 
print_options.stop_after_max_rows {

Review Comment:
   If you break here it will stop the plan early -- then you need to add 
`stop_after_max_rows` to permit the plan to run to completion, however, if that 
option is set then the data is buffered again (which from a memory perspective 
is the same as increasing the max_rows)
   
   Rather than `break`ing here, I recommend just ignoring the batches after 
`max_rows` has been reached (aka don't `push` them into `results`) -- I think 
that would be the most intuitive behavior:
   1. Queries would still always run to completion
   2. Max rows would control how much memory was used by datafusion-cli 
buffering
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to