westonpace commented on issue #40224:
URL: https://github.com/apache/arrow/issues/40224#issuecomment-1991754534

   > Sorry, I'm a bit lost here; what are the implications of write versus sink 
here?
   
   I suppose it is more about "two plans" vs. "one plan".
   
   A plan's output can be a record batch reader (the final node is a sink 
node).  A plan's input can be a record batch reader.  A write plan has no 
output (the final node is a write node).
   
   So, to scan and rewrite a dataset, you can have two plans:
   
   Plan 1: Scan(files) -> Project -> Sink
   Plan 2: Scan(record_batch_reader) -> Write
   
   Or you can do it all in one plan:
   
   Combined: Scan(files) -> Project -> Write
   
   In python, since there is no equivalent of dplyr, the user cannot write a 
"single statement" like `open_dataset |> mutate(...) |> write_dataset(...)`.  
Instead the user has to do something like...
   
   ```
   projected = dataset.to_reader(...) # Creates an acero plan
   ds.write_dataset(projected) # Creates a second acero plan
   ```
   
   However, since you have a single dplyr plan in R, it might be possible to 
make a single acero plan.  That being said, I don't expect it to have much 
impact.
   
   > I was just looking at free/available memory - would RSS of the process be 
better?
   
   When writing a dataset the server will generally use all available memory.  
This is because a "write" call just copies data from the process RSS to the 
kernel page cache.  It doesn't wait and block until all the data is persisted 
to disk.  So you will generally see `it gradually uses more and more of my RAM` 
as expected behavior.  If you are looking at the output of `free`:
   
   ```
                  total        used        free      shared  buff/cache   
available
   Mem:            31Gi       9.8Gi        10Gi       262Mi        11Gi        
20Gi
   Swap:           63Gi       5.5Gi        58Gi
   ```
   
   You should see `free` drop to 0 and `buff/cache` grow to consume all RAM.  
This is normal.
   
   However, you should not see the RSS of the process increase without bound.   
You should not see `used` increase without bound.  You also shouldn't see 
`Program terminated with signal SIGKILL, Killed.`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to