westonpace commented on issue #40224:
URL: https://github.com/apache/arrow/issues/40224#issuecomment-1991754534
> Sorry, I'm a bit lost here; what are the implications of write versus sink
here?
I suppose it is more about "two plans" vs. "one plan".
A plan's output can be a record batch reader (the final node is a sink
node). A plan's input can be a record batch reader. A write plan has no
output (the final node is a write node).
So, to scan and rewrite a dataset, you can have two plans:
Plan 1: Scan(files) -> Project -> Sink
Plan 2: Scan(record_batch_reader) -> Write
Or you can do it all in one plan:
Combined: Scan(files) -> Project -> Write
In python, since there is no equivalent of dplyr, the user cannot write a
"single statement" like `open_dataset |> mutate(...) |> write_dataset(...)`.
Instead the user has to do something like...
```
projected = dataset.to_reader(...) # Creates an acero plan
ds.write_dataset(projected) # Creates a second acero plan
```
However, since you have a single dplyr plan in R, it might be possible to
make a single acero plan. That being said, I don't expect it to have much
impact.
> I was just looking at free/available memory - would RSS of the process be
better?
When writing a dataset the server will generally use all available memory.
This is because a "write" call just copies data from the process RSS to the
kernel page cache. It doesn't wait and block until all the data is persisted
to disk. So you will generally see `it gradually uses more and more of my RAM`
as expected behavior. If you are looking at the output of `free`:
```
total used free shared buff/cache
available
Mem: 31Gi 9.8Gi 10Gi 262Mi 11Gi
20Gi
Swap: 63Gi 5.5Gi 58Gi
```
You should see `free` drop to 0 and `buff/cache` grow to consume all RAM.
This is normal.
However, you should not see the RSS of the process increase without bound.
You should not see `used` increase without bound. You also shouldn't see
`Program terminated with signal SIGKILL, Killed.`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]