Weston Pace created ARROW-12811:
-----------------------------------
Summary: [C++] [Dataset] Dataset repartition / filter / update
Key: ARROW-12811
URL: https://issues.apache.org/jira/browse/ARROW-12811
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Weston Pace
This feature would add support for an "update" workflow that scans a set of
batches and writes them (potentially filtered/modified) back out to the same
place.
The existing dataset read / dataset write features wouldn't work because they
would append the data rather than replace it.
There is some discussion in ARROW-12358 and ARROW-12509 of an "overwrite mode"
but an "overwrite partition" workflow wouldn't work unless you can scan in
entire partitions at once (and in general this should probably be avoided).
A naive "write to a different directory and rename" approach could work but it
would be inefficient since it would require a copy of the entire dataset to
modify a small part of it.
The feature could be implemented using temporary directories in place that get
renamed on top of the existing directory at the end. Files that are unchanged
would be moved into the temporary directory instead of copied.
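As a rough illustration of the temporary-directory idea above, here is a minimal
sketch in plain Python using only the standard library. It is not Arrow code and
does not reflect any existing or planned Arrow API: the function name
update_in_place and the rewrite callback are hypothetical, and files are treated
as opaque bytes. It assumes the temporary directory can be created next to the
dataset directory, so that moving unchanged files is a rename rather than a copy.

```python
import os
import shutil
import tempfile


def update_in_place(dataset_dir, rewrite):
    """Hypothetical helper, for illustration only.

    rewrite(path) returns the new bytes for a file, or None if the file
    is unchanged. Returns (files_moved, files_rewritten).
    """
    dataset_dir = os.path.abspath(dataset_dir)
    parent = os.path.dirname(dataset_dir)
    # Create the temporary directory as a sibling of the dataset so the
    # moves below stay on the same filesystem.
    tmp_dir = tempfile.mkdtemp(prefix=".update-", dir=parent)
    moved = 0
    rewritten = 0
    for root, _dirs, files in os.walk(dataset_dir):
        rel_root = os.path.relpath(root, dataset_dir)
        out_root = os.path.normpath(os.path.join(tmp_dir, rel_root))
        os.makedirs(out_root, exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(out_root, name)
            new_data = rewrite(src)
            if new_data is None:
                # Unchanged file: move it instead of copying it.
                os.replace(src, dst)
                moved += 1
            else:
                # Changed file: write the new version into the temp dir.
                with open(dst, "wb") as f:
                    f.write(new_data)
                rewritten += 1
    # Swap the directories. This is two steps and therefore not atomic:
    # a reader may briefly see no dataset at all.
    shutil.rmtree(dataset_dir)
    os.rename(tmp_dir, dataset_dir)
    return moved, rewritten
```

The final swap is deliberately simple. A failure between the rmtree and the
rename leaves the data only in the temporary directory, so this sketch offers
no atomicity or isolation.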
Presumably no ACID guarantees would be made (and they would be quite hard to
guarantee) since Arrow datasets do not currently make ACID guarantees of any
kind.