[
https://issues.apache.org/jira/browse/ARROW-13542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17393178#comment-17393178
]
Ben Kietzman commented on ARROW-13542:
--------------------------------------
I was thinking that partitioning would be handled within this node, since
that would be the most straightforward way to extract a node from
FileSystemDataset::Write.
If you wanted to extract a compute::PartitionNode instead, that'd probably be
useful later on. I think PartitionNode would:
- use a Grouper to identify each row's destination partition
- sort batches by their partition id
- emit slices of input batches with equal partition id
- store each slice's partition expression in ExecBatch::guarantee
(note: this does not use a dataset::Partitioning)
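To make the steps above concrete, here is a rough, library-free sketch of the
group / sort / slice flow. It is not Arrow code: ToyBatch, ToySlice and
PartitionBatch are hypothetical names, a hash map stands in for the Grouper,
and a plain string stands in for the partition expression that would live in
ExecBatch::guarantee.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for an input batch: one partition key per row plus
// a single payload column.
struct ToyBatch {
  std::vector<std::string> keys;
  std::vector<int64_t> values;
};

// Hypothetical stand-in for an emitted slice. `guarantee` plays the role of
// the partition expression attached to the slice.
struct ToySlice {
  std::string guarantee;
  std::vector<int64_t> values;
};

std::vector<ToySlice> PartitionBatch(const ToyBatch& batch) {
  assert(batch.keys.size() == batch.values.size());

  // Step 1: assign a dense id to each distinct key (the Grouper's role here).
  std::unordered_map<std::string, uint32_t> key_to_id;
  std::vector<std::string> id_to_key;
  std::vector<uint32_t> ids;
  ids.reserve(batch.keys.size());
  for (const std::string& key : batch.keys) {
    auto result =
        key_to_id.emplace(key, static_cast<uint32_t>(id_to_key.size()));
    if (result.second) id_to_key.push_back(key);  // first time we see this key
    ids.push_back(result.first->second);
  }

  // Step 2: stable-sort row indices by partition id, preserving row order
  // within each partition.
  std::vector<std::size_t> order(ids.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::stable_sort(order.begin(), order.end(),
                   [&ids](std::size_t a, std::size_t b) { return ids[a] < ids[b]; });

  // Step 3: emit one slice per run of equal ids, tagging it with its key.
  std::vector<ToySlice> slices;
  for (std::size_t i = 0; i < order.size(); ++i) {
    const uint32_t id = ids[order[i]];
    if (slices.empty() || ids[order[i - 1]] != id) {
      slices.push_back(ToySlice{id_to_key[id], {}});
    }
    slices.back().values.push_back(batch.values[order[i]]);
  }
  return slices;
}
```

In this sketch a batch with keys a, b, a, c, b yields three slices, one per
key, in order of first appearance. A real node would slice the sorted batch
rather than copy values row by row.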
Then WriteNode would only use a Partitioning to format ExecBatch::guarantees to
an output directory. I think this approach would allow us to delete
Partitioning::Partition too, since that behavior would now be encapsulated by
PartitionNode.
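As a rough illustration of the WriteNode side, the sketch below turns a
guarantee into an output directory. Everything here is an assumption for
illustration only: ToyGuarantee models the guarantee as ordered (field, value)
pairs, FormatHiveStyleDirectory is a hypothetical helper, and the
"field=value" layout is just one possible scheme, not necessarily what a
given Partitioning would produce.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of a guarantee: ordered (field, value) equality pairs.
using ToyGuarantee = std::vector<std::pair<std::string, std::string>>;

// Appends one "field=value" path segment per pair to base_dir.
std::string FormatHiveStyleDirectory(const std::string& base_dir,
                                     const ToyGuarantee& guarantee) {
  std::string path = base_dir;
  for (const auto& field_value : guarantee) {
    // Insert a separator unless the path is empty or already ends with one.
    if (!path.empty() && path.back() != '/') path += '/';
    path += field_value.first + "=" + field_value.second;
  }
  return path;
}
```

With this split, the writer never inspects row data to choose a directory; it
only formats whatever guarantee arrives with each batch.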
Also note that whatever approach you take is going to impinge on ARROW-13338
since ExecPlans don't support sync scanning and FileSystemDataset::Write
depends on the [[deprecated]] Scanner::Scan.
> [C++][Compute][Dataset] Add dataset::WriteNode for writing rows from an
> ExecPlan to disk
> ----------------------------------------------------------------------------------------
>
> Key: ARROW-13542
> URL: https://issues.apache.org/jira/browse/ARROW-13542
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Ben Kietzman
> Assignee: Weston Pace
> Priority: Major
> Labels: dataset
>
> This will serve as a sink ExecNode which dumps all the batches it receives to
> disk. The PR should probably also replace {{FileSystemDataset::Write}} with
> an ExecPlan-based implementation.