[
https://issues.apache.org/jira/browse/ARROW-8382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ben Kietzman closed ARROW-8382.
-------------------------------
Resolution: Abandoned
> [C++][Dataset] Refactor WritePlan to decouple from Fragment/Scan/Partition
> classes
> -----------------------------------------------------------------------------------
>
> Key: ARROW-8382
> URL: https://issues.apache.org/jira/browse/ARROW-8382
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Francois Saint-Jacques
> Assignee: Ben Kietzman
> Priority: Major
> Labels: dataset
>
> WritePlan should look like the following.
> {code:c++}
> class ARROW_DS_EXPORT WritePlan {
>  public:
>   /// Execute the WritePlan and return a FileSystemDataset as a result.
>   Result<FileSystemDataset> Execute(FileSystemDatasetFactory factory);
>
>  protected:
>   /// The schema of the Dataset which will be written
>   std::shared_ptr<Schema> schema;
>   /// The format into which fragments will be written
>   std::shared_ptr<FileFormat> format;
>
>   using SourceAndReader = std::pair<FileSource, RecordBatchReader>;
>   /// Files to write
>   std::vector<SourceAndReader> outputs;
> };
> {code}
> * Refactor FileFormat::Write(FileSource destination, RecordBatchReader).
> It is not yet clear whether it should take the output schema, or whether
> the RecordBatchReader should already have the right schema.
> * Add a class/function that constructs SourceAndReader from Fragments,
> Partitioning and base path.
> * Move Write() out of FileSystemDataset into WritePlan. It could take a
> FileSystemDatasetFactory to recreate the FileSystemDataset. This is a bonus,
> not a requirement.
> * Simplify the writing routine to avoid the PathTree directory structure; it
> shouldn't be more complex than `for task in write_tasks: task()`. No path
> construction should happen there.
> * Move the logic of dropping columns or any filtering into a custom
> RecordBatchReader.
> The effects are:
> * Simplified WritePlan execution, abstracted away from path construction.
> It can write to multiple FileSystems and/or Buffers since it doesn't
> construct the FileSource itself.
> * By virtue of using RecordBatchReader instead of Fragment, it isn't
> tied to writing from Fragments; it can take any construct that yields a
> RecordBatchReader. It also means that WritePlan doesn't have to know about
> any Scan-related classes.
> * Writing can be done with or without partitioning; this logic is left to
> whoever generates the SourceAndReader list.
> * Should be simpler to test.
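
To illustrate the shape described above, here is a minimal, self-contained
sketch of a plan that only holds (destination, reader) pairs and whose
execution is a plain loop. It does not use the Arrow API: Batch, BatchReader,
VectorReader, DropColumnReader, Sink and WritePlanSketch are all hypothetical
stand-ins (for a record batch, RecordBatchReader, FileSource destinations and
WritePlan respectively), so treat it as one possible reading of the proposal
rather than the intended implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a record batch: one row of named string values.
using Batch = std::map<std::string, std::string>;

// Hypothetical stand-in for RecordBatchReader: yields batches until exhausted.
struct BatchReader {
  virtual ~BatchReader() = default;
  // Fills *out and returns true, or returns false when no batches remain.
  virtual bool Next(Batch* out) = 0;
};

// Reader over an in-memory vector of batches.
class VectorReader : public BatchReader {
 public:
  explicit VectorReader(std::vector<Batch> batches)
      : batches_(std::move(batches)) {}

  bool Next(Batch* out) override {
    if (pos_ == batches_.size()) return false;
    *out = batches_[pos_++];
    return true;
  }

 private:
  std::vector<Batch> batches_;
  std::size_t pos_ = 0;
};

// Column dropping lives in a wrapping reader, not in the plan.
class DropColumnReader : public BatchReader {
 public:
  DropColumnReader(std::shared_ptr<BatchReader> source, std::string column)
      : source_(std::move(source)), column_(std::move(column)) {}

  bool Next(Batch* out) override {
    if (!source_->Next(out)) return false;
    out->erase(column_);
    return true;
  }

 private:
  std::shared_ptr<BatchReader> source_;
  std::string column_;
};

// Hypothetical stand-in for destinations: path -> batches written there.
using Sink = std::map<std::string, std::vector<Batch>>;

struct SourceAndReader {
  std::string destination;  // already fully constructed by the caller
  std::shared_ptr<BatchReader> reader;
};

class WritePlanSketch {
 public:
  void Add(std::string destination, std::shared_ptr<BatchReader> reader) {
    assert(reader != nullptr);
    outputs_.push_back({std::move(destination), std::move(reader)});
  }

  // Drains every reader into its destination; no paths are built here.
  void Execute(Sink* sink) {
    for (auto& output : outputs_) {
      std::vector<Batch>& written = (*sink)[output.destination];
      Batch batch;
      while (output.reader->Next(&batch)) written.push_back(batch);
    }
  }

 private:
  std::vector<SourceAndReader> outputs_;
};
```

In this sketch the caller decides each destination (partitioned or not) and
wraps the reader to drop columns before calling Add(), so Execute() never
needs to know about Fragments, Partitioning or Scan-related classes.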