Partition classes

Francois Saint-Jacques (Jira) Wed, 22 Apr 2020 07:43:27 -0700


     [ 
https://issues.apache.org/jira/browse/ARROW-8382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Francois Saint-Jacques updated ARROW-8382:
------------------------------------------
    Description: 
WritePlan should look like the following.
{code:c++}
class ARROW_DS_EXPORT WritePlan {
 public:
  /// Execute the WritePlan and return a FileSystemDataset as a result.
 Result<FileSystemDataset> Execute(FileSystemDatasetFactory factory);

 protected:
  /// The schema of the Dataset which will be written
  std::shared_ptr<Schema> schema;

  /// The format into which fragments will be written
  std::shared_ptr<FileFormat> format;
 
  using SourceAndReader = std::pair<FIleSource, RecordBatchReader>;
  /// Files to write
  std::vector<SourceAndReader> outputs;
};
{code}
 * Refactor FileFormat::Write(FileSource destination, RecordBatchReader), not 
sure if it should take the output schema, or the RecordBatchReader should be 
already of the right schema.
 * Add a class/function that constructs SourceAndReader from Fragments, 
Partitioning and base path.
 * Move Write() out FIleSystemDataset into WritePlan. It could take a 
FileSystemDatasetFactory to recreate the FileSystemDataset. This is a bonus, 
not a requirement.
 * Simplify writing routine to avoid the PathTree directory structure, it 
shouldn't be more complex than `for task in write_tasks: task()`. Not path 
construction should be there.
 * Move the logic of dropping columns or any filtering into a custom 
RecordBatchReader.

The effects are:
 * Simplified WritePlan execution, abstracted away from path construction, and 
can write to multiple FileSystem and/or Buffers since it doesn't construct the 
FileSource.
 * By the virtue of using RecordBatchReader instead of Fragment, it isn't tied 
to writing from Fragment, it can take any construct that yields a 
RecordBatchReader. It also means that WritePlan doesn't have to know about any 
Scan related classes.
 * Writing can be done with or without partitioning, this logic is given to 
whomever generates the SourceAndReader list.
 * Should be simpler to test.

  was:
WritePlan should look like the following. 

{code:c++}
class ARROW_DS_EXPORT WritePlan {
 public:
  /// Execute the WritePlan and return a FileSystemDataset as a result.
 Result<FileSystemDataset> Execute(FileSystemDatasetFactory factory);

 protected:
  /// The schema of the Dataset which will be written
  std::shared_ptr<Schema> schema;

  /// The format into which fragments will be written
  std::shared_ptr<FileFormat> format;
 
  using SourceAndReader = std::pair<FIleSource, RecordBatchReader>;
  /// Files to write
  std::vector<SourceAndReader> outputs;
};
{code}

* Refactor FileFormat::Write(FileSource destination, RecordBatchReader), not 
sure if it should take the output schema, or the RecordBatchReader should be 
already of the right schema.
* Add a class/function that constructs SourceAndReader from Fragments, 
Partitioning and base path.
* Move Write() out FIleSystemDataset into WritePlan. It could take a 
FileSystemDatasetFactory to recreate the FileSystemDataset. This is a bonus, 
not a requirement.
* Simplify writing routine to avoid the PathTree directory structure, it 
shouldn't be more complex than `for task in write_tasks: task()`. Not path 
construction should be there.

The effects are:
* Simplified WritePlan execution, abstracted away from path construction, and 
can write to multiple FileSystem and/or Buffers since it doesn't construct the 
FileSource.
* By the virtue of using RecordBatchReader instead of Fragment, it isn't tied 
to writing from Fragment, it can take any construct that yields a 
RecordBatchReader. It also means that WritePlan doesn't have to know about any 
Scan related classes.
* Writing can be done with or without partitioning, this logic is given to 
whomever generates the SourceAndReader list.
* Should be simpler to test.


> [C++][Dataset] Refactor WritePlan to decouple from Fragment/Scan/Partition 
> classes 
> -----------------------------------------------------------------------------------
>
>                 Key: ARROW-8382
>                 URL: https://issues.apache.org/jira/browse/ARROW-8382
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Francois Saint-Jacques
>            Priority: Major
>              Labels: dataset
>
> WritePlan should look like the following.
> {code:c++}
> class ARROW_DS_EXPORT WritePlan {
>  public:
>   /// Execute the WritePlan and return a FileSystemDataset as a result.
>  Result<FileSystemDataset> Execute(FileSystemDatasetFactory factory);
>  protected:
>   /// The schema of the Dataset which will be written
>   std::shared_ptr<Schema> schema;
>   /// The format into which fragments will be written
>   std::shared_ptr<FileFormat> format;
>  
>   using SourceAndReader = std::pair<FIleSource, RecordBatchReader>;
>   /// Files to write
>   std::vector<SourceAndReader> outputs;
> };
> {code}
>  * Refactor FileFormat::Write(FileSource destination, RecordBatchReader), not 
> sure if it should take the output schema, or the RecordBatchReader should be 
> already of the right schema.
>  * Add a class/function that constructs SourceAndReader from Fragments, 
> Partitioning and base path.
>  * Move Write() out FIleSystemDataset into WritePlan. It could take a 
> FileSystemDatasetFactory to recreate the FileSystemDataset. This is a bonus, 
> not a requirement.
>  * Simplify writing routine to avoid the PathTree directory structure, it 
> shouldn't be more complex than `for task in write_tasks: task()`. Not path 
> construction should be there.
>  * Move the logic of dropping columns or any filtering into a custom 
> RecordBatchReader.
> The effects are:
>  * Simplified WritePlan execution, abstracted away from path construction, 
> and can write to multiple FileSystem and/or Buffers since it doesn't 
> construct the FileSource.
>  * By the virtue of using RecordBatchReader instead of Fragment, it isn't 
> tied to writing from Fragment, it can take any construct that yields a 
> RecordBatchReader. It also means that WritePlan doesn't have to know about 
> any Scan related classes.
>  * Writing can be done with or without partitioning, this logic is given to 
> whomever generates the SourceAndReader list.
>  * Should be simpler to test.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (ARROW-8382) [C++][Dataset] Refactor WritePlan to decouple from Fragment/Scan/Partition classes

Reply via email to