[ https://issues.apache.org/jira/browse/ARROW-6579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17394270#comment-17394270 ]
Weston Pace commented on ARROW-6579:
------------------------------------
I'm pretty sure this has already been done. write_to_dataset now uses
FileSystemDataset::Write (provided use_legacy_dataset is False), which runs in
parallel.
> [Python] Parallel pyarrow.parquet.write_to_dataset
> --------------------------------------------------
>
> Key: ARROW-6579
> URL: https://issues.apache.org/jira/browse/ARROW-6579
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 0.14.1
> Reporter: Adam Lippai
> Priority: Major
> Labels: dataset, dataset-parquet-write, parquet
>
> pyarrow.parquet.write_to_dataset() is single-threaded now and converts the
> table from/to Pandas. We should lower the dataset writing to C++ (dropping
> Pandas usage) so it's easier to write the partitioned dataset using multiple
> threads.