devinjdangelo commented on code in PR #6987:
URL: https://github.com/apache/arrow-datafusion/pull/6987#discussion_r1265974876


##########
datafusion/core/src/datasource/physical_plan/csv.rs:
##########
@@ -566,30 +564,32 @@ pub async fn plan_to_csv(
     path: impl AsRef<str>,
 ) -> Result<()> {
     let path = path.as_ref();
-    // create directory to contain the CSV files (one per partition)
-    let fs_path = Path::new(path);
-    if let Err(e) = fs::create_dir(fs_path) {
-        return Err(DataFusionError::Execution(format!(
-            "Could not create directory {path}: {e:?}"
-        )));
-    }
-
+    let parsed = ListingTableUrl::parse(path)?;
+    let object_store_url = parsed.object_store();
+    let store = task_ctx.runtime_env().object_store(&object_store_url)?;
+    let mut buffer;
     let mut join_set = JoinSet::new();
     for i in 0..plan.output_partitioning().partition_count() {
-        let plan = plan.clone();
-        let filename = format!("part-{i}.csv");
-        let path = fs_path.join(filename);
-        let file = fs::File::create(path)?;
-        let mut writer = csv::Writer::new(file);
-        let stream = plan.execute(i, task_ctx.clone())?;
+        let storeref = store.clone();
+        let plan: Arc<dyn ExecutionPlan> = plan.clone();
+        let filename = format!("{}/part-{i}.csv", parsed.prefix());
+        let file = object_store::path::Path::parse(filename)?;
+        buffer = Vec::new();

Review Comment:
   Yes, everything is buffered into memory and uploaded in a single `put`, which will be very memory inefficient when writing large files.
   
   If you are writing many small files, multipart upload could add unnecessary overhead from creating and finalizing uploads that may contain only a single part.
   
   If we must choose only one or the other, I would also favor multipart upload, since large files could fail outright in the current implementation, whereas small files would at worst upload a little more slowly in a multipart implementation. I will work on a multipart implementation of this!
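   
   A streaming multipart version could look roughly like the sketch below. This is a minimal sketch under stated assumptions, not this PR's code: it assumes `object_store`'s `put_multipart` (which hands back an `AsyncWrite` that uploads parts as they fill), and `stream_partition_to_csv` is a hypothetical helper name. Memory use is then bounded by batch size rather than file size.
   
   ```rust
   use datafusion::arrow::csv::WriterBuilder;
   use datafusion::error::Result;
   use datafusion::physical_plan::SendableRecordBatchStream;
   use futures::StreamExt;
   use object_store::{path::Path, ObjectStore};
   use std::sync::Arc;
   use tokio::io::AsyncWriteExt;
   
   // Hypothetical helper, named for illustration only (not this PR's code).
   async fn stream_partition_to_csv(
       store: Arc<dyn ObjectStore>,
       file: Path,
       mut stream: SendableRecordBatchStream,
   ) -> Result<()> {
       // `put_multipart` hands back an AsyncWrite that uploads parts as
       // they fill, so memory stays bounded by batch size, not file size.
       let (_id, mut upload) = store.put_multipart(&file).await?;
       let mut first = true;
       while let Some(batch) = stream.next().await {
           let batch = batch?;
           // Encode one batch into a small scratch buffer, emitting the
           // CSV header only before the first batch.
           let mut buffer = Vec::new();
           {
               let mut csv_writer =
                   WriterBuilder::new().has_headers(first).build(&mut buffer);
               csv_writer.write(&batch)?;
           }
           first = false;
           upload.write_all(&buffer).await?;
       }
       // shutdown() finalizes (commits) the multipart upload.
       upload.shutdown().await?;
       Ok(())
   }
   ```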
   
   This is probably too ambitious for this PR, but in the future we could automatically select between `put` and a multipart put based on information in the `ExecutionPlan` (maybe [this](https://github.com/apache/arrow-datafusion/blob/9338880d8f64f8143e348a60beee8af2789fa8ae/datafusion/common/src/stats.rs#L31)?). I.e., if the anticipated average file size exceeds some threshold in bytes, do a streaming multipart put; otherwise, a single `put`. A rough sketch of that dispatch follows.
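   
   As a hypothetical illustration (not part of this PR): `MULTIPART_THRESHOLD` is an assumed tuning constant, and since `Statistics::total_byte_size` is an `Option`, the estimate is often unavailable, in which case the safe streaming path wins.
   
   ```rust
   use datafusion::physical_plan::ExecutionPlan;
   
   /// Illustrative threshold only; a real value would need benchmarking.
   const MULTIPART_THRESHOLD: usize = 10 * 1024 * 1024; // 10 MiB
   
   /// Hypothetical dispatch: stream a multipart upload when the estimated
   /// bytes per partition exceed the threshold, or when no estimate exists.
   fn should_use_multipart(plan: &dyn ExecutionPlan) -> bool {
       let partitions = plan.output_partitioning().partition_count().max(1);
       match plan.statistics().total_byte_size {
           Some(total) => total / partitions > MULTIPART_THRESHOLD,
           None => true, // unknown output size: prefer the safe streaming path
       }
   }
   ```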