[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #459: Refactor QueryStageExec in preparation for implementing map-side shuffle

GitBox Mon, 31 May 2021 12:59:46 -0700


Dandandan commented on a change in pull request #459:
URL: https://github.com/apache/arrow-datafusion/pull/459#discussion_r642658594




##########
File path: ballista/rust/core/src/execution_plans/query_stage.rs
##########
@@ -77,16 +109,142 @@ impl ExecutionPlan for QueryStageExec {
     ) -> Result<Arc<dyn ExecutionPlan>> {
         assert!(children.len() == 1);
         Ok(Arc::new(QueryStageExec::try_new(
-            self.job_id.clone(),
+            &self.job_id,
             self.stage_id,
             children[0].clone(),
+            &self.work_dir,
+            None,
         )?))
     }
 
     async fn execute(
         &self,
         partition: usize,
     ) -> Result<Pin<Box<dyn RecordBatchStream + Send + Sync>>> {
-        self.child.execute(partition).await
+        let now = Instant::now();
+
+        let mut stream = self.plan.execute(partition).await?;
+
+        let mut path = PathBuf::from(&self.work_dir);
+        path.push(&self.job_id);
+        path.push(&format!("{}", self.stage_id));
+
+        match &self.shuffle_output_partitioning {
+            None => {
+                path.push(&format!("{}", partition));
+                std::fs::create_dir_all(&path)?;
+
+                path.push("data.arrow");
+                let path = path.to_str().unwrap();
+                info!("Writing results to {}", path);
+
+                // stream results to disk
+                let stats = utils::write_stream_to_disk(&mut stream, &path)
+                    .await
+                    .map_err(|e| DataFusionError::Execution(format!("{:?}", 
e)))?;
+
+                info!(
+                    "Executed partition {} in {} seconds. Statistics: {:?}",
+                    partition,
+                    now.elapsed().as_secs(),
+                    stats
+                );
+
+                let schema = Arc::new(Schema::new(vec![
+                    Field::new("path", DataType::Utf8, false),
+                    stats.arrow_struct_repr(),
+                ]));
+
+                // build result set with summary of the partition execution 
status
+                let mut c0 = StringBuilder::new(1);
+                c0.append_value(&path).unwrap();
+                let path: ArrayRef = Arc::new(c0.finish());
+
+                let stats: ArrayRef = stats
+                    .to_arrow_arrayref()
+                    .map_err(|e| DataFusionError::Execution(format!("{:?}", 
e)))?;
+                let batch = RecordBatch::try_new(schema.clone(), vec![path, 
stats])
+                    .map_err(DataFusionError::ArrowError)?;
+
+                Ok(Box::pin(MemoryStream::try_new(vec![batch], schema, None)?))
+            }
+
+            Some(Partitioning::Hash(_, _)) => {
+                //TODO re-use code from RepartitionExec to split each batch 
into
+                // partitions and write to one IPC file per partition
+                // See https://github.com/apache/arrow-datafusion/issues/456
+                unimplemented!()

Review comment:
       Could use `DataFusionError:: NotImplemented`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #459: Refactor QueryStageExec in preparation for implementing map-side shuffle

Reply via email to