hahazyb201 opened a new issue, #5731:
URL: https://github.com/apache/incubator-gluten/issues/5731

   ### Backend
   
   VL (Velox)
   
   ### Bug description
   
   In the DAG, the "shuffle write time total" metric was much larger than I 
expected. After digging into the Gluten code, I found that writeTime_ is added 
twice to the final metric reported through writeMetrics.incWriteTime.
   <img width="371" alt="Screenshot 2024-05-13 17 54 33" 
src="https://github.com/apache/incubator-gluten/assets/20397108/bd65828d-c609-4a1f-8330-8ad130aca82c">
   
   In VeloxCelebornHashBasedColumnarShuffleWriter.scala, the [write 
time](https://github.com/apache/incubator-gluten/blob/main/gluten-celeborn/velox/src/main/scala/org/apache/spark/shuffle/VeloxCelebornHashBasedColumnarShuffleWriter.scala#L155)
 is calculated as splitResult.getTotalWriteTime + 
splitResult.getTotalPushTime. The totalWriteTime is accumulated on this 
[line](https://github.com/apache/incubator-gluten/blob/main/cpp/core/shuffle/Payload.cc#L238)
 and the totalPushTime is accumulated 
[here](https://github.com/apache/incubator-gluten/blob/main/cpp/core/shuffle/rss/RssPartitionWriter.cc#L60)
 from the spillTime_ variable. Because spillTime_ includes writeTime_, 
writeTime_ is counted twice in the final write time metric.
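   
   To spell out the arithmetic, here is a small worked equation. The symbols 
are only illustrative: $T_w$ is the time measured into writeTime_ and $T_p$ is 
the remaining push time, assuming spillTime_ covers nothing beyond these two.

$$
\begin{aligned}
\text{reported} &= \text{totalWriteTime} + \text{totalPushTime} \\
                &= T_w + (T_w + T_p) \\
                &= 2\,T_w + T_p
\end{aligned}
$$

   Under that assumption the elapsed time is only $T_w + T_p$, so the metric 
is inflated by $T_w$.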
   
   To fix this, I propose moving the ScopedTimer 
[line](https://github.com/apache/incubator-gluten/blob/main/cpp/core/shuffle/rss/RssPartitionWriter.cc#L60)
 a few lines down, so that spillTime_ no longer covers the time already 
counted in writeTime_.
   <img width="625" alt="Screenshot 2024-05-13 19 07 51" 
src="https://github.com/apache/incubator-gluten/assets/20397108/e644e860-bfe5-4e80-90ef-852767913388">
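   
   As a rough illustration of why the timer placement matters, here is a 
minimal standalone sketch. It is not the Gluten code: FakeClock, SketchTimer, 
Metrics, timerAtTop and timerMovedDown are hypothetical names, and a fake clock 
replaces real timing so the numbers are reproducible. It only assumes that a 
scoped timer adds its elapsed time to a counter when it goes out of scope.

```cpp
#include <cassert>
#include <cstdint>

// Deterministic stand-in for a wall clock so the numbers are reproducible.
struct FakeClock {
  std::int64_t now = 0;
  void advance(std::int64_t ns) {
    assert(ns >= 0);
    now += ns;
  }
};

// Simplified, hypothetical scoped timer (not Gluten's ScopedTimer):
// on destruction it adds the elapsed fake time to *sink.
class SketchTimer {
 public:
  SketchTimer(const FakeClock& clock, std::int64_t* sink)
      : clock_(clock), sink_(sink), start_(clock.now) {}
  ~SketchTimer() { *sink_ += clock_.now - start_; }

 private:
  const FakeClock& clock_;
  std::int64_t* sink_;
  std::int64_t start_;
};

struct Metrics {
  std::int64_t writeTime = 0;  // stands in for writeTime_
  std::int64_t spillTime = 0;  // stands in for spillTime_
  // Mirrors the sum getTotalWriteTime + getTotalPushTime from the report.
  std::int64_t reported() const { return writeTime + spillTime; }
};

// Placement described in the report: the spill timer starts first,
// so it also covers the phase that the write timer measures.
Metrics timerAtTop(std::int64_t writeNs, std::int64_t pushNs) {
  FakeClock clock;
  Metrics m;
  {
    SketchTimer spill(clock, &m.spillTime);
    {
      SketchTimer write(clock, &m.writeTime);
      clock.advance(writeNs);
    }
    clock.advance(pushNs);
  }
  return m;
}

// Proposed placement: the spill timer starts only after the write phase.
Metrics timerMovedDown(std::int64_t writeNs, std::int64_t pushNs) {
  FakeClock clock;
  Metrics m;
  {
    SketchTimer write(clock, &m.writeTime);
    clock.advance(writeNs);
  }
  {
    SketchTimer spill(clock, &m.spillTime);
    clock.advance(pushNs);
  }
  return m;
}
```

   With a 30 ns write phase and a 70 ns push phase, timerAtTop(30, 70) 
reports 130 ns while timerMovedDown(30, 70) reports 100 ns, which matches the 
100 ns that actually elapsed on the fake clock.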
   
   Let me know if you want me to open a PR. Thanks.
   
   
   
   ### Spark version
   
   Spark-3.2.x
   
   ### Spark configurations
   
   _No response_
   
   ### System information
   
   _No response_
   
   ### Relevant logs
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

