Re: [PR] [VL] Fix dump benchmark data issue in same task [incubator-gluten]

via GitHub Wed, 23 Apr 2025 13:37:02 -0700


marin-ma commented on code in PR #9360:
URL: https://github.com/apache/incubator-gluten/pull/9360#discussion_r2056832237



##########
docs/developers/MicroBenchmarks.md:
##########
@@ -61,9 +62,10 @@ the input files:
 ```shell
 cd /path/to/gluten/cpp/build/velox/benchmarks
 ./generic_benchmark \
---plan 
/home/sparkuser/github/apache/incubator-gluten/backends-velox/generated-native-benchmark/example.json
 \
---data 
/home/sparkuser/github/apache/incubator-gluten/backends-velox/generated-native-benchmark/example_orders/part-00000-1e66fb98-4dd6-47a6-8679-8625dbc437ee-c000.snappy.parquet,\
-/home/sparkuser/github/apache/incubator-gluten/backends-velox/generated-native-benchmark/example_lineitem/part-00000-3ec19189-d20e-4240-85ae-88631d46b612-c000.snappy.parquet
 \
+--plan 
<path-to-gluten>/backends-velox/generated-native-benchmark/plan_{stageId}_{partitionId}_{vId}.json
 \
+--data 
<path-to-gluten>/backends-velox/generated-native-benchmark/data_{stageId}_{partitionId}_{vId}_{idx}.parquet,\

Review Comment:
   idx -> iteratorIdx



##########
docs/developers/MicroBenchmarks.md:
##########
@@ -120,23 +122,23 @@ And then re-run the query with below configurations to 
dump the inputs to micro
 
 | Parameters                                  | Description                    
                                                                                
                                                                                
                                                                                
                                                             | Recommend 
Setting                                          |
 
|---------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------|
-| spark.gluten.sql.benchmark_task.taskId      | Comma-separated string to 
specify the Task IDs to dump. If it's set, 
`spark.gluten.sql.benchmark_task.stageId` and 
`spark.gluten.sql.benchmark_task.partitionId` will be ignored.                  
                                                                                
                                                         | Comma-separated 
string of task IDs. Empty by default.      |
+| spark.gluten.sql.benchmark_task.partitionId | Comma-separated string to 
specify the Task IDs to dump. If it's set, 
`spark.gluten.sql.benchmark_task.stageId` and 
`spark.gluten.sql.benchmark_task.partitionId` will be ignored.                  
                                                                                
                                                         | Comma-separated 
string of task IDs. Empty by default.      |

Review Comment:
   Unintentional change?



##########
docs/developers/MicroBenchmarks.md:
##########
@@ -43,16 +43,17 @@ mvn test -Pspark-3.2 -Pbackends-velox -pl backends-velox 
-am \
 
 The generated example files are placed in gluten/backends-velox:
 
+- stageId means Spark stage id in web ui.
+- partitionedId means Spark partition id in web stage ui.
+- vId means backends internal virtual id.
+
 ```shell
 $ tree gluten/backends-velox/generated-native-benchmark/
 gluten/backends-velox/generated-native-benchmark/
-├── example.json
-├── example_lineitem
-│   ├── part-00000-3ec19189-d20e-4240-85ae-88631d46b612-c000.snappy.parquet
-│   └── _SUCCESS
-└── example_orders
-    ├── part-00000-1e66fb98-4dd6-47a6-8679-8625dbc437ee-c000.snappy.parquet
-    └── _SUCCESS
+├── plan_{stageId}_{partitionId}_{vId}.json

Review Comment:
   This is an example to demonstrate the real structure of the generated files. 
Let's use the actual file name.



##########
.github/workflows/velox_backend.yml:
##########
@@ -1263,9 +1263,9 @@ jobs:
           -DtagsToInclude="org.apache.gluten.tags.GenerateExample" -Dtest=none 
-DfailIfNoTests=false -Dexec.skip
           # This test depends on files generated by the above mvn test.
           ./cpp/build/velox/benchmarks/generic_benchmark --with-shuffle 
--partitioning hash --threads 1 --iterations 1 \
-          --conf $(realpath 
backends-velox/generated-native-benchmark/conf_12_0.ini) \
-          --plan $(realpath 
backends-velox/generated-native-benchmark/plan_12_0.json) \
-          --data $(realpath 
backends-velox/generated-native-benchmark/data_12_0_0.parquet),$(realpath 
backends-velox/generated-native-benchmark/data_12_0_1.parquet)
+          --conf $(realpath 
backends-velox/generated-native-benchmark/conf_12_*.ini) \

Review Comment:
   The wildcard can be misleading to the readers or users. If `partitionId` is 
not specified, then there can be multiple partitions get dumped. But 
microbenchmark can only accept one partition per each run.



##########
docs/developers/MicroBenchmarks.md:
##########
@@ -199,22 +201,22 @@ Sample command:
 ```shell
 cd /path/to/gluten/cpp/build/velox/benchmarks
 ./generic_benchmark \
---conf /absolute_path/to/conf_[stageId]_[partitionId].ini \
---plan /absolute_path/to/plan_[stageId]_[partitionId].json \
---data 
/absolut_path/to/data_[stageId]_[partitionId]_0.parquet,/absolut_path/to/data_[stageId]_[partitionId]_1.parquet
 \
+--conf /absolute_path/to/conf_{stageId}_{partitionId}_{vId}.ini \
+--plan /absolute_path/to/plan_{stageId}_{partitionId}_{vId}.json \
+--data 
/absolut_path/to/data_{stageId}_{partitionId}_{vId}_{iteratorIdx}.parquet,/absolut_path/to/data_{stageId}_{partitionId}_{vId}_{iteratorIdx}.parquet
 \
 --threads 1 --noprint-result
 ```
 
 For some complex queries, stageId may cannot represent the Substrait plan 
input, please get the
-taskId from spark UI, and get your target parquet from saveDir.
+partitionId from spark UI, and get your target parquet from saveDir.

Review Comment:
   This part of change is not correct... taskId is useful when the task cannot 
be dumped by using the combination of stage id + partition id.



##########
cpp/velox/compute/VeloxRuntime.cc:
##########
@@ -80,10 +108,14 @@ VeloxRuntime::VeloxRuntime(
       kMemoryPoolCapacityTransferAcrossTasks, 
FLAGS_velox_memory_pool_capacity_transfer_across_tasks);
 }
 
-void VeloxRuntime::parsePlan(const uint8_t* data, int32_t size, 
std::optional<std::string> dumpFile) {
-  if (debugModeEnabled_ || dumpFile.has_value()) {
+void VeloxRuntime::parsePlan(const uint8_t* data, int32_t size, bool dump) {
+  if (debugModeEnabled_ || dump) {
     try {
-      auto planJson = substraitFromPbToJson("Plan", data, size, dumpFile);
+      auto planJson = substraitFromPbToJson("Plan", data, size);
+      if (dump) {
+        auto dumpFile = fmt::format("plan_{}_{}_{}.json", taskInfo_.stageId, 
taskInfo_.taskId, taskInfo_.vId);

Review Comment:
   Seems like this one wasn't resolved.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [VL] Fix dump benchmark data issue in same task [incubator-gluten]

Reply via email to