marin-ma commented on code in PR #9360:
URL: https://github.com/apache/incubator-gluten/pull/9360#discussion_r2056832237
##########
docs/developers/MicroBenchmarks.md:
##########
@@ -61,9 +62,10 @@ the input files:
```shell
cd /path/to/gluten/cpp/build/velox/benchmarks
./generic_benchmark \
---plan
/home/sparkuser/github/apache/incubator-gluten/backends-velox/generated-native-benchmark/example.json
\
---data
/home/sparkuser/github/apache/incubator-gluten/backends-velox/generated-native-benchmark/example_orders/part-00000-1e66fb98-4dd6-47a6-8679-8625dbc437ee-c000.snappy.parquet,\
-/home/sparkuser/github/apache/incubator-gluten/backends-velox/generated-native-benchmark/example_lineitem/part-00000-3ec19189-d20e-4240-85ae-88631d46b612-c000.snappy.parquet
\
+--plan
<path-to-gluten>/backends-velox/generated-native-benchmark/plan_{stageId}_{partitionId}_{vId}.json
\
+--data
<path-to-gluten>/backends-velox/generated-native-benchmark/data_{stageId}_{partitionId}_{vId}_{idx}.parquet,\
Review Comment:
idx -> iteratorIdx
##########
docs/developers/MicroBenchmarks.md:
##########
@@ -120,23 +122,23 @@ And then re-run the query with below configurations to
dump the inputs to micro
| Parameters | Description
| Recommend
Setting |
|---------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------|
-| spark.gluten.sql.benchmark_task.taskId | Comma-separated string to
specify the Task IDs to dump. If it's set,
`spark.gluten.sql.benchmark_task.stageId` and
`spark.gluten.sql.benchmark_task.partitionId` will be ignored.
| Comma-separated
string of task IDs. Empty by default. |
+| spark.gluten.sql.benchmark_task.partitionId | Comma-separated string to
specify the Task IDs to dump. If it's set,
`spark.gluten.sql.benchmark_task.stageId` and
`spark.gluten.sql.benchmark_task.partitionId` will be ignored.
| Comma-separated
string of task IDs. Empty by default. |
Review Comment:
Unintentional change?
##########
docs/developers/MicroBenchmarks.md:
##########
@@ -43,16 +43,17 @@ mvn test -Pspark-3.2 -Pbackends-velox -pl backends-velox
-am \
The generated example files are placed in gluten/backends-velox:
+- stageId means Spark stage id in web ui.
+- partitionedId means Spark partition id in web stage ui.
+- vId means backends internal virtual id.
+
```shell
$ tree gluten/backends-velox/generated-native-benchmark/
gluten/backends-velox/generated-native-benchmark/
-├── example.json
-├── example_lineitem
-│ ├── part-00000-3ec19189-d20e-4240-85ae-88631d46b612-c000.snappy.parquet
-│ └── _SUCCESS
-└── example_orders
- ├── part-00000-1e66fb98-4dd6-47a6-8679-8625dbc437ee-c000.snappy.parquet
- └── _SUCCESS
+├── plan_{stageId}_{partitionId}_{vId}.json
Review Comment:
This is an example to demonstrate the real structure of the generated files.
Let's use the actual file name.
##########
.github/workflows/velox_backend.yml:
##########
@@ -1263,9 +1263,9 @@ jobs:
-DtagsToInclude="org.apache.gluten.tags.GenerateExample" -Dtest=none
-DfailIfNoTests=false -Dexec.skip
# This test depends on files generated by the above mvn test.
./cpp/build/velox/benchmarks/generic_benchmark --with-shuffle
--partitioning hash --threads 1 --iterations 1 \
- --conf $(realpath
backends-velox/generated-native-benchmark/conf_12_0.ini) \
- --plan $(realpath
backends-velox/generated-native-benchmark/plan_12_0.json) \
- --data $(realpath
backends-velox/generated-native-benchmark/data_12_0_0.parquet),$(realpath
backends-velox/generated-native-benchmark/data_12_0_1.parquet)
+ --conf $(realpath
backends-velox/generated-native-benchmark/conf_12_*.ini) \
Review Comment:
The wildcard can be misleading to the readers or users. If `partitionId` is
not specified, then there can be multiple partitions get dumped. But
microbenchmark can only accept one partition per each run.
##########
docs/developers/MicroBenchmarks.md:
##########
@@ -199,22 +201,22 @@ Sample command:
```shell
cd /path/to/gluten/cpp/build/velox/benchmarks
./generic_benchmark \
---conf /absolute_path/to/conf_[stageId]_[partitionId].ini \
---plan /absolute_path/to/plan_[stageId]_[partitionId].json \
---data
/absolut_path/to/data_[stageId]_[partitionId]_0.parquet,/absolut_path/to/data_[stageId]_[partitionId]_1.parquet
\
+--conf /absolute_path/to/conf_{stageId}_{partitionId}_{vId}.ini \
+--plan /absolute_path/to/plan_{stageId}_{partitionId}_{vId}.json \
+--data
/absolut_path/to/data_{stageId}_{partitionId}_{vId}_{iteratorIdx}.parquet,/absolut_path/to/data_{stageId}_{partitionId}_{vId}_{iteratorIdx}.parquet
\
--threads 1 --noprint-result
```
For some complex queries, stageId may cannot represent the Substrait plan
input, please get the
-taskId from spark UI, and get your target parquet from saveDir.
+partitionId from spark UI, and get your target parquet from saveDir.
Review Comment:
This part of change is not correct... taskId is useful when the task cannot
be dumped by using the combination of stage id + partition id.
##########
cpp/velox/compute/VeloxRuntime.cc:
##########
@@ -80,10 +108,14 @@ VeloxRuntime::VeloxRuntime(
kMemoryPoolCapacityTransferAcrossTasks,
FLAGS_velox_memory_pool_capacity_transfer_across_tasks);
}
-void VeloxRuntime::parsePlan(const uint8_t* data, int32_t size,
std::optional<std::string> dumpFile) {
- if (debugModeEnabled_ || dumpFile.has_value()) {
+void VeloxRuntime::parsePlan(const uint8_t* data, int32_t size, bool dump) {
+ if (debugModeEnabled_ || dump) {
try {
- auto planJson = substraitFromPbToJson("Plan", data, size, dumpFile);
+ auto planJson = substraitFromPbToJson("Plan", data, size);
+ if (dump) {
+ auto dumpFile = fmt::format("plan_{}_{}_{}.json", taskInfo_.stageId,
taskInfo_.taskId, taskInfo_.vId);
Review Comment:
Seems like this one wasn't resolved.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]