zhztheplayer opened a new issue, #10778: URL: https://github.com/apache/incubator-gluten/issues/10778
### Backend

VL (Velox)

### Bug description

On a TPC-DS dataset where 182300 files (which is a lot) were generated for table `web_sales` (the other fact tables similarly consist of many small files), Gluten is much slower than vanilla Spark when reading the data.

Code versions: latest Gluten code + Spark 3.4 + Delta 2.4

See test report:

```
Test report:

Summary: 5 out of 5 queries passed.

| Query ID | Passed | Row Count | Planning Time (Millis) | Query Time (Millis) | Speedup |
| | | Vanilla | Gluten | Vanilla | Gluten | Vanilla | Gluten | |
|----------|--------|---------|--------|------------|------------|-----------|-----------|---------|
| q1| true| 100| 100| 13348| 11773| 27781| 9613| 188.99%|
| q2| true| 2513| 2513| 5610| 5452| 81757| 277679| -70.56%|
| q3| true| 100| 100| 3969| 4366| 31489| 18550| 69.75%|
| q4| true| 100| 100| 2140| 2392| 300999| 658322| -54.28%|
| q5| true| 100| 100| 4013| 3674| 219881| 129141| 70.26%|
| all| true| 2913| 2913| 29080| 27657| 661907| 1093305| -39.46%|

No failed queries.
```

Test command used (gluten-it):

```
sbin/gluten-it.sh queries-compare --benchmark-type=ds --data-gen=once --local-cluster --auto-cluster-resource --off-heap-ratio=0.5 --enable-history --enable-ui --gen-partitioned-data -s=1000.0 --data-source=delta --data-dir=/root/data --extra-conf=spark.gluten.sql.columnar.scanOnly=true --queries=q1,q2,q3,q4,q5 --shuffle-partitions=100
```

Hardware (64 CPUs + 256 GiB RAM):

```
Gluten Version: 1.6.0-SNAPSHOT
Commit: a1edfafcd4025440caef8bba5a0d5a1c432c2480
CMake Version: 3.28.3
System: Linux-6.1.141-155.222.amzn2023.x86_64
Arch: x86_64
CPU Name: Model name: Intel(R) Xeon(R) Platinum 8488C
C++ Compiler: /usr/bin/c++
C++ Compiler Version: 13.3.0
C Compiler: /usr/bin/cc
C Compiler Version: 13.3.0
CMake Prefix Path: /usr/local;/usr;/;/root/.local/share/uv/tools/cmake/lib/python3.12/site-packages/cmake/data;/usr/local;/usr/X11R6;/usr/pkg;/opt
```

### Gluten version

_No response_

### Spark version

None

### Spark configurations

_No response_

### System information

_No response_

### Relevant logs

```bash
```

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
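
A note on reading the Speedup column in the test report above: the reported percentages appear consistent with `(vanilla - gluten) / gluten` applied to the Query Time columns, so a negative value means Gluten was slower. This is an assumption inferred from the numbers, not taken from the gluten-it source. The names `QUERY_TIMES_MS`, `speedup_percent`, and `format_report` below are hypothetical and exist only for this sketch.

```python
# Query times in milliseconds, copied from the test report: (vanilla, gluten).
QUERY_TIMES_MS = {
    "q1": (27781, 9613),
    "q2": (81757, 277679),
    "q3": (31489, 18550),
    "q4": (300999, 658322),
    "q5": (219881, 129141),
}


def speedup_percent(vanilla_ms, gluten_ms):
    """Assumed formula: time saved by Gluten, relative to Gluten's own time."""
    return (vanilla_ms - gluten_ms) / gluten_ms * 100.0


def format_report(times):
    """Return {query: 'xx.xx%'} plus an 'all' entry computed from summed times."""
    total_vanilla = sum(v for v, _ in times.values())
    total_gluten = sum(g for _, g in times.values())
    report = {q: f"{speedup_percent(v, g):.2f}%" for q, (v, g) in times.items()}
    report["all"] = f"{speedup_percent(total_vanilla, total_gluten):.2f}%"
    return report


if __name__ == "__main__":
    for query, speedup in format_report(QUERY_TIMES_MS).items():
        print(f"{query}: {speedup}")
```

Under this reading, q2 and q4 (about 278 s and 658 s in Gluten) dominate the total, which is why the overall figure is negative even though three of the five queries are faster with Gluten.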
