kazantsev-maksim opened a new pull request, #3044: URL: https://github.com/apache/datafusion-comet/pull/3044
## Which issue does this PR close?

- N/A

## Rationale for this change

Added an experimental implementation of native CSV file reading (currently only for DataSourceV2).

Required improvements:

1. Conduct more benchmark tests
2. Try to implement the idea from https://github.com/apache/datafusion-comet/issues/882
3. Test reading files from S3/HDFS (currently only tested on local files)

Results of a simple benchmark test (1 iteration per case). All cases ran on OpenJDK 64-Bit Server VM 11.0.26+4-LTS on Mac OS X 26.2, Apple M1 Pro. Decimal values use a comma separator, as printed by the benchmark.

Running benchmark: Native csv read: orders table

- Running case: Spark. Stopped after 1 iterations, 238272 ms
- Running case: Comet. Stopped after 1 iterations, 55326 ms

| Native csv read: orders table | Best Time(ms) | Avg Time(ms) | Stdev(ms) | Rate(M/s) | Per Row(ns) | Relative |
|---|---|---|---|---|---|---|
| Spark | 238272 | 238272 | 0 | 0,0 | 113617,0 | 1,0X |
| Comet | 55327 | 55327 | 0 | 0,0 | 26381,9 | 4,3X |

Running benchmark: Native csv read: region table

- Running case: Spark. Stopped after 1 iterations, 36 ms
- Running case: Comet. Stopped after 1 iterations, 38 ms

| Native csv read: region table | Best Time(ms) | Avg Time(ms) | Stdev(ms) | Rate(M/s) | Per Row(ns) | Relative |
|---|---|---|---|---|---|---|
| Spark | 36 | 36 | 0 | 57,7 | 17,3 | 1,0X |
| Comet | 39 | 39 | 0 | 54,0 | 18,5 | 0,9X |

Running benchmark: Native csv read: nation table

- Running case: Spark. Stopped after 1 iterations, 35 ms
- Running case: Comet. Stopped after 1 iterations, 39 ms

| Native csv read: nation table | Best Time(ms) | Avg Time(ms) | Stdev(ms) | Rate(M/s) | Per Row(ns) | Relative |
|---|---|---|---|---|---|---|
| Spark | 36 | 36 | 0 | 59,0 | 16,9 | 1,0X |
| Comet | 39 | 39 | 0 | 53,4 | 18,7 | 0,9X |

Running benchmark: Native csv read: part table

- Running case: Spark. Stopped after 1 iterations, 28883 ms
- Running case: Comet. Stopped after 1 iterations, 7870 ms

| Native csv read: part table | Best Time(ms) | Avg Time(ms) | Stdev(ms) | Rate(M/s) | Per Row(ns) | Relative |
|---|---|---|---|---|---|---|
| Spark | 28884 | 28884 | 0 | 0,1 | 13773,0 | 1,0X |
| Comet | 7871 | 7871 | 0 | 0,3 | 3753,2 | 3,7X |

Running benchmark: Native csv read: supplier table

- Running case: Spark. Stopped after 1 iterations, 1380 ms
- Running case: Comet. Stopped after 1 iterations, 413 ms

| Native csv read: supplier table | Best Time(ms) | Avg Time(ms) | Stdev(ms) | Rate(M/s) | Per Row(ns) | Relative |
|---|---|---|---|---|---|---|
| Spark | 1381 | 1381 | 0 | 1,5 | 658,3 | 1,0X |
| Comet | 414 | 414 | 0 | 5,1 | 197,2 | 3,3X |

Running benchmark: Native csv read: partsupp table

- Running case: Spark. Stopped after 1 iterations, 107904 ms
- Running case: Comet. Stopped after 1 iterations, 26308 ms

| Native csv read: partsupp table | Best Time(ms) | Avg Time(ms) | Stdev(ms) | Rate(M/s) | Per Row(ns) | Relative |
|---|---|---|---|---|---|---|
| Spark | 107905 | 107905 | 0 | 0,0 | 51452,9 | 1,0X |
| Comet | 26308 | 26308 | 0 | 0,1 | 12544,8 | 4,1X |

Running benchmark: Native csv read: customer table

- Running case: Spark. Stopped after 1 iterations, 22288 ms
- Running case: Comet. Stopped after 1 iterations, 6089 ms

| Native csv read: customer table | Best Time(ms) | Avg Time(ms) | Stdev(ms) | Rate(M/s) | Per Row(ns) | Relative |
|---|---|---|---|---|---|---|
| Spark | 22289 | 22289 | 0 | 0,1 | 10628,0 | 1,0X |
| Comet | 6090 | 6090 | 0 | 0,3 | 2903,8 | 3,7X |

Running benchmark: Native csv read: lineitem table

- Running case: Spark. Stopped after 1 iterations, 1331928 ms
- Running case: Comet. Stopped after 1 iterations, 342292 ms

| Native csv read: lineitem table | Best Time(ms) | Avg Time(ms) | Stdev(ms) | Rate(M/s) | Per Row(ns) | Relative |
|---|---|---|---|---|---|---|
| Spark | 1331928 | 1331928 | 0 | 0,0 | 635112,8 | 1,0X |
| Comet | 342293 | 342293 | 0 | 0,0 | 163218,0 | 3,9X |

## How are these changes tested?

1. Added a new unit test
2. Added a simple benchmark test
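
The Relative column in the benchmark output above appears to be the Spark best time divided by each case's best time. A minimal Python sketch, assuming that definition (the function name `relative_speedup` is hypothetical and not part of the benchmark harness), reproduces the reported ratios from the Best Time(ms) values of the orders, supplier, and lineitem tables:

```python
def relative_speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Return how many times faster the candidate is than the baseline.

    Assumption: "Relative" = baseline best time / candidate best time.
    """
    if candidate_ms <= 0:
        raise ValueError("candidate_ms must be positive")
    return baseline_ms / candidate_ms


# Best Time(ms) values copied from the benchmark output: (Spark, Comet)
best_times_ms = {
    "orders": (238272, 55327),
    "supplier": (1381, 414),
    "lineitem": (1331928, 342293),
}

for table, (spark_ms, comet_ms) in best_times_ms.items():
    # One decimal place, matching the precision of the Relative column
    print(f"{table}: {relative_speedup(spark_ms, comet_ms):.1f}X")
```

With a single iteration per case and a standard deviation of 0, these ratios are point estimates; the small tables (region, nation) finish in under 40 ms, so their 0,9X values are within the range that run-to-run noise could plausibly explain.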
