This is an automated email from the ASF dual-hosted git repository.

agrove pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion-comet.git


The following commit(s) were added to refs/heads/main by this push:
     new 1f81c381 chore: Update README to highlight Comet benefits (#497)
1f81c381 is described below

commit 1f81c3812490d35b488795ad597e4e6d3f9e114d
Author: Andy Grove <[email protected]>
AuthorDate: Fri May 31 10:36:16 2024 -0600

    chore: Update README to highlight Comet benefits (#497)
    
    * Update README to show current performance info
    
    * Update README.md
    
    Co-authored-by: Oleks V <[email protected]>
    
    * address feedback
    
    * Add hardware info to benchmarking page
    
    * add raw benchmark results
    
    ---------
    
    Co-authored-by: Oleks V <[email protected]>
---
 README.md                                          |  98 ++++++----
 docs/source/_static/images/tpch_allqueries.png     | Bin 0 -> 23677 bytes
 .../source/_static/images/tpch_queries_compare.png | Bin 0 -> 27388 bytes
 .../source/_static/images/tpch_queries_speedup.png | Bin 0 -> 38821 bytes
 .../2024-05-30/comet-8-exec-5-runs.json            | 201 +++++++++++++++++++++
 .../2024-05-30/datafusion-python-8-cores.json      |  73 ++++++++
 .../2024-05-30/spark-8-exec-5-runs.json            | 184 +++++++++++++++++++
 docs/source/contributor-guide/benchmarking.md      | 113 ++++++++----
 8 files changed, 599 insertions(+), 70 deletions(-)

diff --git a/README.md b/README.md
index fb17535a..b0b72fdb 100644
--- a/README.md
+++ b/README.md
@@ -19,58 +19,86 @@ under the License.
 
 # Apache DataFusion Comet
 
-Apache DataFusion Comet is an Apache Spark plugin that uses [Apache DataFusion](https://datafusion.apache.org/)
-as native runtime to achieve improvement in terms of query efficiency and query runtime.
+Apache DataFusion Comet is a high-performance accelerator for Apache Spark, built on top of the powerful
+[Apache DataFusion](https://datafusion.apache.org) query engine. Comet is designed to significantly enhance the
+performance of Apache Spark workloads while leveraging commodity hardware and seamlessly integrating with the
+Spark ecosystem without requiring any code changes.
 
-Comet runs Spark SQL queries using the native DataFusion runtime, which is
-typically faster and more resource efficient than JVM based runtimes.
+# Benefits of Using Comet
 
-<a href="docs/source/_static/images/comet-overview.png"><img src="docs/source/_static/images/comet-system-diagram.png" align="center" width="500" ></a>
+## Run Spark Queries at DataFusion Speeds
 
-Comet aims to support:
+Comet delivers a performance speedup for many queries, enabling faster data processing and shorter time-to-insights.
 
-- a native Parquet implementation, including both reader and writer
-- full implementation of Spark operators, including
-  Filter/Project/Aggregation/Join/Exchange etc.
-- full implementation of Spark built-in expressions
-- a UDF framework for users to migrate their existing UDF to native
+The following chart shows the time it takes to run the 22 TPC-H queries against 100 GB of data in Parquet format
+using a single executor with 8 cores. See the [Comet Benchmarking Guide](https://datafusion.apache.org/comet/contributor-guide/benchmarking.html)
+for details of the environment used for these benchmarks.
 
-## Architecture
+When using Comet, the overall run time is reduced from 649 seconds to 440 seconds, a 1.5x speedup.
 
-The following diagram illustrates the architecture of Comet:
+Running the same queries with DataFusion standalone (without Spark) using the same number of cores results in a 3.9x
+speedup compared to Spark.
 
-<a href="docs/source/_static/images/comet-overview.png"><img src="docs/source/_static/images/comet-overview.png" align="center" height="600" width="750" ></a>
+Comet is not yet achieving full DataFusion speeds in all cases, but with future work we aim to provide a 2x-4x speedup
+for many use cases.
 
-## Current Status
+![](docs/source/_static/images/tpch_allqueries.png)
 
-The project is currently integrated into Apache Spark 3.2, 3.3, and 3.4.
+Here is a breakdown showing the relative performance of Spark, Comet, and DataFusion for each TPC-H query.
 
-## Feature Parity with Apache Spark
+![](docs/source/_static/images/tpch_queries_compare.png)
 
-The project strives to keep feature parity with Apache Spark, that is,
-users should expect the same behavior (w.r.t features, configurations,
-query results, etc) with Comet turned on or turned off in their Spark
-jobs. In addition, Comet extension should automatically detect unsupported
-features and fallback to Spark engine.
+The following chart shows how much Comet currently accelerates each query from the benchmark. Performance optimization
+is an ongoing task, and we welcome contributions from the community to help achieve even greater speedups in the future.
 
-To achieve this, besides unit tests within Comet itself, we also re-use
-Spark SQL tests and make sure they all pass with Comet extension
-enabled.
+![](docs/source/_static/images/tpch_queries_speedup.png)
 
-## Supported Platforms
+These benchmarks can be reproduced in any environment using the documentation in the
+[Comet Benchmarking Guide](https://datafusion.apache.org/comet/contributor-guide/benchmarking.html). We encourage
+you to run your own benchmarks.
 
-Linux, Apple OSX (Intel and M1)
+## Use Commodity Hardware
 
-## Requirements
+Comet leverages commodity hardware, eliminating the need for costly hardware upgrades or
+specialized hardware accelerators, such as GPUs or FPGAs. By maximizing the utilization of commodity hardware, Comet
+ensures cost-effectiveness and scalability for your Spark deployments.
 
-- Apache Spark 3.2, 3.3, or 3.4
-- JDK 8, 11 and 17 (JDK 11 recommended because Spark 3.2 doesn't support 17)
-- GLIBC 2.17 (Centos 7) and up
+## Spark Compatibility
 
-## Getting started
+Comet aims for 100% compatibility with all supported versions of Apache Spark, allowing you to integrate Comet into
+your existing Spark deployments and workflows seamlessly. With no code changes required, you can immediately harness
+the benefits of Comet's acceleration capabilities without disrupting your Spark applications.
 
-See the [DataFusion Comet User Guide](https://datafusion.apache.org/comet/user-guide/installation.html) for installation instructions.
+## Tight Integration with Apache DataFusion
+
+Comet tightly integrates with the core Apache DataFusion project, leveraging its powerful execution engine. With
+seamless interoperability between Comet and DataFusion, you can achieve optimal performance and efficiency in your
+Spark workloads.
+
+## Active Community
+
+Comet boasts a vibrant and active community of developers, contributors, and users dedicated to advancing the
+capabilities of Apache DataFusion and accelerating the performance of Apache Spark.
+
+## Getting Started
+
+To get started with Apache DataFusion Comet, follow the
+[installation instructions](https://datafusion.apache.org/comet/user-guide/installation.html). Join the
+[DataFusion Slack and Discord channels](https://datafusion.apache.org/contributor-guide/communication.html) to connect
+with other users, ask questions, and share your experiences with Comet.
 
 ## Contributing
-See the [DataFusion Comet Contribution Guide](https://datafusion.apache.org/comet/contributor-guide/contributing.html)
-for information on how to get started contributing to the project.
+
+We welcome contributions from the community to help improve and enhance Apache DataFusion Comet. Whether it's fixing
+bugs, adding new features, writing documentation, or optimizing performance, your contributions are invaluable in
+shaping the future of Comet. Check out our
+[contributor guide](https://datafusion.apache.org/comet/contributor-guide/contributing.html) to get started.
+
+## License
+
+Apache DataFusion Comet is licensed under the Apache License 2.0. See the [LICENSE.txt](LICENSE.txt) file for details.
+
+## Acknowledgments
+
+We would like to express our gratitude to the Apache DataFusion community for their support and contributions to
+Comet. Together, we're building a faster, more efficient future for big data processing with Apache Spark.
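
The "no code changes" claim above comes down to launcher configuration: Comet is attached to an existing Spark job by adding the Comet jar and a few `--conf` settings. A minimal sketch, where `$COMET_JAR` and `your_job.py` are placeholders for a Comet jar built for your Spark version and an unmodified Spark application; the config keys are taken from the benchmark configurations later in this commit:

```shell
# Run an unmodified Spark job with Comet attached (sketch; paths are placeholders).
$SPARK_HOME/bin/spark-submit \
    --jars $COMET_JAR \
    --conf spark.driver.extraClassPath=$COMET_JAR \
    --conf spark.executor.extraClassPath=$COMET_JAR \
    --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
    --conf spark.comet.enabled=true \
    --conf spark.comet.exec.enabled=true \
    your_job.py
```

Unsupported features fall back to Spark's own execution, so the same job runs with or without these settings; the benchmark configs in this commit also set `spark.comet.explainFallback.enabled=true` to report which parts fell back.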
diff --git a/docs/source/_static/images/tpch_allqueries.png b/docs/source/_static/images/tpch_allqueries.png
new file mode 100644
index 00000000..a6788d5a
Binary files /dev/null and b/docs/source/_static/images/tpch_allqueries.png differ
diff --git a/docs/source/_static/images/tpch_queries_compare.png b/docs/source/_static/images/tpch_queries_compare.png
new file mode 100644
index 00000000..92768061
Binary files /dev/null and b/docs/source/_static/images/tpch_queries_compare.png differ
diff --git a/docs/source/_static/images/tpch_queries_speedup.png b/docs/source/_static/images/tpch_queries_speedup.png
new file mode 100644
index 00000000..fb417ff1
Binary files /dev/null and b/docs/source/_static/images/tpch_queries_speedup.png differ
diff --git a/docs/source/contributor-guide/benchmark-results/2024-05-30/comet-8-exec-5-runs.json b/docs/source/contributor-guide/benchmark-results/2024-05-30/comet-8-exec-5-runs.json
new file mode 100644
index 00000000..38142151
--- /dev/null
+++ b/docs/source/contributor-guide/benchmark-results/2024-05-30/comet-8-exec-5-runs.json
@@ -0,0 +1,201 @@
+{
+    "engine": "datafusion-comet",
+    "benchmark": "tpch",
+    "data_path": "/mnt/bigdata/tpch/sf100/",
+    "query_path": "../../tpch/queries",
+    "spark_conf": {
+        "spark.comet.explainFallback.enabled": "true",
+        "spark.jars": "file:///home/andy/git/apache/datafusion-comet/spark/target/comet-spark-spark3.4_2.12-0.1.0-SNAPSHOT.jar",
+        "spark.comet.cast.allowIncompatible": "true",
+        "spark.executor.extraClassPath": "/home/andy/git/apache/datafusion-comet/spark/target/comet-spark-spark3.4_2.12-0.1.0-SNAPSHOT.jar",
+        "spark.executor.memory": "8G",
+        "spark.comet.exec.shuffle.enabled": "true",
+        "spark.app.name": "DataFusion Comet Benchmark derived from TPC-H / TPC-DS",
+        "spark.driver.port": "36573",
+        "spark.sql.adaptive.coalescePartitions.enabled": "false",
+        "spark.app.startTime": "1716923498046",
+        "spark.comet.batchSize": "8192",
+        "spark.app.id": "app-20240528131138-0043",
+        "spark.serializer.objectStreamReset": "100",
+        "spark.app.initial.jar.urls": "spark://woody.lan:36573/jars/comet-spark-spark3.4_2.12-0.1.0-SNAPSHOT.jar",
+        "spark.submit.deployMode": "client",
+        "spark.sql.autoBroadcastJoinThreshold": "-1",
+        "spark.comet.exec.all.enabled": "true",
+        "spark.eventLog.enabled": "false",
+        "spark.driver.host": "woody.lan",
+        "spark.driver.extraJavaOptions": "-Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add- [...]
+        "spark.sql.warehouse.dir": "file:/home/andy/git/apache/datafusion-benchmarks/runners/datafusion-comet/spark-warehouse",
+        "spark.shuffle.manager": "org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager",
+        "spark.comet.exec.enabled": "true",
+        "spark.repl.local.jars": "file:///home/andy/git/apache/datafusion-comet/spark/target/comet-spark-spark3.4_2.12-0.1.0-SNAPSHOT.jar",
+        "spark.executor.id": "driver",
+        "spark.master": "spark://woody:7077",
+        "spark.executor.instances": "8",
+        "spark.comet.exec.shuffle.mode": "auto",
+        "spark.sql.extensions": "org.apache.comet.CometSparkSessionExtensions",
+        "spark.driver.memory": "8G",
+        "spark.driver.extraClassPath": "/home/andy/git/apache/datafusion-comet/spark/target/comet-spark-spark3.4_2.12-0.1.0-SNAPSHOT.jar",
+        "spark.rdd.compress": "True",
+        "spark.executor.extraJavaOptions": "-Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --ad [...]
+        "spark.cores.max": "8",
+        "spark.comet.enabled": "true",
+        "spark.app.submitTime": "1716923497738",
+        "spark.submit.pyFiles": "",
+        "spark.executor.cores": "1",
+        "spark.comet.parquet.io.enabled": "false"
+    },
+    "1": [
+        32.121661901474,
+        27.997092485427856,
+        27.756758451461792,
+        28.55236315727234,
+        28.332542181015015
+    ],
+    "2": [
+        18.269107580184937,
+        16.200955629348755,
+        16.194639682769775,
+        16.745808839797974,
+        16.59864115715027
+    ],
+    "3": [
+        17.265466690063477,
+        17.069786310195923,
+        17.12887978553772,
+        19.33678102493286,
+        18.182055234909058
+    ],
+    "4": [
+        8.367004156112671,
+        8.172023296356201,
+        8.023266077041626,
+        8.350765228271484,
+        8.258736610412598
+    ],
+    "5": [
+        34.10048794746399,
+        32.69314408302307,
+        33.21383595466614,
+        36.391114473342896,
+        39.00048065185547
+    ],
+    "6": [
+        3.1693499088287354,
+        3.044705390930176,
+        3.047694206237793,
+        3.2817511558532715,
+        3.274174928665161
+    ],
+    "7": [
+        25.369214296340942,
+        24.020941257476807,
+        24.0787034034729,
+        28.47402787208557,
+        28.23443365097046
+    ],
+    "8": [
+        40.06126809120178,
+        39.828824281692505,
+        45.250510454177856,
+        44.406742572784424,
+        48.98451232910156
+    ],
+    "9": [
+        62.822797775268555,
+        61.26328158378601,
+        64.95581865310669,
+        69.51708793640137,
+        73.52380013465881
+    ],
+    "10": [
+        20.55334782600403,
+        20.546096324920654,
+        20.57452392578125,
+        22.84211039543152,
+        23.724371671676636
+    ],
+    "11": [
+        11.068235158920288,
+        10.715423822402954,
+        11.353424310684204,
+        11.37632942199707,
+        11.530814170837402
+    ],
+    "12": [
+        10.264788389205933,
+        8.67864990234375,
+        8.845952033996582,
+        8.593009233474731,
+        8.540803909301758
+    ],
+    "13": [
+        9.603406190872192,
+        9.648627042770386,
+        13.040799140930176,
+        10.154011249542236,
+        9.716034412384033
+    ],
+    "14": [
+        6.20926308631897,
+        6.0385496616363525,
+        7.674488544464111,
+        10.53052043914795,
+        7.661675691604614
+    ],
+    "15": [
+        11.466301918029785,
+        11.473632097244263,
+        11.279382228851318,
+        13.291078329086304,
+        12.81026816368103
+    ],
+    "16": [
+        8.096073865890503,
+        7.73410701751709,
+        7.742897272109985,
+        8.477537631988525,
+        7.821273326873779
+    ],
+    "17": [
+        43.69264578819275,
+        43.33040428161621,
+        46.291987657547,
+        54.654345989227295,
+        54.37124800682068
+    ],
+    "18": [
+        27.205485105514526,
+        26.785916090011597,
+        27.331408262252808,
+        29.946768760681152,
+        28.037617444992065
+    ],
+    "19": [
+        8.100102186203003,
+        7.845783472061157,
+        8.52329158782959,
+        8.907397985458374,
+        9.13755488395691
+    ],
+    "20": [
+        13.09695029258728,
+        12.683861255645752,
+        15.612725019454956,
+        13.361177206039429,
+        16.614356517791748
+    ],
+    "21": [
+        43.69623780250549,
+        43.26758122444153,
+        46.91650056838989,
+        47.875754833221436,
+        57.9763662815094
+    ],
+    "22": [
+        4.5090577602386475,
+        4.420571804046631,
+        4.639787673950195,
+        5.118046998977661,
+        5.017346143722534
+    ]
+}
\ No newline at end of file
diff --git a/docs/source/contributor-guide/benchmark-results/2024-05-30/datafusion-python-8-cores.json b/docs/source/contributor-guide/benchmark-results/2024-05-30/datafusion-python-8-cores.json
new file mode 100644
index 00000000..f032d536
--- /dev/null
+++ b/docs/source/contributor-guide/benchmark-results/2024-05-30/datafusion-python-8-cores.json
@@ -0,0 +1,73 @@
+{
+    "engine": "datafusion-python",
+    "datafusion-version": "38.0.1",
+    "benchmark": "tpch",
+    "data_path": "/mnt/bigdata/tpch/sf100/",
+    "query_path": "../../tpch/queries/",
+    "1": [
+        7.410699844360352
+    ],
+    "2": [
+        2.966364622116089
+    ],
+    "3": [
+        3.988652467727661
+    ],
+    "4": [
+        1.8821499347686768
+    ],
+    "5": [
+        6.957948684692383
+    ],
+    "6": [
+        1.779731273651123
+    ],
+    "7": [
+        14.559604167938232
+    ],
+    "8": [
+        7.062309265136719
+    ],
+    "9": [
+        14.908353805541992
+    ],
+    "10": [
+        7.73533296585083
+    ],
+    "11": [
+        2.346423387527466
+    ],
+    "12": [
+        2.7248904705047607
+    ],
+    "13": [
+        6.38663387298584
+    ],
+    "14": [
+        2.4675676822662354
+    ],
+    "15": [
+        4.799000024795532
+    ],
+    "16": [
+        1.9091999530792236
+    ],
+    "17": [
+        19.230653762817383
+    ],
+    "18": [
+        25.15683078765869
+    ],
+    "19": [
+        4.2268781661987305
+    ],
+    "20": [
+        8.66620659828186
+    ],
+    "21": [
+        17.696006059646606
+    ],
+    "22": [
+        1.3805692195892334
+    ]
+}
\ No newline at end of file
diff --git a/docs/source/contributor-guide/benchmark-results/2024-05-30/spark-8-exec-5-runs.json b/docs/source/contributor-guide/benchmark-results/2024-05-30/spark-8-exec-5-runs.json
new file mode 100644
index 00000000..012b05c3
--- /dev/null
+++ b/docs/source/contributor-guide/benchmark-results/2024-05-30/spark-8-exec-5-runs.json
@@ -0,0 +1,184 @@
+{
+    "engine": "datafusion-comet",
+    "benchmark": "tpch",
+    "data_path": "/mnt/bigdata/tpch/sf100/",
+    "query_path": "../../tpch/queries",
+    "spark_conf": {
+        "spark.driver.extraJavaOptions": "-Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add- [...]
+        "spark.sql.warehouse.dir": "file:/home/andy/git/apache/datafusion-benchmarks/runners/datafusion-comet/spark-warehouse",
+        "spark.app.id": "app-20240528090804-0041",
+        "spark.app.submitTime": "1716908883258",
+        "spark.executor.memory": "8G",
+        "spark.master": "spark://woody:7077",
+        "spark.executor.id": "driver",
+        "spark.executor.instances": "8",
+        "spark.app.name": "DataFusion Comet Benchmark derived from TPC-H / TPC-DS",
+        "spark.driver.memory": "8G",
+        "spark.rdd.compress": "True",
+        "spark.executor.extraJavaOptions": "-Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --ad [...]
+        "spark.serializer.objectStreamReset": "100",
+        "spark.cores.max": "8",
+        "spark.submit.pyFiles": "",
+        "spark.executor.cores": "1",
+        "spark.submit.deployMode": "client",
+        "spark.sql.autoBroadcastJoinThreshold": "-1",
+        "spark.eventLog.enabled": "false",
+        "spark.app.startTime": "1716908883579",
+        "spark.driver.port": "33725",
+        "spark.driver.host": "woody.lan"
+    },
+    "1": [
+        76.91316103935242,
+        79.55859923362732,
+        81.10397529602051,
+        79.01998662948608,
+        79.1286551952362
+    ],
+    "2": [
+        23.977370262145996,
+        22.214473247528076,
+        22.686659812927246,
+        22.016682386398315,
+        21.766324520111084
+    ],
+    "3": [
+        22.700742721557617,
+        21.980144739151,
+        21.876065969467163,
+        21.661516189575195,
+        21.69345998764038
+    ],
+    "4": [
+        17.377647638320923,
+        16.249598264694214,
+        16.15747308731079,
+        16.128843069076538,
+        16.04338026046753
+    ],
+    "5": [
+        44.38863182067871,
+        45.47764492034912,
+        45.76063895225525,
+        45.16393995285034,
+        60.848369121551514
+    ],
+    "6": [
+        3.2041075229644775,
+        2.970944881439209,
+        2.891291856765747,
+        2.9719409942626953,
+        3.0702600479125977
+    ],
+    "7": [
+        24.369274377822876,
+        24.684266567230225,
+        24.146574020385742,
+        24.023175716400146,
+        30.56047773361206
+    ],
+    "8": [
+        46.46081209182739,
+        45.9838604927063,
+        46.341185092926025,
+        45.833823919296265,
+        46.61182403564453
+    ],
+    "9": [
+        67.67960548400879,
+        67.34667444229126,
+        70.34601259231567,
+        71.24095153808594,
+        84.38811421394348
+    ],
+    "10": [
+        19.16477870941162,
+        19.081010580062866,
+        19.501060009002686,
+        19.165698528289795,
+        20.216782331466675
+    ],
+    "11": [
+        17.158706426620483,
+        17.05184030532837,
+        17.714542150497437,
+        17.004602909088135,
+        17.700096130371094
+    ],
+    "12": [
+        11.654477834701538,
+        11.805298805236816,
+        11.822469234466553,
+        12.79678750038147,
+        13.64478850364685
+    ],
+    "13": [
+        20.430822372436523,
+        20.18759250640869,
+        21.26596975326538,
+        21.234288454055786,
+        20.189200162887573
+    ],
+    "14": [
+        5.60215950012207,
+        5.160705089569092,
+        5.080057382583618,
+        4.937625408172607,
+        5.853632688522339
+    ],
+    "15": [
+        14.17775845527649,
+        13.898571729660034,
+        14.215840578079224,
+        14.316090106964111,
+        14.356236457824707
+    ],
+    "16": [
+        6.252386808395386,
+        6.010213375091553,
+        6.054978370666504,
+        5.886059522628784,
+        5.923115253448486
+    ],
+    "17": [
+        71.41593313217163,
+        70.25399804115295,
+        72.07622528076172,
+        72.27566242218018,
+        72.20579051971436
+    ],
+    "18": [
+        65.72738265991211,
+        65.47461080551147,
+        67.14260482788086,
+        65.95489883422852,
+        69.51795554161072
+    ],
+    "19": [
+        7.1520891189575195,
+        6.516514301300049,
+        6.580992698669434,
+        6.486274242401123,
+        6.418147087097168
+    ],
+    "20": [
+        12.619760036468506,
+        12.235978126525879,
+        12.116347551345825,
+        12.161245584487915,
+        12.30910348892212
+    ],
+    "21": [
+        60.795483350753784,
+        60.484593629837036,
+        61.27316427230835,
+        60.475560426712036,
+        81.21473670005798
+    ],
+    "22": [
+        8.926804065704346,
+        8.113754034042358,
+        8.029133796691895,
+        7.99291467666626,
+        8.439452648162842
+    ]
+}
\ No newline at end of file
diff --git a/docs/source/contributor-guide/benchmarking.md b/docs/source/contributor-guide/benchmarking.md
index 502b35c2..3e9a61ef 100644
--- a/docs/source/contributor-guide/benchmarking.md
+++ b/docs/source/contributor-guide/benchmarking.md
@@ -19,44 +19,87 @@ under the License.
 
 # Comet Benchmarking Guide
 
-To track progress on performance, we regularly run benchmarks derived from TPC-H and TPC-DS. Benchmarking scripts are
-available in the [DataFusion Benchmarks](https://github.com/apache/datafusion-benchmarks) GitHub repository.
+To track progress on performance, we regularly run benchmarks derived from TPC-H and TPC-DS. Data generation and
+benchmarking documentation and scripts are available in the [DataFusion Benchmarks](https://github.com/apache/datafusion-benchmarks) GitHub repository.
 
-Here is an example command for running the benchmarks. This command will need to be adapted based on the Spark
-environment and location of data files.
+Here are example commands for running the benchmarks against a Spark cluster. These commands will need to be
+adapted based on the Spark environment and the location of data files.
 
-This command assumes that `datafusion-benchmarks` is checked out in a parallel directory to `datafusion-comet`.
+These commands are intended to be run from the `runners/datafusion-comet` directory in the `datafusion-benchmarks`
+repository.
+
+## Running Benchmarks Against Apache Spark
 
 ```shell
-$SPARK_HOME/bin/spark-submit \
-    --master "local[*]" \
-    --conf spark.driver.memory=8G \
-    --conf spark.executor.memory=64G \
-    --conf spark.executor.cores=16 \
-    --conf spark.cores.max=16 \
-    --conf spark.eventLog.enabled=true \
-    --conf spark.sql.autoBroadcastJoinThreshold=-1 \
-    --jars $COMET_JAR \
-    --conf spark.driver.extraClassPath=$COMET_JAR \
-    --conf spark.executor.extraClassPath=$COMET_JAR \
-    --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
-    --conf spark.comet.enabled=true \
-    --conf spark.comet.exec.enabled=true \
-    --conf spark.comet.exec.all.enabled=true \
-    --conf spark.comet.cast.allowIncompatible=true \
-    --conf spark.comet.explainFallback.enabled=true \
-    --conf spark.comet.parquet.io.enabled=false \
-    --conf spark.comet.batchSize=8192 \
-    --conf spark.comet.columnar.shuffle.enabled=false \
-    --conf spark.comet.exec.shuffle.enabled=true \
-    --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
-    --conf spark.sql.adaptive.coalescePartitions.enabled=false \
-    --conf spark.comet.shuffle.enforceMode.enabled=true \
-    ../datafusion-benchmarks/runners/datafusion-comet/tpcbench.py \
-    --benchmark tpch \
-    --data /mnt/bigdata/tpch/sf100-parquet/ \
-    --queries ../datafusion-benchmarks/tpch/queries
+$SPARK_HOME/bin/spark-submit \
+    --master $SPARK_MASTER \
+    --conf spark.driver.memory=8G \
+    --conf spark.executor.memory=32G \
+    --conf spark.executor.cores=8 \
+    --conf spark.cores.max=8 \
+    --conf spark.sql.autoBroadcastJoinThreshold=-1 \
+    tpcbench.py \
+    --benchmark tpch \
+    --data /mnt/bigdata/tpch/sf100/ \
+    --queries ../../tpch/queries \
+    --iterations 5
 ```
 
-Comet performance can be compared to regular Spark performance by running the benchmark twice, once with
-`spark.comet.enabled` set to `true` and once with it set to `false`.
\ No newline at end of file
+## Running Benchmarks Against Apache Spark with Apache DataFusion Comet Enabled
+
+```shell
+$SPARK_HOME/bin/spark-submit \
+    --master $SPARK_MASTER \
+    --conf spark.driver.memory=8G \
+    --conf spark.executor.memory=64G \
+    --conf spark.executor.cores=8 \
+    --conf spark.cores.max=8 \
+    --conf spark.sql.autoBroadcastJoinThreshold=-1 \
+    --jars $COMET_JAR \
+    --conf spark.driver.extraClassPath=$COMET_JAR \
+    --conf spark.executor.extraClassPath=$COMET_JAR \
+    --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
+    --conf spark.comet.enabled=true \
+    --conf spark.comet.exec.enabled=true \
+    --conf spark.comet.exec.all.enabled=true \
+    --conf spark.comet.cast.allowIncompatible=true \
+    --conf spark.comet.explainFallback.enabled=true \
+    --conf spark.comet.parquet.io.enabled=false \
+    --conf spark.comet.batchSize=8192 \
+    --conf spark.comet.exec.shuffle.enabled=true \
+    --conf spark.comet.exec.shuffle.mode=auto \
+    --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
+    --conf spark.sql.adaptive.coalescePartitions.enabled=false \
+    tpcbench.py \
+    --benchmark tpch \
+    --data /mnt/bigdata/tpch/sf100/ \
+    --queries ../../tpch/queries \
+    --iterations 5
+```
+
+## Current Performance
+
+Comet is not yet achieving full DataFusion speeds in all cases, but with future work we aim to provide a 2x-4x speedup
+for many use cases.
+
+The following benchmarks were performed on a Linux workstation with PCIe 5, AMD 7950X CPU (16 cores), 128 GB RAM, and
+data stored locally on NVMe storage. Performance characteristics will vary in different environments, and we encourage
+you to run these benchmarks in your own environments.
+
+![](../../_static/images/tpch_allqueries.png)
+
+Here is a breakdown showing the relative performance of Spark, Comet, and DataFusion for each TPC-H query.
+
+![](../../_static/images/tpch_queries_compare.png)
+
+The following chart shows how much Comet currently accelerates each query from the benchmark. Performance optimization
+is an ongoing task, and we welcome contributions from the community to help achieve even greater speedups in the future.
+
+![](../../_static/images/tpch_queries_speedup.png)
+
+The raw results of these benchmarks in JSON format are available here:
+
+- [Spark](./benchmark-results/2024-05-30/spark-8-exec-5-runs.json)
+- [Comet](./benchmark-results/2024-05-30/comet-8-exec-5-runs.json)
+- [DataFusion](./benchmark-results/2024-05-30/datafusion-python-8-cores.json)
+
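
The raw JSON files linked above share a simple shape: some run metadata plus one entry per TPC-H query ("1" through "22") holding a list of per-run times in seconds. The sketch below shows how headline totals and speedups can be derived from that shape; the `total_time` helper and the sample numbers are illustrative, not part of the benchmark scripts, and this commit does not state whether the published 649 s / 440 s totals take the minimum, median, or another reducer over the five runs.

```python
# Each results file maps query number ("1" through "22") to a list of
# per-run times in seconds; the real files would be loaded with json.load(open(path)).

def total_time(results: dict, pick=min) -> float:
    """Sum one representative runtime per TPC-H query (reducer is configurable)."""
    return sum(pick(results[str(q)]) for q in range(1, 23))

# Stand-in data mimicking the shape of the committed files (hypothetical numbers).
spark = {str(q): [3.0, 3.1, 2.9] for q in range(1, 23)}
comet = {str(q): [2.0, 2.1, 1.9] for q in range(1, 23)}

speedup = total_time(spark) / total_time(comet)
print(f"Comet speedup: {speedup:.2f}x")  # prints "Comet speedup: 1.53x" for this stand-in data
```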

