This is an automated email from the ASF dual-hosted git repository.
agrove pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion-comet.git
The following commit(s) were added to refs/heads/main by this push:
new a340748 docs: Move existing documentation into new Contributor Guide and add Getting Started section (#334)
a340748 is described below
commit a340748c3ba526e6994b8e00248e261b8be9abe5
Author: Andy Grove <[email protected]>
AuthorDate: Fri Apr 26 16:43:18 2024 -0600
docs: Move existing documentation into new Contributor Guide and add Getting Started section (#334)
---
.github/workflows/benchmark-tpch.yml | 2 +
.github/workflows/benchmark.yml | 2 +
.github/workflows/pr_build.yml | 2 +
.github/workflows/spark_sql_test.yml | 2 +
EXPRESSIONS.md | 109 ---------------------
docs/source/contributor-guide/contributing.md | 52 ++++++++++
.../source/contributor-guide/debugging.md | 46 +++++----
.../source/contributor-guide/development.md | 20 ++--
docs/source/index.rst | 15 ++-
docs/source/{ => user-guide}/compatibility.md | 2 +-
docs/source/user-guide/expressions.md | 109 +++++++++++++++++++++
11 files changed, 221 insertions(+), 140 deletions(-)
diff --git a/.github/workflows/benchmark-tpch.yml b/.github/workflows/benchmark-tpch.yml
index 54f9941..fbf5cfd 100644
--- a/.github/workflows/benchmark-tpch.yml
+++ b/.github/workflows/benchmark-tpch.yml
@@ -25,10 +25,12 @@ on:
push:
paths-ignore:
- "doc/**"
+ - "docs/**"
- "**.md"
pull_request:
paths-ignore:
- "doc/**"
+ - "docs/**"
- "**.md"
# manual trigger
# https://docs.github.com/en/actions/managing-workflow-runs/manually-running-a-workflow
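For readers skimming the patch: the four identical workflow edits in this commit make pushes and pull requests that touch only documentation skip CI. A minimal sketch of the resulting trigger section follows; the real workflow files contain additional jobs and settings, and the `workflow_dispatch` entry is inferred from the "manual trigger" comment rather than shown in the hunk.

```yaml
# Sketch of the trigger section after this patch (not the full workflow file).
on:
  push:
    paths-ignore:
      - "doc/**"   # pre-existing ignore pattern
      - "docs/**"  # added by this commit for the new docs site layout
      - "**.md"
  pull_request:
    paths-ignore:
      - "doc/**"
      - "docs/**"
      - "**.md"
  # manual trigger (assumed from the comment in the diff)
  workflow_dispatch:
```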
diff --git a/.github/workflows/benchmark.yml b/.github/workflows/benchmark.yml
index 8b3ae7c..e9767f7 100644
--- a/.github/workflows/benchmark.yml
+++ b/.github/workflows/benchmark.yml
@@ -25,10 +25,12 @@ on:
push:
paths-ignore:
- "doc/**"
+ - "docs/**"
- "**.md"
pull_request:
paths-ignore:
- "doc/**"
+ - "docs/**"
- "**.md"
# manual trigger
# https://docs.github.com/en/actions/managing-workflow-runs/manually-running-a-workflow
diff --git a/.github/workflows/pr_build.yml b/.github/workflows/pr_build.yml
index 1c1baf6..71eb02a 100644
--- a/.github/workflows/pr_build.yml
+++ b/.github/workflows/pr_build.yml
@@ -25,10 +25,12 @@ on:
push:
paths-ignore:
- "doc/**"
+ - "docs/**"
- "**.md"
pull_request:
paths-ignore:
- "doc/**"
+ - "docs/**"
- "**.md"
# manual trigger
# https://docs.github.com/en/actions/managing-workflow-runs/manually-running-a-workflow
diff --git a/.github/workflows/spark_sql_test.yml b/.github/workflows/spark_sql_test.yml
index 5c460b7..94958b4 100644
--- a/.github/workflows/spark_sql_test.yml
+++ b/.github/workflows/spark_sql_test.yml
@@ -25,10 +25,12 @@ on:
push:
paths-ignore:
- "doc/**"
+ - "docs/**"
- "**.md"
pull_request:
paths-ignore:
- "doc/**"
+ - "docs/**"
- "**.md"
# manual trigger
# https://docs.github.com/en/actions/managing-workflow-runs/manually-running-a-workflow
diff --git a/EXPRESSIONS.md b/EXPRESSIONS.md
deleted file mode 100644
index f0a2f69..0000000
--- a/EXPRESSIONS.md
+++ /dev/null
@@ -1,109 +0,0 @@
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements. See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership. The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied. See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-# Expressions Supported by Comet
-
-The following Spark expressions are currently available:
-
-+ Literals
-+ Arithmetic Operators
- + UnaryMinus
- + Add/Minus/Multiply/Divide/Remainder
-+ Conditional functions
- + Case When
- + If
-+ Cast
-+ Coalesce
-+ BloomFilterMightContain
-+ Boolean functions
- + And
- + Or
- + Not
- + EqualTo
- + EqualNullSafe
- + GreaterThan
- + GreaterThanOrEqual
- + LessThan
- + LessThanOrEqual
- + IsNull
- + IsNotNull
- + In
-+ String functions
- + Substring
- + Coalesce
- + StringSpace
- + Like
- + Contains
- + Startswith
- + Endswith
- + Ascii
- + Bit_length
- + Octet_length
- + Upper
- + Lower
- + Chr
- + Initcap
- + Trim/Btrim/Ltrim/Rtrim
- + Concat_ws
- + Repeat
- + Length
- + Reverse
- + Instr
- + Replace
- + Translate
-+ Bitwise functions
- + Shiftright/Shiftleft
-+ Date/Time functions
- + Year/Hour/Minute/Second
-+ Math functions
- + Abs
- + Acos
- + Asin
- + Atan
- + Atan2
- + Cos
- + Exp
- + Ln
- + Log10
- + Log2
- + Pow
- + Round
- + Signum
- + Sin
- + Sqrt
- + Tan
- + Ceil
- + Floor
-+ Aggregate functions
- + Count
- + Sum
- + Max
- + Min
- + Avg
- + First
- + Last
- + BitAnd
- + BitOr
- + BitXor
- + BoolAnd
- + BoolOr
- + CovPopulation
- + CovSample
- + VariancePop
- + VarianceSamp
diff --git a/docs/source/contributor-guide/contributing.md b/docs/source/contributor-guide/contributing.md
new file mode 100644
index 0000000..2262692
--- /dev/null
+++ b/docs/source/contributor-guide/contributing.md
@@ -0,0 +1,52 @@
+<!---
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+-->
+
+# Contributing to Apache DataFusion Comet
+
+We welcome contributions to Comet in many areas, and encourage new contributors to get involved.
+
+Here are some areas where you can help:
+
+- Testing Comet with existing Spark jobs and reporting issues for any bugs or performance issues
+- Contributing code to support Spark expressions, operators, and data types that are not currently supported
+- Reviewing pull requests and helping to test new features for correctness and performance
+- Improving documentation
+
+## Finding issues to work on
+
+We maintain a list of good first issues in GitHub [here](https://github.com/apache/datafusion-comet/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22).
+
+## Reporting issues
+
+We use [GitHub issues](https://github.com/apache/datafusion-comet/issues) for bug reports and feature requests.
+
+## Asking for Help
+
+The Comet project uses the same Slack and Discord channels as the main Apache DataFusion project. See details at
+[Apache DataFusion Communications]. There are dedicated Comet channels in both Slack and Discord.
+
+## Regular public meetings
+
+The Comet contributors hold regular video calls where new and current contributors are welcome to ask questions and
+coordinate on issues that they are working on.
+
+See the [Apache DataFusion Comet community meeting] Google document for more information.
+
+[Apache DataFusion Communications]: https://datafusion.apache.org/contributor-guide/communication.html
+[Apache DataFusion Comet community meeting]: https://docs.google.com/document/d/1NBpkIAuU7O9h8Br5CbFksDhX-L9TyO9wmGLPMe0Plc8/edit?usp=sharing
diff --git a/DEBUGGING.md b/docs/source/contributor-guide/debugging.md
similarity index 87%
rename from DEBUGGING.md
rename to docs/source/contributor-guide/debugging.md
index 754316a..3b20ed0 100644
--- a/DEBUGGING.md
+++ b/docs/source/contributor-guide/debugging.md
@@ -20,12 +20,13 @@ under the License.
# Comet Debugging Guide
This HOWTO describes how to debug JVM code and Native code concurrently. The guide assumes you have:
+
1. Intellij as the Java IDE
2. CLion as the Native IDE. For Rust code, the CLion Rust language plugin is required. Note that the
-Intellij Rust plugin is not sufficient.
+ Intellij Rust plugin is not sufficient.
3. CLion/LLDB as the native debugger. CLion ships with a bundled LLDB and the Rust community has
-its own packaging of LLDB (`lldb-rust`). Both provide a better display of Rust symbols than plain
-LLDB or the LLDB that is bundled with XCode. We will use the LLDB packaged with CLion for this guide.
+ its own packaging of LLDB (`lldb-rust`). Both provide a better display of Rust symbols than plain
+ LLDB or the LLDB that is bundled with XCode. We will use the LLDB packaged with CLion for this guide.
4. We will use a Comet _unit_ test as the canonical use case.
_Caveat: The steps here have only been tested with JDK 11_ on Mac (M1)
@@ -42,21 +43,24 @@ use advanced `lldb` debugging.
1. Add a Debug Configuration for the unit test
1. In the Debug Configuration for that unit test add `-Xint` as a JVM parameter. This option is
-undocumented *magic*. Without this, the LLDB debugger hits a EXC_BAD_ACCESS (or EXC_BAD_INSTRUCTION) from
-which one cannot recover.
+ undocumented _magic_. Without this, the LLDB debugger hits a EXC_BAD_ACCESS (or EXC_BAD_INSTRUCTION) from
+ which one cannot recover.
+
+1. Add a println to the unit test to print the PID of the JVM process. (jps can also be used but this is less error prone if you have multiple jvm processes running)
+
+ ```JDK8
+ println("Waiting for Debugger: PID - ", ManagementFactory.getRuntimeMXBean().getName())
+ ```
+
+ This will print something like : `PID@your_machine_name`.
-1. Add a println to the unit test to print the PID of the JVM process. (jps can also be used but this is less error prone if you have multiple jvm processes running)
- ``` JDK8
- println("Waiting for Debugger: PID - ", ManagementFactory.getRuntimeMXBean().getName())
- ```
- This will print something like : `PID@your_machine_name`.
+ For JDK9 and newer
- For JDK9 and newer
- ```JDK9
- println("Waiting for Debugger: PID - ", ProcessHandle.current.pid)
- ```
+ ```JDK9
+ println("Waiting for Debugger: PID - ", ProcessHandle.current.pid)
+ ```
- ==> Note the PID
+ ==> Note the PID
1. Debug-run the test in Intellij and wait for the breakpoint to be hit
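As a cross-check of the PID-printing step in the hunk above, here is a hedged, self-contained Java sketch (not part of the patch; the class name is invented for illustration) showing both the JDK 8 MX-bean approach and the JDK 9+ `ProcessHandle` API:

```java
import java.lang.management.ManagementFactory;

// Hypothetical standalone demo of the two PID-discovery approaches the
// guide mentions; in Comet you would put the println inside the unit
// test itself rather than in a separate class.
public class PidDemo {
    public static void main(String[] args) {
        // JDK 8: the runtime MX bean name has the form "pid@hostname"
        String beanName = ManagementFactory.getRuntimeMXBean().getName();
        long jdk8Pid = Long.parseLong(beanName.split("@")[0]);

        // JDK 9+: ProcessHandle exposes the PID directly
        long jdk9Pid = ProcessHandle.current().pid();

        // Both approaches identify the same JVM process
        System.out.println("Waiting for Debugger: PID - " + jdk9Pid);
    }
}
```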
@@ -96,7 +100,8 @@ Detecting the debugger
https://stackoverflow.com/questions/5393403/can-a-java-application-detect-that-a-debugger-is-attached#:~:text=No.,to%20let%20your%20app%20continue.&text=I%20know%20that%20those%20are,meant%20with%20my%20first%20phrase).
# Verbose debug
-By default, Comet outputs the exception details specific for Comet.
+
+By default, Comet outputs the exception details specific for Comet.
```scala
scala> spark.sql("my_failing_query").show(false)
@@ -112,7 +117,7 @@ This was likely caused by a bug in DataFusion's code and we would welcome that y
```
There is a verbose exception option by leveraging DataFusion [backtraces](https://arrow.apache.org/datafusion/user-guide/example-usage.html#enable-backtraces)
-This option allows to append native DataFusion stacktrace to the original error message.
+This option allows to append native DataFusion stacktrace to the original error message.
To enable this option with Comet it is needed to include `backtrace` feature in [Cargo.toml](https://github.com/apache/arrow-datafusion-comet/blob/main/core/Cargo.toml) for DataFusion dependencies
```
@@ -129,15 +134,16 @@ RUST_BACKTRACE=1 $SPARK_HOME/spark-shell --jars spark/target/comet-spark-spark3.
```
Get the expanded exception details
+
```scala
scala> spark.sql("my_failing_query").show(false)
24/03/05 17:00:49 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.comet.CometNativeException: Internal error: MIN/MAX is not expected to receive scalars of incompatible types (Date32("NULL"), Int32(15901))
-backtrace:
+backtrace:
0: std::backtrace::Backtrace::create
1: datafusion_physical_expr::aggregate::min_max::min
- 2: <datafusion_physical_expr::aggregate::min_max::MinAccumulator as datafusion_expr::accumulator::Accumulator>::update_batch
+ 2: <datafusion_physical_expr::aggregate::min_max::MinAccumulator as datafusion_expr::accumulator::Accumulator>::update_batch
3: <futures_util::stream::stream::fuse::Fuse<S> as futures_core::stream::Stream>::poll_next
4: comet::execution::jni_api::Java_org_apache_comet_Native_executePlan::{{closure}}
5: _Java_org_apache_comet_Native_executePlan
@@ -151,6 +157,8 @@ at org.apache.comet.CometExecIterator.hasNext(CometExecIterator.scala:126)
(reduced)
```
+
Note:
+
- The backtrace coverage in DataFusion is still improving. So there is a chance the error still not covered, if so feel free to file a [ticket](https://github.com/apache/arrow-datafusion/issues)
- The backtrace evaluation comes with performance cost and intended mostly for debugging purposes
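The debugging guide above says to enable the `backtrace` feature on the DataFusion dependencies in `Cargo.toml`, but the feature-flag hunk itself is not visible in this email. As a hedged sketch only (the version number and exact dependency names are placeholder assumptions, not taken from the patch), the change would look something like:

```toml
# Hypothetical Cargo.toml fragment enabling DataFusion backtraces.
# The version shown is a placeholder; match whatever core/Cargo.toml pins.
[dependencies]
datafusion = { version = "36.0.0", features = ["backtrace"] }
```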
diff --git a/DEVELOPMENT.md b/docs/source/contributor-guide/development.md
similarity index 92%
rename from DEVELOPMENT.md
rename to docs/source/contributor-guide/development.md
index 6dc0f1f..63146c1 100644
--- a/DEVELOPMENT.md
+++ b/docs/source/contributor-guide/development.md
@@ -49,25 +49,29 @@ A few common commands are specified in project's `Makefile`:
- `make clean`: clean up the workspace
- `bin/comet-spark-shell -d . -o spark/target/` run Comet spark shell for V1 datasources
- `bin/comet-spark-shell -d . -o spark/target/ --conf spark.sql.sources.useV1SourceList=""` run Comet spark shell for V2 datasources
-
+
## Development Environment
+
Comet is a multi-language project with native code written in Rust and JVM code written in Java and Scala.
-For Rust code, the CLion IDE is recommended. For JVM code, IntelliJ IDEA is recommended.
+For Rust code, the CLion IDE is recommended. For JVM code, IntelliJ IDEA is recommended.
Before opening the project in an IDE, make sure to run `make` first to generate the necessary files for the IDEs. Currently, it's mostly about generating protobuf message classes for the JVM side. It's only required to run `make` once after cloning the repo.
### IntelliJ IDEA
-First make sure to install the Scala plugin in IntelliJ IDEA.
+
+First make sure to install the Scala plugin in IntelliJ IDEA.
After that, you can open the project in IntelliJ IDEA. The IDE should automatically detect the project structure and import as a Maven project.
### CLion
+
First make sure to install the Rust plugin in CLion or you can use the dedicated Rust IDE: RustRover.
After that you can open the project in CLion. The IDE should automatically detect the project structure and import as a Cargo project.
### Running Tests in IDEA
+
Like other Maven projects, you can run tests in IntelliJ IDEA by right-clicking on the test class or test method and selecting "Run" or "Debug".
-However if the tests is related to the native side. Please make sure to run `make core` or `cd core && cargo build` before running the tests in IDEA.
+However if the tests is related to the native side. Please make sure to run `make core` or `cd core && cargo build` before running the tests in IDEA.
## Benchmark
@@ -82,9 +86,11 @@ To run TPC-H or TPC-DS micro benchmarks, please follow the instructions
in the respective source code, e.g., `CometTPCHQueryBenchmark`.
## Debugging
+
Comet is a multi-language project with native code written in Rust and JVM code written in Java and Scala.
-It is possible to debug both native and JVM code concurrently as described in the [DEBUGGING guide](DEBUGGING.md)
+It is possible to debug both native and JVM code concurrently as described in the [DEBUGGING guide](debugging)
## Submitting a Pull Request
-Comet uses `cargo fmt`, [Scalafix](https://github.com/scalacenter/scalafix) and [Spotless](https://github.com/diffplug/spotless/tree/main/plugin-maven) to
-automatically format the code. Before submitting a pull request, you can simply run `make format` to format the code.
\ No newline at end of file
+
+Comet uses `cargo fmt`, [Scalafix](https://github.com/scalacenter/scalafix) and [Spotless](https://github.com/diffplug/spotless/tree/main/plugin-maven) to
+automatically format the code. Before submitting a pull request, you can simply run `make format` to format the code.
diff --git a/docs/source/index.rst b/docs/source/index.rst
index c8f2735..4462a8d 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -35,16 +35,23 @@ Apache DataFusion Comet
Apache DataFusion Comet is an Apache Spark plugin that uses Apache DataFusion as a native runtime to achieve improvement in terms of query efficiency and query runtime.
-This documentation site is currently being developed. The most up-to-date documentation can be found in the
-GitHub repository at https://github.com/apache/datafusion-comet.
+.. _toc.links:
+.. toctree::
+ :maxdepth: 1
+ :caption: User Guide
+
+ Supported Expressions <user-guide/expressions>
+ user-guide/compatibility
.. _toc.links:
.. toctree::
:maxdepth: 1
- :caption: Project Links
+ :caption: Contributor Guide
- compatibility
+ Getting Started <contributor-guide/contributing>
Github and Issue Tracker <https://github.com/apache/datafusion-comet>
+ contributor-guide/development
+ contributor-guide/debugging
.. _toc.asf-links:
.. toctree::
diff --git a/docs/source/compatibility.md b/docs/source/user-guide/compatibility.md
similarity index 97%
rename from docs/source/compatibility.md
rename to docs/source/user-guide/compatibility.md
index 6e69f84..d817ba5 100644
--- a/docs/source/compatibility.md
+++ b/docs/source/user-guide/compatibility.md
@@ -34,7 +34,7 @@ There is an [epic](https://github.com/apache/datafusion-comet/issues/313) where
## Cast
-Comet currently delegates to Apache DataFusion for most cast operations, and this means that the behavior is not
+Comet currently delegates to Apache DataFusion for most cast operations, and this means that the behavior is not
guaranteed to be consistent with Spark.
There is an [epic](https://github.com/apache/datafusion-comet/issues/286) where we are tracking the work to implement Spark-compatible cast expressions.
diff --git a/docs/source/user-guide/expressions.md b/docs/source/user-guide/expressions.md
new file mode 100644
index 0000000..f67a4ea
--- /dev/null
+++ b/docs/source/user-guide/expressions.md
@@ -0,0 +1,109 @@
+<!---
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+-->
+
+# Supported Spark Expressions
+
+The following Spark expressions are currently available:
+
+- Literals
+- Arithmetic Operators
+ - UnaryMinus
+ - Add/Minus/Multiply/Divide/Remainder
+- Conditional functions
+ - Case When
+ - If
+- Cast
+- Coalesce
+- BloomFilterMightContain
+- Boolean functions
+ - And
+ - Or
+ - Not
+ - EqualTo
+ - EqualNullSafe
+ - GreaterThan
+ - GreaterThanOrEqual
+ - LessThan
+ - LessThanOrEqual
+ - IsNull
+ - IsNotNull
+ - In
+- String functions
+ - Substring
+ - Coalesce
+ - StringSpace
+ - Like
+ - Contains
+ - Startswith
+ - Endswith
+ - Ascii
+ - Bit_length
+ - Octet_length
+ - Upper
+ - Lower
+ - Chr
+ - Initcap
+ - Trim/Btrim/Ltrim/Rtrim
+ - Concat_ws
+ - Repeat
+ - Length
+ - Reverse
+ - Instr
+ - Replace
+ - Translate
+- Bitwise functions
+ - Shiftright/Shiftleft
+- Date/Time functions
+ - Year/Hour/Minute/Second
+- Math functions
+ - Abs
+ - Acos
+ - Asin
+ - Atan
+ - Atan2
+ - Cos
+ - Exp
+ - Ln
+ - Log10
+ - Log2
+ - Pow
+ - Round
+ - Signum
+ - Sin
+ - Sqrt
+ - Tan
+ - Ceil
+ - Floor
+- Aggregate functions
+ - Count
+ - Sum
+ - Max
+ - Min
+ - Avg
+ - First
+ - Last
+ - BitAnd
+ - BitOr
+ - BitXor
+ - BoolAnd
+ - BoolOr
+ - CovPopulation
+ - CovSample
+ - VariancePop
+ - VarianceSamp