This is an automated email from the ASF dual-hosted git repository.
alamb pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-datafusion.git
The following commit(s) were added to refs/heads/main by this push:
new e9edd0cb45 Improve contributor guide (#5921)
e9edd0cb45 is described below
commit e9edd0cb4592eb3ca644bc2a2a7674042486802b
Author: Andrew Lamb <[email protected]>
AuthorDate: Mon Apr 10 17:31:55 2023 -0400
Improve contributor guide (#5921)
Move some content into the code organization section
---
datafusion-examples/README.md | 17 ++-
datafusion/common/src/tree_node.rs | 4 +-
datafusion/core/src/lib.rs | 204 +++++++++++++++++---------
datafusion/expr/src/operator.rs | 2 +-
datafusion/physical-expr/src/sort_expr.rs | 2 +-
datafusion/sql/src/planner.rs | 2 +-
docs/source/contributor-guide/architecture.md | 13 +-
docs/source/contributor-guide/index.md | 17 ++-
8 files changed, 169 insertions(+), 92 deletions(-)
diff --git a/datafusion-examples/README.md b/datafusion-examples/README.md
index a7d519fa41..df6ad5a467 100644
--- a/datafusion-examples/README.md
+++ b/datafusion-examples/README.md
@@ -21,10 +21,25 @@
This crate includes several examples of how to use various DataFusion APIs to
help you on your way.
-Prerequisites:
+## Prerequisites
Run `git submodule update --init` to init test files.
+## Running Examples
+
+To run the examples, use the `cargo run` command, for example:
+
+```bash
+git clone https://github.com/apache/arrow-datafusion
+cd arrow-datafusion
+# Download test data
+git submodule update --init
+
+# Run the `csv_sql` example:
+# ... use the equivalent for other examples
+cargo run --example csv_sql
+```
+
## Single Process
- [`avro_sql.rs`](examples/avro_sql.rs): Build and run a query plan from a SQL
statement against a local AVRO file
diff --git a/datafusion/common/src/tree_node.rs
b/datafusion/common/src/tree_node.rs
index fcc11b0281..34d09fdc1f 100644
--- a/datafusion/common/src/tree_node.rs
+++ b/datafusion/common/src/tree_node.rs
@@ -289,7 +289,7 @@ impl<T> Transformed<T> {
/// Helper trait for implementing [`TreeNode`] for types that have children
stored as `Arc`s
///
/// If some trait object, such as `dyn T`, implements this trait,
-/// its related Arc<dyn T> will automatically implement [`TreeNode`]
+/// its related `Arc<dyn T>` will automatically implement [`TreeNode`]
pub trait DynTreeNode {
/// Returns all children of the specified TreeNode
fn arc_children(&self) -> Vec<Arc<Self>>;
@@ -303,7 +303,7 @@ pub trait DynTreeNode {
}
/// Blanket implementation for `Arc` for any type that implements
-/// [`DynTreeNode`] (such as Arc<dyn PhysicalExpr>)
+/// [`DynTreeNode`] (such as [`Arc<dyn PhysicalExpr>`])
impl<T: DynTreeNode + ?Sized> TreeNode for Arc<T> {
fn apply_children<F>(&self, op: &mut F) -> Result<VisitRecursion>
where
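The blanket-impl pattern in this hunk can be illustrated with a toy, std-only sketch. `DynNode` and `TreeWalk` are illustrative stand-ins (not DataFusion's actual `DynTreeNode`/`TreeNode` traits): any `Arc<T>` whose `T` can list its children gets a tree-walking API for free.

```rust
use std::sync::Arc;

/// Stand-in for `DynTreeNode`: a trait object that can list its children.
trait DynNode {
    fn arc_children(&self) -> Vec<Arc<dyn DynNode>>;
}

/// Stand-in for `TreeNode`: a generic tree-walking API.
trait TreeWalk {
    fn count_nodes(&self) -> usize;
}

/// Blanket impl: every `Arc<T>` where `T: DynNode` implements `TreeWalk`.
impl<T: DynNode + ?Sized> TreeWalk for Arc<T> {
    fn count_nodes(&self) -> usize {
        // Count this node plus all descendants, recursively.
        1 + self
            .arc_children()
            .iter()
            .map(|c| c.count_nodes())
            .sum::<usize>()
    }
}

struct Leaf;
impl DynNode for Leaf {
    fn arc_children(&self) -> Vec<Arc<dyn DynNode>> {
        vec![]
    }
}

struct Branch(Vec<Arc<dyn DynNode>>);
impl DynNode for Branch {
    fn arc_children(&self) -> Vec<Arc<dyn DynNode>> {
        self.0.clone()
    }
}
```

A `Branch` with two `Leaf` children then counts three nodes via the blanket impl, which is the same trick that lets `Arc<dyn PhysicalExpr>` pick up `TreeNode` automatically.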
diff --git a/datafusion/core/src/lib.rs b/datafusion/core/src/lib.rs
index 1bbd9a0ebf..94f7f9e39b 100644
--- a/datafusion/core/src/lib.rs
+++ b/datafusion/core/src/lib.rs
@@ -16,16 +16,37 @@
// under the License.
#![warn(missing_docs, clippy::needless_borrow)]
-//! [DataFusion](https://github.com/apache/arrow-datafusion)
-//! is an extensible query execution framework that uses
-//! [Apache Arrow](https://arrow.apache.org) as its in-memory format.
+//! [DataFusion] is an extensible query engine written in Rust that
+//! uses [Apache Arrow] as its in-memory format. DataFusion's [use
+//! cases] include building very fast database and analytic systems,
+//! customized to particular workloads.
//!
-//! DataFusion supports both an SQL and a DataFrame API for building logical
query plans
-//! as well as a query optimizer and execution engine capable of parallel
execution
-//! against partitioned data sources (CSV and Parquet) using threads.
+//! "Out of the box," DataFusion quickly runs complex [SQL] and
+//! [`DataFrame`] queries using a sophisticated query planner, a columnar,
+//! multi-threaded, vectorized execution engine, and partitioned data
+//! sources (Parquet, CSV, JSON, and Avro).
//!
-//! Below is an example of how to execute a query against data stored
-//! in a CSV file using a [`DataFrame`](dataframe::DataFrame):
+//! DataFusion can also be easily customized to support additional
+//! data sources, query languages, functions, custom operators and
+//! more.
+//!
+//! [DataFusion]: https://arrow.apache.org/datafusion/
+//! [Apache Arrow]: https://arrow.apache.org
+//! [use cases]:
https://arrow.apache.org/datafusion/user-guide/introduction.html#use-cases
+//! [SQL]: https://arrow.apache.org/datafusion/user-guide/sql/index.html
+//! [`DataFrame`]: dataframe::DataFrame
+//!
+//! # Examples
+//!
+//! The main entry point for interacting with DataFusion is the
+//! [`SessionContext`].
+//!
+//! [`SessionContext`]: execution::context::SessionContext
+//!
+//! ## DataFrame
+//!
+//! To execute a query against data stored
+//! in a CSV file using a [`DataFrame`]:
//!
//! ```rust
//! # use datafusion::prelude::*;
@@ -64,7 +85,9 @@
//! # }
//! ```
//!
-//! and how to execute a query against a CSV using SQL:
+//! ## SQL
+//!
+//! To execute a query against a CSV file using [SQL]:
//!
//! ```
//! # use datafusion::prelude::*;
@@ -100,57 +123,109 @@
//! # }
//! ```
//!
-//! ## Parse, Plan, Optimize, Execute
+//! ## More Examples
+//!
+//! There are many additional annotated examples of using DataFusion in the
[datafusion-examples] directory.
+//!
+//! [datafusion-examples]:
https://github.com/apache/arrow-datafusion/tree/main/datafusion-examples
+//!
+//! ## Customization and Extension
+//!
+//! DataFusion supports extension at many points:
+//!
+//! * read from any datasource ([`TableProvider`])
+//! * define your own catalogs, schemas, and table lists ([`CatalogProvider`])
+//! * build your own query language or plans using [`LogicalPlanBuilder`]
+//! * declare and use user-defined scalar functions ([`ScalarUDF`])
+//! * declare and use user-defined aggregate functions ([`AggregateUDF`])
+//! * add custom optimizer rewrite passes ([`OptimizerRule`] and
[`PhysicalOptimizerRule`])
+//! * extend the planner to use user-defined logical and physical nodes
([`QueryPlanner`])
+//!
+//! You can find examples of each of them in the [datafusion-examples]
directory.
+//!
+//! [`TableProvider`]: crate::datasource::TableProvider
+//! [`CatalogProvider`]: crate::catalog::catalog::CatalogProvider
+//! [`LogicalPlanBuilder`]:
datafusion_expr::logical_plan::builder::LogicalPlanBuilder
+//! [`ScalarUDF`]: physical_plan::udf::ScalarUDF
+//! [`AggregateUDF`]: physical_plan::udaf::AggregateUDF
+//! [`QueryPlanner`]: execution::context::QueryPlanner
+//! [`OptimizerRule`]: datafusion_optimizer::optimizer::OptimizerRule
+//! [`PhysicalOptimizerRule`]: crate::physical_optimizer::optimizer::PhysicalOptimizerRule
+//!
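The extension-point idea above can be roughly illustrated with a std-only sketch of a user-defined scalar function registry. `FunctionRegistry` and `ScalarFn` are invented names for illustration, not DataFusion's actual API:

```rust
use std::collections::HashMap;

/// A "UDF" here is just a named function applied element-wise to a column.
type ScalarFn = fn(i64) -> i64;

#[derive(Default)]
struct FunctionRegistry {
    udfs: HashMap<String, ScalarFn>,
}

impl FunctionRegistry {
    /// Register a scalar function under a name, as a query could then
    /// reference it by that name.
    fn register_udf(&mut self, name: &str, f: ScalarFn) {
        self.udfs.insert(name.to_string(), f);
    }

    /// Apply a registered UDF to every value in a "column";
    /// returns `None` for an unknown function name.
    fn invoke(&self, name: &str, column: &[i64]) -> Option<Vec<i64>> {
        let f = self.udfs.get(name)?;
        Some(column.iter().map(|v| f(*v)).collect())
    }
}
```

The real `ScalarUDF` works over Arrow arrays and carries type information, but the registration-then-lookup-by-name shape is the same.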
+//! # Code Organization
+//!
+//! ## Overview Presentations
+//!
+//! The following presentations offer high level overviews of the
+//! different components and how they interact together.
+//!
+//! - [Apr 2023]: The Apache Arrow DataFusion Architecture talks
+//! - _Query Engine_: [recording](https://youtu.be/NVKujPxwSBA) and
[slides](https://docs.google.com/presentation/d/1D3GDVas-8y0sA4c8EOgdCvEjVND4s2E7I6zfs67Y4j8/edit#slide=id.p)
+//! - _Logical Plan and Expressions_:
[recording](https://youtu.be/EzZTLiSJnhY) and
[slides](https://docs.google.com/presentation/d/1ypylM3-w60kVDW7Q6S99AHzvlBgciTdjsAfqNP85K30)
+//! - _Physical Plan and Execution_:
[recording](https://youtu.be/2jkWU3_w6z0) and
[slides](https://docs.google.com/presentation/d/1cA2WQJ2qg6tx6y4Wf8FH2WVSm9JQ5UgmBWATHdik0hg)
+//! - [February 2021]: How DataFusion is used within the Ballista Project is
+//! described in _Ballista: Distributed Compute with Rust and Apache Arrow_:
+//! [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)
+//! - [July 2022]: DataFusion and Arrow: Supercharge Your Data Analytical Tool
with a Rusty Query Engine:
[recording](https://www.youtube.com/watch?v=Rii1VTn3seQ) and
[slides](https://docs.google.com/presentation/d/1q1bPibvu64k2b7LPi7Yyb0k3gA1BiUYiUbEklqW1Ckc/view#slide=id.g11054eeab4c_0_1165)
+//! - [March 2021]: The DataFusion architecture is described in _Query Engine
Design and the Rust-Based DataFusion in Apache Arrow_:
[recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content
starts [~ 15 minutes in](https://www.youtube.com/watch?v=K6eCAVEk4kU&t=875s))
and
[slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934)
+//!
+//! ## Architecture
//!
//! DataFusion is a fully fledged query engine capable of performing complex
operations.
//! Specifically, when DataFusion receives an SQL query, there are different
steps
//! that it passes through until a result is obtained. Broadly, they are:
//!
-//! 1. The string is parsed to an Abstract syntax tree (AST) using
[sqlparser](https://docs.rs/sqlparser/latest/sqlparser/).
-//! 2. The planner [`SqlToRel`](sql::planner::SqlToRel) converts logical
expressions on the AST to logical expressions [`Expr`s](datafusion_expr::Expr).
-//! 3. The planner [`SqlToRel`](sql::planner::SqlToRel) converts logical nodes
on the AST to a [`LogicalPlan`](datafusion_expr::LogicalPlan).
-//! 4. [`OptimizerRules`](optimizer::optimizer::OptimizerRule) are applied to
the [`LogicalPlan`](datafusion_expr::LogicalPlan) to optimize it.
-//! 5. The [`LogicalPlan`](datafusion_expr::LogicalPlan) is converted to an
[`ExecutionPlan`](physical_plan::ExecutionPlan) by a
[`PhysicalPlanner`](physical_plan::PhysicalPlanner)
-//! 6. The [`ExecutionPlan`](physical_plan::ExecutionPlan) is executed against
data through the [`SessionContext`](execution::context::SessionContext)
+//! 1. The string is parsed to an abstract syntax tree (AST) using [sqlparser].
+//! 2. The planner [`SqlToRel`] converts expressions in the AST into
+//! logical expressions ([`Expr`]s).
+//! 3. The planner [`SqlToRel`] converts logical nodes on the AST to a
[`LogicalPlan`].
+//! 4. [`OptimizerRule`]s are applied to the [`LogicalPlan`] to optimize it.
+//! 5. The [`LogicalPlan`] is converted to an [`ExecutionPlan`] by a
[`PhysicalPlanner`]
+//! 6. The [`ExecutionPlan`] is executed against data through the
+//! [`SessionContext`]
//!
-//! With a [`DataFrame`](dataframe::DataFrame) API, steps 1-3 are not used as
the DataFrame builds the [`LogicalPlan`](datafusion_expr::LogicalPlan) directly.
+//! With the [`DataFrame`] API, steps 1-3 are not used as the DataFrame builds
the [`LogicalPlan`] directly.
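The optimization step (4) can be sketched with a toy constant-folding rewrite over a minimal expression type. `Expr` here is an illustrative stand-in for the real `datafusion_expr::Expr`, and the function plays the role of a single `OptimizerRule` pass:

```rust
/// A tiny logical expression type (illustrative only).
#[derive(Debug, PartialEq)]
enum Expr {
    Literal(i64),
    Column(String),
    Add(Box<Expr>, Box<Expr>),
}

/// A rewrite pass in the spirit of an `OptimizerRule`:
/// fold `1 + 2` into `3`, bottom-up, leaving columns untouched.
fn fold_constants(expr: Expr) -> Expr {
    match expr {
        Expr::Add(l, r) => match (fold_constants(*l), fold_constants(*r)) {
            // Both sides are now literals: evaluate at plan time.
            (Expr::Literal(a), Expr::Literal(b)) => Expr::Literal(a + b),
            // Otherwise rebuild the node with the rewritten children.
            (l, r) => Expr::Add(Box::new(l), Box::new(r)),
        },
        other => other,
    }
}
```

Real optimizer rules rewrite whole `LogicalPlan` trees (and DataFusion already ships a constant-folding/simplification pass), but the recursive rewrite shape is the same.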
//!
//! Phases 1-5 are typically cheap when compared to phase 6, and thus
//! DataFusion puts a lot of effort into ensuring that phase 6 runs
//! efficiently and without errors.
//!
//! DataFusion's planning is divided into two main parts: logical planning
//! and physical planning.
//!
-//! ### Logical plan
+//! ### Logical planning
//!
-//! Logical planning yields [`logical plans`](datafusion_expr::LogicalPlan)
and [`logical expressions`](datafusion_expr::Expr).
-//! These are [`Schema`](arrow::datatypes::Schema)-aware traits that represent
statements whose result is independent of how it should physically be executed.
+//! Logical planning yields [`LogicalPlan`]s and logical [`Expr`]
+//! expressions, which are [`Schema`]-aware and represent statements
+//! whose results are independent of how they are physically
+//! executed.
//!
-//! A [`LogicalPlan`](datafusion_expr::LogicalPlan) is a Directed Acyclic
Graph (DAG) of other [`LogicalPlan`s](datafusion_expr::LogicalPlan) and each
node contains logical expressions ([`Expr`s](logical_expr::Expr)).
-//! All of these are located in [`datafusion_expr`](datafusion_expr).
+//! A [`LogicalPlan`] is a Directed Acyclic Graph (DAG) of other
+//! [`LogicalPlan`]s, and each node contains [`Expr`]s. All of these
+//! are located in the [`datafusion_expr`] module.
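The DAG shape described above can be sketched with a toy plan type (illustrative only; the real `LogicalPlan` has many more variants). Holding inputs by `Arc` is what lets a common subplan be shared by several parents:

```rust
use std::sync::Arc;

/// A tiny `LogicalPlan`-style node type (illustrative only).
#[derive(Debug)]
enum Plan {
    Scan { table: String },
    Filter { input: Arc<Plan> },
    Join { left: Arc<Plan>, right: Arc<Plan> },
}

/// Return the direct inputs of a node, as a `LogicalPlan::inputs`-style
/// accessor would.
fn inputs(plan: &Plan) -> Vec<&Arc<Plan>> {
    match plan {
        Plan::Scan { .. } => vec![],
        Plan::Filter { input } => vec![input],
        Plan::Join { left, right } => vec![left, right],
    }
}
```

A self-join, for example, can reference the same scan node from both sides without duplicating it.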
//!
-//! ### Physical plan
+//! ### Physical planning
//!
-//! A Physical plan ([`ExecutionPlan`](physical_plan::ExecutionPlan)) is a
plan that can be executed against data.
-//! Contrarily to a logical plan, the physical plan has concrete information
about how the calculation
-//! should be performed (e.g. what Rust functions are used) and how data
should be loaded into memory.
+//! An [`ExecutionPlan`] (sometimes referred to as a "physical plan")
+//! is a plan that can be executed against data. Compared to a
+//! logical plan, the physical plan has concrete information about how
+//! calculations should be performed (e.g. what Rust functions are
+//! used) and how data should be loaded into memory.
//!
-//! [`ExecutionPlan`](physical_plan::ExecutionPlan) uses the Arrow format as
its in-memory representation of data, through the [arrow] crate.
-//! We recommend going through [its documentation](arrow) for details on how
the data is physically represented.
+//! [`ExecutionPlan`]s use the [Apache Arrow] format as their in-memory
+//! representation of data, through the [arrow] crate. The [arrow]
+//! crate documents how the data is physically represented in memory.
//!
-//! A [`ExecutionPlan`](physical_plan::ExecutionPlan) is composed by nodes
(implement the trait [`ExecutionPlan`](physical_plan::ExecutionPlan)),
-//! and each node is composed by physical expressions
([`PhysicalExpr`](physical_plan::PhysicalExpr))
-//! or aggreagate expressions
([`AggregateExpr`](physical_plan::AggregateExpr)).
-//! All of these are located in the module [`physical_plan`](physical_plan).
+//! An [`ExecutionPlan`] is composed of nodes (each of which implements
+//! the [`ExecutionPlan`] trait). Each node can contain physical
+//! expressions ([`PhysicalExpr`]) or aggregate expressions
+//! ([`AggregateExpr`]). All of these are located in the
+//! [`physical_plan`] module.
//!
//! Broadly speaking,
//!
-//! * an [`ExecutionPlan`](physical_plan::ExecutionPlan) receives a partition
number and asynchronously returns
-//! an iterator over [`RecordBatch`](arrow::record_batch::RecordBatch)
-//! (a node-specific struct that implements
[`RecordBatchReader`](arrow::record_batch::RecordBatchReader))
-//! * a [`PhysicalExpr`](physical_plan::PhysicalExpr) receives a
[`RecordBatch`](arrow::record_batch::RecordBatch)
-//! and returns an [`Array`](arrow::array::Array)
-//! * an [`AggregateExpr`](physical_plan::AggregateExpr) receives
[`RecordBatch`es](arrow::record_batch::RecordBatch)
-//! and returns a [`RecordBatch`](arrow::record_batch::RecordBatch) of a
single row(*)
+//! * an [`ExecutionPlan`] receives a partition number and
+//! asynchronously returns an iterator over [`RecordBatch`] (a
+//! node-specific struct that implements [`RecordBatchReader`])
+//! * a [`PhysicalExpr`] receives a [`RecordBatch`]
+//! and returns an [`Array`]
+//! * an [`AggregateExpr`] receives a series of [`RecordBatch`]es
+//! and returns a [`RecordBatch`] of a single row (*)
//!
//! (*) Technically, it aggregates the results on each partition and then
merges the results into a single partition.
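The per-partition aggregation in (*) can be sketched std-only, with a `Vec<i64>` standing in for a single-column `RecordBatch` (names and types are illustrative, not DataFusion's):

```rust
/// One column of values; stand-in for a `RecordBatch`.
type RecordBatch = Vec<i64>;

/// Partial aggregation over one partition's batches. In a real engine each
/// partition runs independently, typically on its own thread.
fn sum_partition(batches: impl IntoIterator<Item = RecordBatch>) -> i64 {
    batches.into_iter().flatten().sum()
}

/// Final step: merge the per-partition partial results into a
/// single one-row "batch".
fn merge(partials: Vec<i64>) -> RecordBatch {
    vec![partials.into_iter().sum()]
}
```

Summing two partitions' partials and then merging them yields the same single-row result as summing everything at once, which is what makes the parallel split safe for an aggregate like `SUM`.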
//!
@@ -173,39 +248,24 @@
//! * Scan from memory: [`MemoryExec`](physical_plan::memory::MemoryExec)
//! * Explain the plan: [`ExplainExec`](physical_plan::explain::ExplainExec)
//!
-//! ## Customize
-//!
-//! DataFusion allows users to
-//! * extend the planner to use user-defined logical and physical nodes
([`QueryPlanner`](execution::context::QueryPlanner))
-//! * declare and use user-defined scalar functions
([`ScalarUDF`](physical_plan::udf::ScalarUDF))
-//! * declare and use user-defined aggregate functions
([`AggregateUDF`](physical_plan::udaf::AggregateUDF))
-//!
-//! You can find examples of each of them in examples section.
-//!
-//! ## Examples
-//!
-//! Examples are located in [datafusion-examples
directory](https://github.com/apache/arrow-datafusion/tree/main/datafusion-examples)
-//!
-//! Here's how to run them
-//!
-//! ```bash
-//! git clone https://github.com/apache/arrow-datafusion
-//! cd arrow-datafusion
-//! # Download test data
-//! git submodule update --init
-//!
-//! cargo run --example csv_sql
-//!
-//! cargo run --example parquet_sql
-//!
-//! cargo run --example dataframe
-//!
-//! cargo run --example dataframe_in_memory
-//!
-//! cargo run --example simple_udaf
-//!
-//! cargo run --example simple_udf
-//! ```
+//! Future topics (coming soon):
+//! * Analyzer Rules
+//! * Resource management (memory and disk)
+//!
+//! [sqlparser]: https://docs.rs/sqlparser/latest/sqlparser
+//! [`SqlToRel`]: sql::planner::SqlToRel
+//! [`Expr`]: datafusion_expr::Expr
+//! [`LogicalPlan`]: datafusion_expr::LogicalPlan
+//! [`OptimizerRule`]: optimizer::optimizer::OptimizerRule
+//! [`ExecutionPlan`]: physical_plan::ExecutionPlan
+//! [`PhysicalPlanner`]: physical_plan::PhysicalPlanner
+//! [`Schema`]: arrow::datatypes::Schema
+//! [`datafusion_expr`]: datafusion_expr
+//! [`PhysicalExpr`]: physical_plan::PhysicalExpr
+//! [`AggregateExpr`]: physical_plan::AggregateExpr
+//! [`RecordBatch`]: arrow::record_batch::RecordBatch
+//! [`RecordBatchReader`]: arrow::record_batch::RecordBatchReader
+//! [`Array`]: arrow::array::Array
/// DataFusion crate version
pub const DATAFUSION_VERSION: &str = env!("CARGO_PKG_VERSION");
diff --git a/datafusion/expr/src/operator.rs b/datafusion/expr/src/operator.rs
index 659e0d1af3..d554c5cebb 100644
--- a/datafusion/expr/src/operator.rs
+++ b/datafusion/expr/src/operator.rs
@@ -113,7 +113,7 @@ impl Operator {
/// Return true if the operator is a numerical operator.
///
/// For example, 'Binary(a, +, b)' would be a numerical expression.
- /// PostgresSQL concept:
https://www.postgresql.org/docs/7.0/operators2198.htm
+ /// PostgreSQL concept:
<https://www.postgresql.org/docs/7.0/operators2198.htm>
pub fn is_numerical_operators(&self) -> bool {
matches!(
self,
diff --git a/datafusion/physical-expr/src/sort_expr.rs
b/datafusion/physical-expr/src/sort_expr.rs
index 0f01760646..1699d25bf2 100644
--- a/datafusion/physical-expr/src/sort_expr.rs
+++ b/datafusion/physical-expr/src/sort_expr.rs
@@ -195,7 +195,7 @@ impl PhysicalSortRequirement {
///
/// This function converts `PhysicalSortRequirement` to `PhysicalSortExpr`
/// for each entry in the input. If required ordering is None for an entry
- /// default ordering `ASC, NULLS LAST` if given (see [`into_sort_expr`])
+ /// the default ordering `ASC, NULLS LAST` is used (see
[`Self::into_sort_expr`])
pub fn to_sort_exprs(
requirements: impl IntoIterator<Item = PhysicalSortRequirement>,
) -> Vec<PhysicalSortExpr> {
diff --git a/datafusion/sql/src/planner.rs b/datafusion/sql/src/planner.rs
index 3673742d5b..f872aa5676 100644
--- a/datafusion/sql/src/planner.rs
+++ b/datafusion/sql/src/planner.rs
@@ -106,7 +106,7 @@ pub struct PlannerContext {
/// in `PREPARE` statement
prepare_param_data_types: Vec<DataType>,
/// Map of CTE name to logical plan of the WITH clause.
- /// Use Arc<LogicalPlan> to allow cheap cloning
+ /// Use `Arc<LogicalPlan>` to allow cheap cloning
ctes: HashMap<String, Arc<LogicalPlan>>,
/// The query schema of the outer query plan, used to resolve the columns
in subquery
outer_query_schema: Option<DFSchema>,
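The "cheap cloning" rationale in this hunk comes down to `Arc` semantics: cloning an `Arc` copies a pointer and bumps a reference count, never the pointee. A std-only sketch (`BigPlan` is an illustrative stand-in for a large `LogicalPlan`):

```rust
use std::sync::Arc;

/// Illustrative stand-in for a large, expensive-to-copy plan.
struct BigPlan {
    nodes: Vec<String>,
}

/// Hand out another handle to the same plan. This is O(1) regardless of
/// the plan's size; the contents are never duplicated.
fn share(plan: &Arc<BigPlan>) -> Arc<BigPlan> {
    Arc::clone(plan)
}
```

This is why a CTE map of `Arc<LogicalPlan>` can hand the same plan to multiple references without copying it.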
diff --git a/docs/source/contributor-guide/architecture.md
b/docs/source/contributor-guide/architecture.md
index 081866aa8f..48c065f5b7 100644
--- a/docs/source/contributor-guide/architecture.md
+++ b/docs/source/contributor-guide/architecture.md
@@ -19,13 +19,8 @@
# Architecture
-There is no formal document describing DataFusion's architecture yet, but the
following presentations offer a good overview of its different components and
how they interact together.
+DataFusion's code structure and organization are described in the
+[Crate Documentation], to keep that description as close to the source as
+possible.
-- [Apr 2023]: The Apache Arrow DataFusion Architecture talks series by @alamb
- - _Query Engine_: [recording](https://youtu.be/NVKujPxwSBA) and
[slides](https://docs.google.com/presentation/d/1D3GDVas-8y0sA4c8EOgdCvEjVND4s2E7I6zfs67Y4j8/edit#slide=id.p)
- - _Logical Plan and Expressions_: [recording](https://youtu.be/EzZTLiSJnhY)
and
[slides](https://docs.google.com/presentation/d/1ypylM3-w60kVDW7Q6S99AHzvlBgciTdjsAfqNP85K30/edit#slide=id.gbe21b752a6_0_218)
- - _Physical Plan and Execution_: [recording](https://youtu.be/2jkWU3_w6z0)
and
[slides](https://docs.google.com/presentation/d/1cA2WQJ2qg6tx6y4Wf8FH2WVSm9JQ5UgmBWATHdik0hg/edit?usp=sharing)
-- [February 2021]: How DataFusion is used within the Ballista Project is
described in \*Ballista: Distributed Compute with Rust and Apache Arrow:
[recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)
-- [July 2022]: DataFusion and Arrow: Supercharge Your Data Analytical Tool
with a Rusty Query Engine:
[recording](https://www.youtube.com/watch?v=Rii1VTn3seQ) and
[slides](https://docs.google.com/presentation/d/1q1bPibvu64k2b7LPi7Yyb0k3gA1BiUYiUbEklqW1Ckc/view#slide=id.g11054eeab4c_0_1165)
-- [March 2021]: The DataFusion architecture is described in _Query Engine
Design and the Rust-Based DataFusion in Apache Arrow_:
[recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content
starts [~ 15 minutes in](https://www.youtube.com/watch?v=K6eCAVEk4kU&t=875s))
and
[slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934)
-- [February 2021]: How DataFusion is used within the Ballista Project is
described in \*Ballista: Distributed Compute with Rust and Apache Arrow:
[recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)
+[crate documentation]:
https://docs.rs/datafusion/latest/datafusion/index.html#code-organization
diff --git a/docs/source/contributor-guide/index.md
b/docs/source/contributor-guide/index.md
index 2cdeaccc15..7c19ff2e89 100644
--- a/docs/source/contributor-guide/index.md
+++ b/docs/source/contributor-guide/index.md
@@ -23,9 +23,9 @@ We welcome and encourage contributions of all kinds, such as:
1. Tickets with issue reports or feature requests
2. Documentation improvements
-3. Code (PR or PR Review)
+3. Code, both PRs and (especially) PR reviews
-In addition to submitting new PRs, we have a healthy tradition of community
members helping review each other's PRs. Doing so is a great way to help the
community as well as get more familiar with Rust and the relevant codebases.
+In addition to submitting new PRs, we have a healthy tradition of community
members reviewing each other's PRs. Doing so is a great way to help the
community as well as get more familiar with Rust and the relevant codebases.
You can find a curated
[good-first-issue](https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22)
@@ -41,6 +41,11 @@ DataFusion is a very active fast-moving project and we try
to review and merge P
Review bandwidth is currently our most limited resource, and we highly
encourage reviews by the broader community. If you are waiting for your PR to
be reviewed, consider helping review other PRs that are waiting. Such review
both helps the reviewer to learn the codebase and become more expert, as well
as helps identify issues in the PR (such as lack of test coverage), that can be
addressed and make future reviews faster and more efficient.
+Things to look for when reviewing a PR:
+
+1. Is the feature or fix sufficiently covered with tests (see `Test Organization` below)?
+2. Is the code clear, and does it fit the style of the existing codebase?
+
Since we are a worldwide community, we have contributors in many timezones who
review and comment. To ensure anyone who wishes has an opportunity to review a
PR, our committers try to ensure that at least 24 hours passes between when a
"major" PR is approved and when it is merged.
A "major" PR means there is a substantial change in design or a change in the
API. Committers apply their best judgment to determine what constitutes a
substantial change. A "minor" PR might be merged without a 24 hour delay, again
subject to the judgment of the committer. Examples of potential "minor" PRs are:
@@ -112,15 +117,17 @@ or run them all at once:
### Test Organization
+Tests are very important to ensure that improvements or fixes are not
accidentally broken during subsequent refactorings.
+
DataFusion has several levels of tests in its [Test
Pyramid](https://martinfowler.com/articles/practical-test-pyramid.html)
-and tries to follow [Testing
Organization](https://doc.rust-lang.org/book/ch11-03-test-organization.html) in
the The Book.
+and tries to follow the standard Rust [Testing Organization](https://doc.rust-lang.org/book/ch11-03-test-organization.html) described in The Book.
This section highlights the most important test modules that exist.
#### Unit tests
-Tests for the code in an individual module are defined in the same source file
with a `test` module, following Rust convention
+Tests for the code in an individual module are defined in the same source file
with a `test` module, following Rust convention.
#### Rust Integration Tests
@@ -240,7 +247,7 @@ dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf
## Specifications
-We formalize DataFusion semantics and behaviors through specification
+We formalize some DataFusion semantics and behaviors through specification
documents. These specifications are useful as references to help
resolve ambiguities during development or code reviews.