This is an automated email from the ASF dual-hosted git repository.
alamb pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-datafusion.git
The following commit(s) were added to refs/heads/main by this push:
new e9edd0cb45 Improve contributor guide (#5921)
e9edd0cb45 is described below
commit e9edd0cb4592eb3ca644bc2a2a7674042486802b
Author: Andrew Lamb <[email protected]>
AuthorDate: Mon Apr 10 17:31:55 2023 -0400
Improve contributor guide (#5921)
Move some content into the code organization section
---
datafusion-examples/README.md | 17 ++-
datafusion/common/src/tree_node.rs | 4 +-
datafusion/core/src/lib.rs | 204 +++++++++++++++++---------
datafusion/expr/src/operator.rs | 2 +-
datafusion/physical-expr/src/sort_expr.rs | 2 +-
datafusion/sql/src/planner.rs | 2 +-
docs/source/contributor-guide/architecture.md | 13 +-
docs/source/contributor-guide/index.md | 17 ++-
8 files changed, 169 insertions(+), 92 deletions(-)
diff --git a/datafusion-examples/README.md b/datafusion-examples/README.md
index a7d519fa41..df6ad5a467 100644
--- a/datafusion-examples/README.md
+++ b/datafusion-examples/README.md
@@ -21,10 +21,25 @@
This crate includes several examples of how to use various DataFusion APIs to
help you on your way.
-Prerequisites:
+## Prerequisites
Run `git submodule update --init` to init test files.
+## Running Examples
+
+To run the examples, use the `cargo run` command, for example:
+
+```bash
+git clone https://github.com/apache/arrow-datafusion
+cd arrow-datafusion
+# Download test data
+git submodule update --init
+
+# Run the `csv_sql` example:
+# ... use the equivalent for other examples
+cargo run --example csv_sql
+```
+
## Single Process
- [`avro_sql.rs`](examples/avro_sql.rs): Build and run a query plan from a SQL
statement against a local AVRO file
diff --git a/datafusion/common/src/tree_node.rs
b/datafusion/common/src/tree_node.rs
index fcc11b0281..34d09fdc1f 100644
--- a/datafusion/common/src/tree_node.rs
+++ b/datafusion/common/src/tree_node.rs
@@ -289,7 +289,7 @@ impl<T> Transformed<T> {
/// Helper trait for implementing [`TreeNode`] for types that have children
stored as `Arc`s
///
/// If some trait object, such as `dyn T`, implements this trait,
-/// its related Arc<dyn T> will automatically implement [`TreeNode`]
+/// its related `Arc<dyn T>` will automatically implement [`TreeNode`]
pub trait DynTreeNode {
/// Returns all children of the specified TreeNode
fn arc_children(&self) -> Vec<Arc<Self>>;
@@ -303,7 +303,7 @@ pub trait DynTreeNode {
}
/// Blanket implementation for `Arc` for any type that implements
-/// [`DynTreeNode`] (such as Arc<dyn PhysicalExpr>)
+/// [`DynTreeNode`] (such as [`Arc<dyn PhysicalExpr>`])
impl<T: DynTreeNode + ?Sized> TreeNode for Arc<T> {
fn apply_children<F>(&self, op: &mut F) -> Result<VisitRecursion>
where
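The blanket-impl pattern in this hunk can be illustrated with a toy, std-only sketch. `DynNode` and `TreeWalk` are illustrative stand-ins (not DataFusion's actual `DynTreeNode`/`TreeNode` traits): any `Arc<T>` whose `T` can list its children gets a tree-walking API for free.

```rust
use std::sync::Arc;

/// Stand-in for `DynTreeNode`: a trait object that can list its children.
trait DynNode {
    fn arc_children(&self) -> Vec<Arc<dyn DynNode>>;
}

/// Stand-in for `TreeNode`: a generic tree-walking API.
trait TreeWalk {
    fn count_nodes(&self) -> usize;
}

/// Blanket impl: every `Arc<T>` where `T: DynNode` implements `TreeWalk`.
impl<T: DynNode + ?Sized> TreeWalk for Arc<T> {
    fn count_nodes(&self) -> usize {
        // Count this node plus all descendants, recursively.
        1 + self
            .arc_children()
            .iter()
            .map(|c| c.count_nodes())
            .sum::<usize>()
    }
}

struct Leaf;
impl DynNode for Leaf {
    fn arc_children(&self) -> Vec<Arc<dyn DynNode>> {
        vec![]
    }
}

struct Branch(Vec<Arc<dyn DynNode>>);
impl DynNode for Branch {
    fn arc_children(&self) -> Vec<Arc<dyn DynNode>> {
        self.0.clone()
    }
}
```

A `Branch` with two `Leaf` children then counts three nodes via the blanket impl, which is the same trick that lets `Arc<dyn PhysicalExpr>` pick up `TreeNode` automatically.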
diff --git a/datafusion/core/src/lib.rs b/datafusion/core/src/lib.rs
index 1bbd9a0ebf..94f7f9e39b 100644
--- a/datafusion/core/src/lib.rs
+++ b/datafusion/core/src/lib.rs
@@ -16,16 +16,37 @@
// under the License.
#![warn(missing_docs, clippy::needless_borrow)]
-//! [DataFusion](https://github.com/apache/arrow-datafusion)
-//! is an extensible query execution framework that uses
-//! [Apache Arrow](https://arrow.apache.org) as its in-memory format.
+//! [DataFusion] is an extensible query engine written in Rust that
+//! uses [Apache Arrow] as its in-memory format. DataFusion's [use
+//! cases] include building very fast database and analytic systems,
+//! customized to particular workloads.
//!
-//! DataFusion supports both an SQL and a DataFrame API for building logical
query plans
-//! as well as a query optimizer and execution engine capable of parallel
execution
-//! against partitioned data sources (CSV and Parquet) using threads.
+//! "Out of the box," DataFusion quickly runs complex [SQL] and
+//! [`DataFrame`] queries using a sophisticated query planner, a columnar,
+//! multi-threaded, vectorized execution engine, and partitioned data
+//! sources (Parquet, CSV, JSON, and Avro).
//!
-//! Below is an example of how to execute a query against data stored
-//! in a CSV file using a [`DataFrame`](dataframe::DataFrame):
+//! DataFusion can also be easily customized to support additional
+//! data sources, query languages, functions, custom operators and
+//! more.
+//!
+//! [DataFusion]: https://arrow.apache.org/datafusion/
+//! [Apache Arrow]: https://arrow.apache.org
+//! [use cases]:
https://arrow.apache.org/datafusion/user-guide/introduction.html#use-cases
+//! [SQL]: https://arrow.apache.org/datafusion/user-guide/sql/index.html
+//! [`DataFrame`]: dataframe::DataFrame
+//!
+//! # Examples
+//!
+//! The main entry point for interacting with DataFusion is the
+//! [`SessionContext`].
+//!
+//! [`SessionContext`]: execution::context::SessionContext
+//!
+//! ## DataFrame
+//!
+//! To execute a query against data stored
+//! in a CSV file using a [`DataFrame`]:
//!
//! ```rust
//! # use datafusion::prelude::*;
@@ -64,7 +85,9 @@
//! # }
//! ```
//!
-//! and how to execute a query against a CSV using SQL:
+//! ## SQL
+//!
+//! To execute a query against a CSV file using [SQL]:
//!
//! ```
//! # use datafusion::prelude::*;
@@ -100,57 +123,109 @@
//! # }
//! ```
//!
-//! ## Parse, Plan, Optimize, Execute
+//! ## More Examples
+//!
+//! There are many additional annotated examples of using DataFusion in the
[datafusion-examples] directory.
+//!
+//! [datafusion-examples]:
https://github.com/apache/arrow-datafusion/tree/main/datafusion-examples
+//!
+//! ## Customization and Extension
+//!
+//! DataFusion supports extension at many points:
+//!
+//! * read from any datasource ([`TableProvider`])
+//! * define your own catalogs, schemas, and table lists ([`CatalogProvider`])
+//! * build your own query language or plans using [`LogicalPlanBuilder`]
+//! * declare and use user-defined scalar functions ([`ScalarUDF`])
+//! * declare and use user-defined aggregate functions ([`AggregateUDF`])
+//! * add custom optimizer rewrite passes ([`OptimizerRule`] and
[`PhysicalOptimizerRule`])
+//! * extend the planner to use user-defined logical and physical nodes
([`QueryPlanner`])
+//!
+//! You can find examples of each of them in the [datafusion-examples]
directory.
+//!
+//! [`TableProvider`]: crate::datasource::TableProvider
+//! [`CatalogProvider`]: crate::catalog::catalog::CatalogProvider
+//! [`LogicalPlanBuilder`]:
datafusion_expr::logical_plan::builder::LogicalPlanBuilder
+//! [`ScalarUDF`]: physical_plan::udf::ScalarUDF
+//! [`AggregateUDF`]: physical_plan::udaf::AggregateUDF
+//! [`QueryPlanner`]: execution::context::QueryPlanner
+//! [`OptimizerRule`]: datafusion_optimizer::optimizer::OptimizerRule
+//! [`PhysicalOptimizerRule`]: crate::physical_optimizer::optimizer::PhysicalOptimizerRule
+//!
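The extension-point idea above can be roughly illustrated with a std-only sketch of a user-defined scalar function registry. `FunctionRegistry` and `ScalarFn` are invented names for illustration, not DataFusion's actual API:

```rust
use std::collections::HashMap;

/// A "UDF" here is just a named function applied element-wise to a column.
type ScalarFn = fn(i64) -> i64;

#[derive(Default)]
struct FunctionRegistry {
    udfs: HashMap<String, ScalarFn>,
}

impl FunctionRegistry {
    /// Register a scalar function under a name, as a query could then
    /// reference it by that name.
    fn register_udf(&mut self, name: &str, f: ScalarFn) {
        self.udfs.insert(name.to_string(), f);
    }

    /// Apply a registered UDF to every value in a "column";
    /// returns `None` for an unknown function name.
    fn invoke(&self, name: &str, column: &[i64]) -> Option<Vec<i64>> {
        let f = self.udfs.get(name)?;
        Some(column.iter().map(|v| f(*v)).collect())
    }
}
```

The real `ScalarUDF` works over Arrow arrays and carries type information, but the registration-then-lookup-by-name shape is the same.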
+//! # Code Organization
+//!
+//! ## Overview Presentations
+//!
+//! The following presentations offer high level overviews of the
+//! different components and how they interact together.
+//!
+//! - [Apr 2023]: The Apache Arrow DataFusion Architecture talks
+//! - _Query Engine_: [recording](https://youtu.be/NVKujPxwSBA) and
[slides](https://docs.google.com/presentation/d/1D3GDVas-8y0sA4c8EOgdCvEjVND4s2E7I6zfs67Y4j8/edit#slide=id.p)
+//! - _Logical Plan and Expressions_:
[recording](https://youtu.be/EzZTLiSJnhY) and
[slides](https://docs.google.com/presentation/d/1ypylM3-w60kVDW7Q6S99AHzvlBgciTdjsAfqNP85K30)
+//! - _Physical Plan and Execution_:
[recording](https://youtu.be/2jkWU3_w6z0) and
[slides](https://docs.google.com/presentation/d/1cA2WQJ2qg6tx6y4Wf8FH2WVSm9JQ5UgmBWATHdik0hg)
+//! - [February 2021]: How DataFusion is used within the Ballista Project is
+//! described in _Ballista: Distributed Compute with Rust and Apache Arrow_:
+//! [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)
+//! - [July 2022]: DataFusion and Arrow: Supercharge Your Data Analytical Tool
with a Rusty Query Engine:
[recording](https://www.youtube.com/watch?v=Rii1VTn3seQ) and
[slides](https://docs.google.com/presentation/d/1q1bPibvu64k2b7LPi7Yyb0k3gA1BiUYiUbEklqW1Ckc/view#slide=id.g11054eeab4c_0_1165)
+//! - [March 2021]: The DataFusion architecture is described in _Query Engine
Design and the Rust-Based DataFusion in Apache Arrow_:
[recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content
starts [~ 15 minutes in](https://www.youtube.com/watch?v=K6eCAVEk4kU&t=875s))
and
[slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934)
+//!
+//! ## Architecture
//!
//! DataFusion is a fully fledged query engine capable of performing complex
operations.
//! Specifically, when DataFusion receives an SQL query, there are different
steps
//! that it passes through until a result is obtained. Broadly, they are:
//!
-//! 1. The string is parsed to an Abstract syntax tree (AST) using
[sqlparser](https://docs.rs/sqlparser/latest/sqlparser/).
-//! 2. The planner [`SqlToRel`](sql::planner::SqlToRel) converts logical
expressions on the AST to logical expressions [`Expr`s](datafusion_expr::Expr).
-//! 3. The planner [`SqlToRel`](sql::planner::SqlToRel) converts logical nodes
on the AST to a [`LogicalPlan`](datafusion_expr::LogicalPlan).
-//! 4. [`OptimizerRules`](optimizer::optimizer::OptimizerRule) are applied to
the [`LogicalPlan`](datafusion_expr::LogicalPlan) to optimize it.
-//! 5. The [`LogicalPlan`](datafusion_expr::LogicalPlan) is converted to an
[`ExecutionPlan`](physical_plan::ExecutionPlan) by a
[`PhysicalPlanner`](physical_plan::PhysicalPlanner)
-//! 6. The [`ExecutionPlan`](physical_plan::ExecutionPlan) is executed against
data through the [`SessionContext`](execution::context::SessionContext)
+//! 1. The string is parsed to an abstract syntax tree (AST) using [sqlparser].
+//! 2. The planner [`SqlToRel`] converts expressions in the AST into
+//! logical expressions ([`Expr`]s).
+//! 3. The planner [`SqlToRel`] converts logical nodes on the AST to a
[`LogicalPlan`].
+//! 4. [`OptimizerRule`]s are applied to the [`LogicalPlan`] to optimize it.
+//! 5. The [`LogicalPlan`] is converted to an [`ExecutionPlan`] by a
[`PhysicalPlanner`]
+//! 6. The [`ExecutionPlan`] is executed against data through the
+//! [`SessionContext`]
//!
-//! With a [`DataFrame`](dataframe::DataFrame) API, steps 1-3 are not used as
the DataFrame builds the [`LogicalPlan`](datafusion_expr::LogicalPlan) directly.
+//! With the [`DataFrame`] API, steps 1-3 are not used as the DataFrame builds
the [`LogicalPlan`] directly.
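The optimization step (4) can be sketched with a toy constant-folding rewrite over a minimal expression type. `Expr` here is an illustrative stand-in for the real `datafusion_expr::Expr`, and the function plays the role of a single `OptimizerRule` pass:

```rust
/// A tiny logical expression type (illustrative only).
#[derive(Debug, PartialEq)]
enum Expr {
    Literal(i64),
    Column(String),
    Add(Box<Expr>, Box<Expr>),
}

/// A rewrite pass in the spirit of an `OptimizerRule`:
/// fold `1 + 2` into `3`, bottom-up, leaving columns untouched.
fn fold_constants(expr: Expr) -> Expr {
    match expr {
        Expr::Add(l, r) => match (fold_constants(*l), fold_constants(*r)) {
            // Both sides are now literals: evaluate at plan time.
            (Expr::Literal(a), Expr::Literal(b)) => Expr::Literal(a + b),
            // Otherwise rebuild the node with the rewritten children.
            (l, r) => Expr::Add(Box::new(l), Box::new(r)),
        },
        other => other,
    }
}
```

Real optimizer rules rewrite whole `LogicalPlan` trees (and DataFusion already ships a constant-folding/simplification pass), but the recursive rewrite shape is the same.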
//!
//! Phases 1-5 are typically cheap when compared to phase 6, and thus
//! DataFusion puts a lot of effort into ensuring that phase 6 runs
//! efficiently and without errors.
//!
//! DataFusion's planning is divided into two main parts: logical planning
//! and physical planning.
//!
-//! ### Logical plan
+//! ### Logical planning
//!
-//! Logical planning yields [`logical plans`](datafusion_expr::LogicalPlan)
and [`logical expressions`](datafusion_expr::Expr).
-//! These are [`Schema`](arrow::datatypes::Schema)-aware traits that represent
statements whose result is independent of how it should physically be executed.
+//! Logical planning yields [`LogicalPlan`]s and logical [`Expr`]
+//! expressions, which are [`Schema`]-aware and represent statements
+//! whose results are independent of how they are physically
+//! executed.
//!
-//! A [`LogicalPlan`](datafusion_expr::LogicalPlan) is a Directed Acyclic
Graph (DAG) of other [`LogicalPlan`s](datafusion_expr::LogicalPlan) and each
node contains logical expressions ([`Expr`s](logical_expr::Expr)).
-//! All of these are located in [`datafusion_expr`](datafusion_expr).
+//! A [`LogicalPlan`] is a Directed Acyclic Graph (DAG) of other
+//! [`LogicalPlan`]s, and each node contains [`Expr`]s. All of these
+//! are located in the [`datafusion_expr`] module.
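The DAG shape described above can be sketched with a toy plan type (illustrative only; the real `LogicalPlan` has many more variants). Holding inputs by `Arc` is what lets a common subplan be shared by several parents:

```rust
use std::sync::Arc;

/// A tiny `LogicalPlan`-style node type (illustrative only).
#[derive(Debug)]
enum Plan {
    Scan { table: String },
    Filter { input: Arc<Plan> },
    Join { left: Arc<Plan>, right: Arc<Plan> },
}

/// Return the direct inputs of a node, as a `LogicalPlan::inputs`-style
/// accessor would.
fn inputs(plan: &Plan) -> Vec<&Arc<Plan>> {
    match plan {
        Plan::Scan { .. } => vec![],
        Plan::Filter { input } => vec![input],
        Plan::Join { left, right } => vec![left, right],
    }
}
```

A self-join, for example, can reference the same scan node from both sides without duplicating it.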
//!
-//! ### Physical plan
+//! ### Physical planning
//!
-//! A Physical plan ([`ExecutionPlan`](physical_plan::ExecutionPlan)) is a
plan that can be executed against data.
-//! Contrarily to a logical plan, the physical plan has concrete information
about how the calculation
-//! should be performed (e.g. what Rust functions are used) and how data
should be loaded into memory.
+//! An [`ExecutionPlan`] (sometimes referred to as a "physical plan")
+//! is a plan that can be executed against data. Compared to a
+//! logical plan, the physical plan has concrete information about how
+//! calculations should be performed (e.g. what Rust functions are
+//! used) and how data should be loaded into memory.
//!
-//! [`ExecutionPlan`](physical_plan::ExecutionPlan) uses the Arrow format as
its in-memory representation of data, through the [arrow] crate.
-//! We recommend going through [its documentation](arrow) for details on how
the data is physically represented.
+//! [`ExecutionPlan`]s use the [Apache Arrow] format as their in-memory
+//! representation of data, through the [arrow] crate. The [arrow]
+//! crate documents how the data is physically represented in memory.
//!
-//! A [`ExecutionPlan`](physical_plan::ExecutionPlan) is composed by nodes
(implement the trait [`ExecutionPlan`](physical_plan::ExecutionPlan)),
-//! and each node is composed by physical expressions
([`PhysicalExpr`](physical_plan::PhysicalExpr))
-//! or aggreagate expressions
([`AggregateExpr`](physical_plan::AggregateExpr)).
-//! All of these are located in the module [`physical_plan`](physical_plan).
+//! An [`ExecutionPlan`] is composed of nodes (each of which implements
+//! the [`ExecutionPlan`] trait). Each node can contain physical
+//! expressions ([`PhysicalExpr`]) or aggregate expressions
+//! ([`AggregateExpr`]). All of these are located in the
+//! [`physical_plan`] module.
//!
//! Broadly speaking,
//!
-//! * an [`ExecutionPlan`](physical_plan::ExecutionPlan) receives a partition
number and asynchronously returns
-//! an iterator over [`RecordBatch`](arrow::record_batch::RecordBatch)
-//! (a node-specific struct that implements
[`RecordBatchReader`](arrow::record_batch::RecordBatchReader))
-//! * a [`PhysicalExpr`](physical_plan::PhysicalExpr) receives a
[`RecordBatch`](arrow::record_batch::RecordBatch)
-//! and returns an [`Array`](arrow::array::Array)
-//! * an [`AggregateExpr`](physical_plan::AggregateExpr) receives
[`RecordBatch`es](arrow::record_batch::RecordBatch)
-//! and returns a [`RecordBatch`](arrow::record_batch::RecordBatch) of a
single row(*)
+//! * an [`ExecutionPlan`] receives a partition number and
+//! asynchronously returns an iterator over [`RecordBatch`] (a
+//! node-specific struct that implements [`RecordBatchReader`])
+//! * a [`PhysicalExpr`] receives a [`RecordBatch`]
+//! and returns an [`Array`]
+//! * an [`AggregateExpr`] receives a series of [`RecordBatch`]es
+//! and returns a [`RecordBatch`] of a single row (*)
//!
//! (*) Technically, it aggregates the results on each partition and then
merges the results into a single partition.
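The per-partition aggregation in (*) can be sketched std-only, with a `Vec<i64>` standing in for a single-column `RecordBatch` (names and types are illustrative, not DataFusion's):

```rust
/// One column of values; stand-in for a `RecordBatch`.
type RecordBatch = Vec<i64>;

/// Partial aggregation over one partition's batches. In a real engine each
/// partition runs independently, typically on its own thread.
fn sum_partition(batches: impl IntoIterator<Item = RecordBatch>) -> i64 {
    batches.into_iter().flatten().sum()
}

/// Final step: merge the per-partition partial results into a
/// single one-row "batch".
fn merge(partials: Vec<i64>) -> RecordBatch {
    vec![partials.into_iter().sum()]
}
```

Summing two partitions' partials and then merging them yields the same single-row result as summing everything at once, which is what makes the parallel split safe for an aggregate like `SUM`.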
//!
@@ -173,39 +248,24 @@
//! * Scan from memory: [`MemoryExec`](physical_plan::memory::MemoryExec)
//! * Explain the plan: [`ExplainExec`](physical_plan::explain::ExplainExec)
//!
-//! ## Customize
-//!
-//! DataFusion allows users to
-//! * extend the planner to use user-defined logical and physical nodes
([`QueryPlanner`](execution::context::QueryPlanner))
-//! * declare and use user-defined scalar functions
([`ScalarUDF`](physical_plan::udf::ScalarUDF))
-//! * declare and use user-defined aggregate functions
([`AggregateUDF`](physical_plan::udaf::AggregateUDF))
-//!
-//! You can find examples of each of them in examples section.
-//!
-//! ## Examples
-//!
-//! Examples are located in [datafusion-examples
directory](https://github.com/apache/arrow-datafusion/tree/main/datafusion-examples)
-//!
-//! Here's how to run them
-//!
-//! ```bash
-//! git clone https://github.com/apache/arrow-datafusion
-//! cd arrow-datafusion
-//! # Download test data
-//! git submodule update --init
-//!
-//! cargo run --example csv_sql
-//!
-//! cargo run --example parquet_sql
-//!
-//! cargo run --example dataframe
-//!
-//! cargo run --example dataframe_in_memory
-//!
-//! cargo run --example simple_udaf
-//!
-//! cargo run --example simple_udf
-//! ```
+//! Future topics (coming soon):
+//! * Analyzer Rules
+//! * Resource management (memory and disk)
+//!
+//! [sqlparser]: https://docs.rs/sqlparser/latest/sqlparser
+//! [`SqlToRel`]: sql::planner::SqlToRel
+//! [`Expr`]: datafusion_expr::Expr
+//! [`LogicalPlan`]: datafusion_expr::LogicalPlan
+//! [`OptimizerRule`]: optimizer::optimizer::OptimizerRule
+//! [`ExecutionPlan`]: physical_plan::ExecutionPlan
+//! [`PhysicalPlanner`]: physical_plan::PhysicalPlanner
+//! [`Schema`]: arrow::datatypes::Schema
+//! [`datafusion_expr`]: datafusion_expr
+//! [`PhysicalExpr`]: physical_plan::PhysicalExpr
+//! [`AggregateExpr`]: physical_plan::AggregateExpr
+//! [`RecordBatch`]: arrow::record_batch::RecordBatch
+//! [`RecordBatchReader`]: arrow::record_batch::RecordBatchReader
+//! [`Array`]: arrow::array::Array
/// DataFusion crate version
pub const DATAFUSION_VERSION: &str = env!("CARGO_PKG_VERSION");
diff --git a/datafusion/expr/src/operator.rs b/datafusion/expr/src/operator.rs
index 659e0d1af3..d554c5cebb 100644
--- a/datafusion/expr/src/operator.rs
+++ b/datafusion/expr/src/operator.rs
@@ -113,7 +113,7 @@ impl Operator {
/// Return true if the operator is a numerical operator.
///
/// For example, 'Binary(a, +, b)' would be a numerical expression.
- /// PostgresSQL concept:
https://www.postgresql.org/docs/7.0/operators2198.htm
+ /// PostgreSQL concept:
<https://www.postgresql.org/docs/7.0/operators2198.htm>
pub fn is_numerical_operators(&self) -> bool {
matches!(
self,
diff --git a/datafusion/physical-expr/src/sort_expr.rs
b/datafusion/physical-expr/src/sort_expr.rs
index 0f01760646..1699d25bf2 100644
--- a/datafusion/physical-expr/src/sort_expr.rs
+++ b/datafusion/physical-expr/src/sort_expr.rs
@@ -195,7 +195,7 @@ impl PhysicalSortRequirement {
///
/// This function converts `PhysicalSortRequirement` to `PhysicalSortExpr`
/// for each entry in the input. If required ordering is None for an entry
- /// default ordering `ASC, NULLS LAST` if given (see [`into_sort_expr`])
+ /// the default ordering `ASC, NULLS LAST` is used (see
[`Self::into_sort_expr`])
pub fn to_sort_exprs(
requirements: impl IntoIterator<Item = PhysicalSortRequirement>,
) -> Vec<PhysicalSortExpr> {
diff --git a/datafusion/sql/src/planner.rs b/datafusion/sql/src/planner.rs
index 3673742d5b..f872aa5676 100644
--- a/datafusion/sql/src/planner.rs
+++ b/datafusion/sql/src/planner.rs
@@ -106,7 +106,7 @@ pub struct PlannerContext {
/// in `PREPARE` statement
prepare_param_data_types: Vec<DataType>,
/// Map of CTE name to logical plan of the WITH clause.
- /// Use Arc<LogicalPlan> to allow cheap cloning
+ /// Use `Arc<LogicalPlan>` to allow cheap cloning
ctes: HashMap<String, Arc<LogicalPlan>>,
/// The query schema of the outer query plan, used to resolve the columns
in subquery
outer_query_schema: Option<DFSchema>,
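The "cheap cloning" rationale in this hunk comes down to `Arc` semantics: cloning an `Arc` copies a pointer and bumps a reference count, never the pointee. A std-only sketch (`BigPlan` is an illustrative stand-in for a large `LogicalPlan`):

```rust
use std::sync::Arc;

/// Illustrative stand-in for a large, expensive-to-copy plan.
struct BigPlan {
    nodes: Vec<String>,
}

/// Hand out another handle to the same plan. This is O(1) regardless of
/// the plan's size; the contents are never duplicated.
fn share(plan: &Arc<BigPlan>) -> Arc<BigPlan> {
    Arc::clone(plan)
}
```

This is why a CTE map of `Arc<LogicalPlan>` can hand the same plan to multiple references without copying it.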
diff --git a/docs/source/contributor-guide/architecture.md
b/docs/source/contributor-guide/architecture.md
index 081866aa8f..48c065f5b7 100644
--- a/docs/source/contributor-guide/architecture.md
+++ b/docs/source/contributor-guide/architecture.md
@@ -19,13 +19,8 @@
# Architecture
-There is no formal document describing DataFusion's architecture yet, but the
following presentations offer a good overview of its different components and
how they interact together.
+DataFusion's code structure and organization are described in the
+[Crate Documentation], to keep that description as close to the source as
+possible.
-- [Apr 2023]: The Apache Arrow DataFusion Architecture talks series by @alamb
- - _Query Engine_: [recording](https://youtu.be/NVKujPxwSBA) and
[slides](https://docs.google.com/presentation/d/1D3GDVas-8y0sA4c8EOgdCvEjVND4s2E7I6zfs67Y4j8/edit#slide=id.p)
- - _Logical Plan and Expressions_: [recording](https://youtu.be/EzZTLiSJnhY)
and
[slides](https://docs.google.com/presentation/d/1ypylM3-w60kVDW7Q6S99AHzvlBgciTdjsAfqNP85K30/edit#slide=id.gbe21b752a6_0_218)
- - _Physical Plan and Execution_: [recording](https://youtu.be/2jkWU3_w6z0)
and
[slides](https://docs.google.com/presentation/d/1cA2WQJ2qg6tx6y4Wf8FH2WVSm9JQ5UgmBWATHdik0hg/edit?usp=sharing)
-- [February 2021]: How DataFusion is used within the Ballista Project is
described in \*Ballista: Distributed Compute with Rust and Apache Arrow:
[recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)
-- [July 2022]: DataFusion and Arrow: Supercharge Your Data Analytical Tool
with a Rusty Query Engine:
[recording](https://www.youtube.com/watch?v=Rii1VTn3seQ) and
[slides](https://docs.google.com/presentation/d/1q1bPibvu64k2b7LPi7Yyb0k3gA1BiUYiUbEklqW1Ckc/view#slide=id.g11054eeab4c_0_1165)
-- [March 2021]: The DataFusion architecture is described in _Query Engine
Design and the Rust-Based DataFusion in Apache Arrow_:
[recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content
starts [~ 15 minutes in](https://www.youtube.com/watch?v=K6eCAVEk4kU&t=875s))
and
[slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934)
-- [February 2021]: How DataFusion is used within the Ballista Project is
described in \*Ballista: Distributed Compute with Rust and Apache Arrow:
[recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)
+[crate documentation]:
https://docs.rs/datafusion/latest/datafusion/index.html#code-organization
diff --git a/docs/source/contributor-guide/index.md
b/docs/source/contributor-guide/index.md
index 2cdeaccc15..7c19ff2e89 100644
--- a/docs/source/contributor-guide/index.md
+++ b/docs/source/contributor-guide/index.md
@@ -23,9 +23,9 @@ We welcome and encourage contributions of all kinds, such as:
1. Tickets with issue reports or feature requests
2. Documentation improvements
-3. Code (PR or PR Review)
+3. Code, both PRs and (especially) PR reviews
-In addition to submitting new PRs, we have a healthy tradition of community
members helping review each other's PRs. Doing so is a great way to help the
community as well as get more familiar with Rust and the relevant codebases.
+In addition to submitting new PRs, we have a healthy tradition of community
members reviewing each other's PRs. Doing so is a great way to help the
community as well as get more familiar with Rust and the relevant codebases.
You can find a curated
[good-first-issue](https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22)
@@ -41,6 +41,11 @@ DataFusion is a very active fast-moving project and we try
to review and merge P
Review bandwidth is currently our most limited resource, and we highly
encourage reviews by the broader community. If you are waiting for your PR to
be reviewed, consider helping review other PRs that are waiting. Such review
both helps the reviewer to learn the codebase and become more expert, as well
as helps identify issues in the PR (such as lack of test coverage), that can be
addressed and make future reviews faster and more efficient.
+Things to look for when reviewing a PR:
+
+1. Is the feature or fix sufficiently covered with tests (see `Test Organization` below)?
+2. Is the code clear, and does it fit the style of the existing codebase?
+
Since we are a worldwide community, we have contributors in many timezones who
review and comment. To ensure anyone who wishes has an opportunity to review a
PR, our committers try to ensure that at least 24 hours passes between when a
"major" PR is approved and when it is merged.
A "major" PR means there is a substantial change in design or a change in the
API. Committers apply their best judgment to determine what constitutes a
substantial change. A "minor" PR might be merged without a 24 hour delay, again
subject to the judgment of the committer. Examples of potential "minor" PRs are:
@@ -112,15 +117,17 @@ or run them all at once:
### Test Organization
+Tests are very important to ensure that improvements or fixes are not
accidentally broken during subsequent refactorings.
+
DataFusion has several levels of tests in its [Test
Pyramid](https://martinfowler.com/articles/practical-test-pyramid.html)
-and tries to follow [Testing
Organization](https://doc.rust-lang.org/book/ch11-03-test-organization.html) in
the The Book.
+and tries to follow the standard Rust [Testing Organization](https://doc.rust-lang.org/book/ch11-03-test-organization.html) described in The Book.
This section highlights the most important test modules that exist.
#### Unit tests
-Tests for the code in an individual module are defined in the same source file
with a `test` module, following Rust convention
+Tests for the code in an individual module are defined in the same source file
with a `test` module, following Rust convention.
#### Rust Integration Tests
@@ -240,7 +247,7 @@ dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf
## Specifications
-We formalize DataFusion semantics and behaviors through specification
+We formalize some DataFusion semantics and behaviors through specification
documents. These specifications are useful as references to help
resolve ambiguities during development or code reviews.