alamb commented on code in PR #7040:
URL: https://github.com/apache/arrow-datafusion/pull/7040#discussion_r1270550382
##########
datafusion/common/src/dfschema.rs:
##########
@@ -28,18 +29,58 @@ use crate::{field_not_found, Column, OwnedTableReference,
TableReference};
use arrow::compute::can_cast_types;
use arrow::datatypes::{DataType, Field, FieldRef, Fields, Schema, SchemaRef};
-use std::fmt::{Display, Formatter};
/// A reference-counted reference to a `DFSchema`.
pub type DFSchemaRef = Arc<DFSchema>;
+/// Stores identifier keys and their associated column indices. An identifier
key
Review Comment:
I think a term for this concept in the database world is "Functional
Dependence" where the value of one or more columns determines the value of one
or more others. There is classic way to represent them and a standard set of
axioms like https://en.wikipedia.org/wiki/Armstrong%27s_axioms for reasoning
about them
I recommend we change to using this the functional dependence term so that:
1. This (really cool) feature is easier to discover
2. We can rely on existing explanations of the concept more rather than
having to explain it fully in DataFusion. For example, google will tell you a
lot about this
https://www.google.com/search?q=functional+dependence+in+database. This article
seems to have a good treatment of it
https://www.scaler.com/topics/dbms/functional-dependency-in-dbms/
3. We can rely on prior art for rules and representations. In your example
with `sn` and `amount`, I think would be modeled like `{sn, amount} -> {amount}`
##########
datafusion/core/src/datasource/default_table_source.rs:
##########
@@ -52,6 +54,11 @@ impl TableSource for DefaultTableSource {
self.table_provider.schema()
}
+ /// Get a reference to primary key indices, if a primary key exists.
+ fn primary_keys(&self) -> Option<&[usize]> {
Review Comment:
Would it be possible to make this API more general? Specifically, there are
several kinds of keys that a table might have so special casing primary key
will make supporting others harder in the future
Single column primary key
```sql
create table t (
x int primary key,
y int,
z int)
```
Unique (like primary key, but can be null)
```sql
create table t (
x int unique,
y int,
z int)
```
Multi column primary key
```sql
create table t (
x int,
y int,
z int,
primary key (x, y)
)
```
Foreign key:
```sql
create table t (
x int,
y int,
z int,
foreign key (x) references other_table(a)
)
```
Perhaps we could make it this more general someting like
```rust
pub enum Constraints {
// columns together form a primary key (are unique and not null)
PrimaryKey(Vec<usize>),
// columns together form a unique key
Unique(Vec<usize>),
// More constraint types to come as we add additional support
}
```
fn constraints(&self) -> Vec<Constraints> {
...
}
```
##########
datafusion/common/src/dfschema.rs:
##########
@@ -28,18 +29,58 @@ use crate::{field_not_found, Column, OwnedTableReference,
TableReference};
use arrow::compute::can_cast_types;
use arrow::datatypes::{DataType, Field, FieldRef, Fields, Schema, SchemaRef};
-use std::fmt::{Display, Formatter};
/// A reference-counted reference to a `DFSchema`.
pub type DFSchemaRef = Arc<DFSchema>;
+/// Stores identifier keys and their associated column indices. An identifier
key
+/// is a column whose value determines values of some other (dependent)
columns.
+/// These dependent columns are the "associated columns" of this identifier
key.
+/// If two rows have the same identifier key, associated columns in these rows
+/// are necessarily the same. If the identifier key is unique, the set of
+/// associated columns is equal to the entire schema and the identifier key can
+/// serve as a primary key. Note that a primary key may "downgrade" into an
+/// identifier key due to an operation such as a join, and this object is used
to
+/// track dependence relationships in such cases.
+#[derive(Debug, Clone, PartialEq, Eq)]
+pub struct IdentifierKeyGroup {
Review Comment:
Another potential way to implement this might be
```rust
enum FunctionalDependence {
source_indices: vec<Usize>,
target_indices: Vec<usize>,
type: Dependency,
}
enum Dependency {
Single, // unique constraint
Multi, // non unique (e.g. multipart primary key)
}
```
But I think what you have here is basically the same, except for the the
names
##########
datafusion/sql/src/select.rs:
##########
@@ -431,6 +431,51 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
group_by_exprs: Vec<Expr>,
aggr_exprs: Vec<Expr>,
) -> Result<(LogicalPlan, Vec<Expr>, Option<Expr>)> {
+ let schema = input.schema();
+ let id_key_groups = schema.identifier_key_groups();
Review Comment:
here is another good candidate to put into a method on `IdentifierKeyGroups`
with a named function which I think would keep the sql planner easier to read
and understand what was happening
##########
datafusion/common/src/dfschema.rs:
##########
@@ -116,14 +162,36 @@ impl DFSchema {
)
}
+ /// Assign identifier key groups
+ pub fn with_identifier_key_groups(
+ mut self,
+ identifier_key_groups: IdentifierKeyGroups,
+ ) -> Self {
+ self.identifier_key_groups = identifier_key_groups;
+ self
+ }
+
/// Create a new schema that contains the fields from this schema followed
by the fields
/// from the supplied schema. An error will be returned if there are
duplicate field names.
pub fn join(&self, schema: &DFSchema) -> Result<Self> {
let mut fields = self.fields.clone();
let mut metadata = self.metadata.clone();
fields.extend_from_slice(schema.fields().as_slice());
metadata.extend(schema.metadata.clone());
- Self::new_with_metadata(fields, metadata)
+ let mut id_key_groups = self.identifier_key_groups.clone();
Review Comment:
style nit is if you encapsulated this code in its own function (like
Schema::join_functional_dependencies())` it might be more self-describing
##########
datafusion/expr/src/logical_plan/builder.rs:
##########
@@ -1087,9 +1105,48 @@ pub fn build_join_schema(
}
};
+ let mut right_id_key_groups = add_offset_to_identifier_key_groups(
Review Comment:
This code I think could also be put as a method on KeyIdentiferGroups which
would make it easier to find
```rust
impl KeyIdentiferGroups {
/// Compute the output dependncies of self JOIN other with the specified
join type
fn join(&self, other:&Self, join_type: JoinType) {..}
}
##########
datafusion/expr/src/logical_plan/builder.rs:
##########
@@ -1214,6 +1271,171 @@ pub fn union(left_plan: LogicalPlan, right_plan:
LogicalPlan) -> Result<LogicalP
}))
}
+// Update entries inside the `entries` vector with their corresponding index
+// inside the `proj_indices` vector.
+fn update_elements_with_matching_indices(
+ entries: &[usize],
+ proj_indices: &[usize],
+) -> Vec<usize> {
+ entries
+ .iter()
+ .filter_map(|val| proj_indices.iter().position(|proj_idx| proj_idx ==
val))
+ .collect()
+}
+
+/// Update identifier key indices using the index mapping in `proj_indices`.
+/// Assume that `proj_indices` is \[2, 5, 8\] and identifier key groups is
+/// \[5\] -> \[5, 8\] in `df_schema`. Then, the return value will be \[1\] ->
\[1, 2\].
+/// This means that the first index of the `proj_indices` (5) is an identifier
key,
+/// and this identifier key is associated with columns at indices 1 and 2 (in
+/// the updated schema). In the updated schema, fields at indices \[2, 5, 8\]
will
+/// be at \[0, 1, 2\].
+pub fn project_identifier_key_indices(
Review Comment:
this is another good example of code that would be easier to find it if were
a member function of `IdentifierKeygroups`
##########
datafusion/common/src/dfschema.rs:
##########
@@ -28,18 +29,58 @@ use crate::{field_not_found, Column, OwnedTableReference,
TableReference};
use arrow::compute::can_cast_types;
use arrow::datatypes::{DataType, Field, FieldRef, Fields, Schema, SchemaRef};
-use std::fmt::{Display, Formatter};
/// A reference-counted reference to a `DFSchema`.
pub type DFSchemaRef = Arc<DFSchema>;
+/// Stores identifier keys and their associated column indices. An identifier
key
+/// is a column whose value determines values of some other (dependent)
columns.
+/// These dependent columns are the "associated columns" of this identifier
key.
+/// If two rows have the same identifier key, associated columns in these rows
+/// are necessarily the same. If the identifier key is unique, the set of
+/// associated columns is equal to the entire schema and the identifier key can
+/// serve as a primary key. Note that a primary key may "downgrade" into an
+/// identifier key due to an operation such as a join, and this object is used
to
+/// track dependence relationships in such cases.
+#[derive(Debug, Clone, PartialEq, Eq)]
+pub struct IdentifierKeyGroup {
+ pub identifier_key_indices: Vec<usize>,
+ /// Flag indicating whether this identifier key is unique.
+ /// If true, these indices constitute a primary key. Otherwise, the
+ /// identifier key still uniquely determines its associated columns.
+ pub is_unique: bool,
+ pub associated_indices: Vec<usize>,
+}
+
+impl IdentifierKeyGroup {
+ pub fn new(
+ identifier_key_indices: Vec<usize>,
+ associated_indices: Vec<usize>,
+ ) -> Self {
+ Self {
+ identifier_key_indices,
+ is_unique: false,
+ associated_indices,
+ }
+ }
+
+ pub fn with_is_unique(mut self, is_unique: bool) -> Self {
+ self.is_unique = is_unique;
+ self
+ }
+}
+
+pub type IdentifierKeyGroups = Vec<IdentifierKeyGroup>;
Review Comment:
Stylistically if you made this a real type, it would:
1. encapsulate the manipulation more (like making
`add_offset_to_identifier_key_groups` a method)
2. Make finding the available operations on it easier in the future
Something like
```suggestion
pub struct IdentifierKeyGroups {
inner: Vec<IdentifierKeyGroup>;
}
impl IdentifierKeyGroups {
pub add_offset(&mut self, offset: usize) {..}
...
}
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]