[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #7040: Utilize Primary key information, to support additional queries

via GitHub Fri, 21 Jul 2023 04:41:36 -0700


alamb commented on code in PR #7040:
URL: https://github.com/apache/arrow-datafusion/pull/7040#discussion_r1270550382



##########
datafusion/common/src/dfschema.rs:
##########
@@ -28,18 +29,58 @@ use crate::{field_not_found, Column, OwnedTableReference, 
TableReference};
 
 use arrow::compute::can_cast_types;
 use arrow::datatypes::{DataType, Field, FieldRef, Fields, Schema, SchemaRef};
-use std::fmt::{Display, Formatter};
 
 /// A reference-counted reference to a `DFSchema`.
 pub type DFSchemaRef = Arc<DFSchema>;
 
+/// Stores identifier keys and their associated column indices. An identifier 
key

Review Comment:
   I think a term for this concept in the database world is "Functional 
Dependence" where the value of one or more columns determines the value of one 
or more others. There is classic way to represent them and a standard set of 
axioms like https://en.wikipedia.org/wiki/Armstrong%27s_axioms for reasoning 
about them 
   
   I recommend we change to using this the functional dependence term so that:
   1. This (really cool) feature is easier to discover
   2. We can rely on existing explanations of the concept more rather than 
having to explain it fully in DataFusion. For example, google will tell you a 
lot about this 
https://www.google.com/search?q=functional+dependence+in+database. This article 
seems to have a good treatment of it 
https://www.scaler.com/topics/dbms/functional-dependency-in-dbms/
   3.  We can rely on prior art for rules and representations. In your example 
with `sn` and `amount`, I think would be modeled like `{sn, amount} -> {amount}`
   
   
   



##########
datafusion/core/src/datasource/default_table_source.rs:
##########
@@ -52,6 +54,11 @@ impl TableSource for DefaultTableSource {
         self.table_provider.schema()
     }
 
+    /// Get a reference to primary key indices, if a primary key exists.
+    fn primary_keys(&self) -> Option<&[usize]> {

Review Comment:
   Would it be possible to make this API more general? Specifically, there are 
several kinds of keys that a table might have so special casing primary key 
will make supporting others harder in the future
   
   Single column primary key
   ```sql
   create table t (
   x int primary key, 
   y int, 
   z int)
   ```
   
   Unique (like primary key, but can be null)
   ```sql
   create table t (
   x int unique, 
   y int, 
   z int)
   ```
   
   Multi column primary key
   ```sql
   create table t (
   x int, 
   y int, 
   z int,
   primary key (x, y)
   )
   ```
   
   Foreign key:
   ```sql
   create table t (
   x int, 
   y int, 
   z int,
   foreign key (x) references other_table(a)
   )
   ```
   
   
   
   Perhaps we could make it this more general someting like
   
   ```rust
   pub enum Constraints {
     // columns together form a primary key (are unique and not null)
     PrimaryKey(Vec<usize>), 
     // columns together form a unique key
     Unique(Vec<usize>), 
     // More constraint types to come as we add additional support
   }
   ```
   
   fn constraints(&self) -> Vec<Constraints> {
   ...
   }
   ```
   



##########
datafusion/common/src/dfschema.rs:
##########
@@ -28,18 +29,58 @@ use crate::{field_not_found, Column, OwnedTableReference, 
TableReference};
 
 use arrow::compute::can_cast_types;
 use arrow::datatypes::{DataType, Field, FieldRef, Fields, Schema, SchemaRef};
-use std::fmt::{Display, Formatter};
 
 /// A reference-counted reference to a `DFSchema`.
 pub type DFSchemaRef = Arc<DFSchema>;
 
+/// Stores identifier keys and their associated column indices. An identifier 
key
+/// is a column whose value determines values of some other (dependent) 
columns.
+/// These dependent columns are the "associated columns" of this identifier 
key.
+/// If two rows have the same identifier key, associated columns in these rows
+/// are necessarily the same. If the identifier key is unique, the set of
+/// associated columns is equal to the entire schema and the identifier key can
+/// serve as a primary key. Note that a primary key may "downgrade" into an
+/// identifier key due to an operation such as a join, and this object is used 
to
+/// track dependence relationships in such cases.
+#[derive(Debug, Clone, PartialEq, Eq)]
+pub struct IdentifierKeyGroup {

Review Comment:
   Another potential way to implement this might be
   
   
   ```rust
   enum FunctionalDependence {
     source_indices: vec<Usize>,
     target_indices: Vec<usize>,
     type: Dependency,
   }
   
   enum Dependency {
     Single, // unique constraint
     Multi, // non unique (e.g. multipart primary key)
   }
   ```
   But I think what you have here is basically the same, except for the the 
names



##########
datafusion/sql/src/select.rs:
##########
@@ -431,6 +431,51 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
         group_by_exprs: Vec<Expr>,
         aggr_exprs: Vec<Expr>,
     ) -> Result<(LogicalPlan, Vec<Expr>, Option<Expr>)> {
+        let schema = input.schema();
+        let id_key_groups = schema.identifier_key_groups();

Review Comment:
   here is another good candidate to put into a method on `IdentifierKeyGroups` 
 with a named function which I think would keep the sql planner easier to read 
and understand what was happening



##########
datafusion/common/src/dfschema.rs:
##########
@@ -116,14 +162,36 @@ impl DFSchema {
         )
     }
 
+    /// Assign identifier key groups
+    pub fn with_identifier_key_groups(
+        mut self,
+        identifier_key_groups: IdentifierKeyGroups,
+    ) -> Self {
+        self.identifier_key_groups = identifier_key_groups;
+        self
+    }
+
     /// Create a new schema that contains the fields from this schema followed 
by the fields
     /// from the supplied schema. An error will be returned if there are 
duplicate field names.
     pub fn join(&self, schema: &DFSchema) -> Result<Self> {
         let mut fields = self.fields.clone();
         let mut metadata = self.metadata.clone();
         fields.extend_from_slice(schema.fields().as_slice());
         metadata.extend(schema.metadata.clone());
-        Self::new_with_metadata(fields, metadata)
+        let mut id_key_groups = self.identifier_key_groups.clone();

Review Comment:
   style nit is if you encapsulated this code in its own function (like 
Schema::join_functional_dependencies())` it might be more self-describing



##########
datafusion/expr/src/logical_plan/builder.rs:
##########
@@ -1087,9 +1105,48 @@ pub fn build_join_schema(
         }
     };
 
+    let mut right_id_key_groups = add_offset_to_identifier_key_groups(

Review Comment:
   This code I think could also be put as a method on KeyIdentiferGroups which 
would make it easier to find 
   
   ```rust
   impl KeyIdentiferGroups {
     /// Compute the output dependncies of self JOIN other with the specified 
join type
     fn join(&self, other:&Self, join_type: JoinType) {..}
   }



##########
datafusion/expr/src/logical_plan/builder.rs:
##########
@@ -1214,6 +1271,171 @@ pub fn union(left_plan: LogicalPlan, right_plan: 
LogicalPlan) -> Result<LogicalP
     }))
 }
 
+// Update entries inside the `entries` vector with their corresponding index
+// inside the `proj_indices` vector.
+fn update_elements_with_matching_indices(
+    entries: &[usize],
+    proj_indices: &[usize],
+) -> Vec<usize> {
+    entries
+        .iter()
+        .filter_map(|val| proj_indices.iter().position(|proj_idx| proj_idx == 
val))
+        .collect()
+}
+
+/// Update identifier key indices using the index mapping in `proj_indices`.
+/// Assume that `proj_indices` is \[2, 5, 8\] and identifier key groups is
+/// \[5\] -> \[5, 8\] in `df_schema`. Then, the return value will be \[1\] -> 
\[1, 2\].
+/// This means that the first index of the `proj_indices` (5) is an identifier 
key,
+/// and this identifier key is associated with columns at indices 1 and 2 (in
+/// the updated schema). In the updated schema, fields at indices \[2, 5, 8\] 
will
+/// be at \[0, 1, 2\].
+pub fn project_identifier_key_indices(

Review Comment:
   this is another good example of code that would be easier to find it if were 
a member function of `IdentifierKeygroups`



##########
datafusion/common/src/dfschema.rs:
##########
@@ -28,18 +29,58 @@ use crate::{field_not_found, Column, OwnedTableReference, 
TableReference};
 
 use arrow::compute::can_cast_types;
 use arrow::datatypes::{DataType, Field, FieldRef, Fields, Schema, SchemaRef};
-use std::fmt::{Display, Formatter};
 
 /// A reference-counted reference to a `DFSchema`.
 pub type DFSchemaRef = Arc<DFSchema>;
 
+/// Stores identifier keys and their associated column indices. An identifier 
key
+/// is a column whose value determines values of some other (dependent) 
columns.
+/// These dependent columns are the "associated columns" of this identifier 
key.
+/// If two rows have the same identifier key, associated columns in these rows
+/// are necessarily the same. If the identifier key is unique, the set of
+/// associated columns is equal to the entire schema and the identifier key can
+/// serve as a primary key. Note that a primary key may "downgrade" into an
+/// identifier key due to an operation such as a join, and this object is used 
to
+/// track dependence relationships in such cases.
+#[derive(Debug, Clone, PartialEq, Eq)]
+pub struct IdentifierKeyGroup {
+    pub identifier_key_indices: Vec<usize>,
+    /// Flag indicating whether this identifier key is unique.
+    /// If true, these indices constitute a primary key. Otherwise, the
+    /// identifier key still uniquely determines its associated columns.
+    pub is_unique: bool,
+    pub associated_indices: Vec<usize>,
+}
+
+impl IdentifierKeyGroup {
+    pub fn new(
+        identifier_key_indices: Vec<usize>,
+        associated_indices: Vec<usize>,
+    ) -> Self {
+        Self {
+            identifier_key_indices,
+            is_unique: false,
+            associated_indices,
+        }
+    }
+
+    pub fn with_is_unique(mut self, is_unique: bool) -> Self {
+        self.is_unique = is_unique;
+        self
+    }
+}
+
+pub type IdentifierKeyGroups = Vec<IdentifierKeyGroup>;

Review Comment:
   Stylistically if you made this a real type, it would:
   1. encapsulate the manipulation more (like making 
`add_offset_to_identifier_key_groups` a method)
   2. Make finding the available operations on it easier in the future
   
   Something like
   
   
   ```suggestion
   pub struct IdentifierKeyGroups {
     inner:  Vec<IdentifierKeyGroup>;
   }
   
   impl IdentifierKeyGroups {
     pub add_offset(&mut self, offset: usize) {..}
     ...
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #7040: Utilize Primary key information, to support additional queries

Reply via email to