[GitHub] [arrow-datafusion] alamb opened a new issue, #5309: Make a faster way to check column existence in optimizer (not `is_err()`)

via GitHub Thu, 16 Feb 2023 11:07:42 -0800


alamb opened a new issue, #5309:
URL: https://github.com/apache/arrow-datafusion/issues/5309


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   Related to https://github.com/apache/arrow-datafusion/issues/5157
   
   There are many places in the code that use fallible functions on `DFSchema` 
to check if a column exists:
   
https://docs.rs/datafusion-common/18.0.0/datafusion_common/struct.DFSchema.html#method.index_of
   
https://docs.rs/datafusion-common/18.0.0/datafusion_common/struct.DFSchema.html#method.index_of_column_by_name
   
https://docs.rs/datafusion-common/18.0.0/datafusion_common/struct.DFSchema.html#method.field_from_column
   
   For example, there is code that looks like this (it calls `is_ok()` or 
`is_err()` and discards the error, including its message string):
   ```rust
   input_schema.field_from_column(col).is_ok()
   ```
   
   This is problematic because these functions return a `DataFusionError` that 
not only has an allocated `String` but often also has a nicely formatted error 
message, all of which is built only to be thrown away. You can see them 
appearing in the trace on https://github.com/apache/arrow-datafusion/issues/5157
   
   As part of making the optimizer faster (related to 
https://github.com/apache/arrow-datafusion/issues/5157), we need to avoid these 
string allocations.
   
   Thus I propose:
   
   1. Add new functions for checking that return a bool rather than an error
   2. Replace the uses of `is_ok()` / `is_err()` with calls to the new functions
   
   For example, for `field_from_column` (which finds the field with the given 
qualified column):
   ```rust
   impl DFSchema {
     // existing function that returns Result
     pub fn field_from_column(&self, column: &Column) -> Result<&DFField> {...}
   
     // new function that returns bool  <---- Add this new function
     pub fn has_column(&self, column: &Column) -> bool {...}
   }
   ```
   
   And then replace code that has the pattern
   
   ```rust
   input_schema.field_from_column(col).is_ok()
   ```
   
   With 
   ```rust
   input_schema.has_column(col)
   ```
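
To make the difference concrete, here is a minimal sketch of the two kinds of lookup. It does not use the real DataFusion types: `Schema`, `Field`, `Column`, and `SchemaError` are simplified, hypothetical stand-ins for `DFSchema`, `DFField`, `Column`, and `DataFusionError`, and the matching rule (an unqualified column matches a field with any qualifier) is an assumption made for illustration. The point is only that the `bool` version can run the same search without ever building an error message on a miss.

```rust
/// Hypothetical stand-in for `datafusion_common::Column`.
#[derive(Debug, Clone, PartialEq)]
pub struct Column {
    pub relation: Option<String>,
    pub name: String,
}

/// Hypothetical stand-in for `DFField`.
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub qualifier: Option<String>,
    pub name: String,
}

/// Hypothetical stand-in for `DataFusionError`: it owns a formatted message.
#[derive(Debug)]
pub struct SchemaError(pub String);

/// Hypothetical stand-in for `DFSchema`.
pub struct Schema {
    fields: Vec<Field>,
}

impl Schema {
    pub fn new(fields: Vec<Field>) -> Self {
        Self { fields }
    }

    /// Assumed matching rule: names must be equal, and a qualified column
    /// must also match the field's qualifier.
    fn matches(field: &Field, column: &Column) -> bool {
        field.name == column.name
            && match &column.relation {
                Some(r) => field.qualifier.as_deref() == Some(r.as_str()),
                None => true,
            }
    }

    /// Fallible lookup: a miss formats and allocates an error message,
    /// even if the caller only goes on to call `is_ok()`.
    pub fn field_from_column(&self, column: &Column) -> Result<&Field, SchemaError> {
        self.fields
            .iter()
            .find(|f| Self::matches(f, column))
            .ok_or_else(|| SchemaError(format!("No field named {:?}", column.name)))
    }

    /// Existence check: the same search, but a miss just returns `false`.
    pub fn has_column(&self, column: &Column) -> bool {
        self.fields.iter().any(|f| Self::matches(f, column))
    }
}
```

In this sketch `schema.has_column(&col)` always agrees with `schema.field_from_column(&col).is_ok()`, so call sites can be switched over mechanically.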
   
   
   
   **Describe the solution you'd like**
   Ideally someone would do this transition one function on DFSchema at a time 
(not one giant PR please 🙏 )
   
   **Describe alternatives you've considered**
   There are more involved proposals for larger changes to `DFSchema`, but 
simply avoiding this check might help a lot.
   
   **Additional context**
   I think this is a good first exercise as the desired outcome is well spelled 
out and it is a software engineering exercise rather than one that requires 
deep DataFusion expertise.


