alamb commented on a change in pull request #1033:
URL: https://github.com/apache/arrow-rs/pull/1033#discussion_r770938887



##########
File path: arrow/src/record_batch.rs
##########
@@ -175,6 +175,28 @@ impl RecordBatch {
         self.schema.clone()
     }
 
+    /// Projects the schema onto the specified columns
+    pub fn project(
+        &self,
+        indices: impl IntoIterator<Item = usize> + Clone,
+    ) -> Result<RecordBatch> {
+        let projected_schema = self.schema.project(indices.clone())?;
+        let batch_fields = indices
+            .into_iter()
+            .map(|f: usize| {
+                self.columns.get(f).cloned().ok_or_else(|| {
+                    ArrowError::SchemaError(format!(
+                        "project index {} out of bounds, max field {}",
+                        f,
+                        self.columns.len()
+                    ))
+                })
+            })
+            .collect::<Result<Vec<_>>>()?;
+
+        RecordBatch::try_new(SchemaRef::new(projected_schema), batch_fields)
+    }
+

Review comment:
       How about some tests?
   
   Perhaps something like
   
   ```rust
       #[test]
       fn project() {
           let a: ArrayRef = Arc::new(Int32Array::from(vec![
               Some(1),
               None,
               Some(3),
           ]));
           let b: ArrayRef = Arc::new(StringArray::from(vec!["a", "b", "c"]));
           let c: ArrayRef = Arc::new(StringArray::from(vec!["d", "e", "f"]));
   
            let record_batch = RecordBatch::try_from_iter(vec![
                ("a", a.clone()),
                ("b", b.clone()),
                ("c", c.clone()),
            ])
            .expect("valid conversion");
   
           let expected = RecordBatch::try_from_iter(vec![("a", a), ("c", c)])
               .expect("valid conversion");
   
           assert_eq!(expected, record_batch.project(&vec![0, 2]).unwrap());
       }
   ```

##########
File path: arrow/src/record_batch.rs
##########
@@ -175,6 +175,28 @@ impl RecordBatch {
         self.schema.clone()
     }
 
+    /// Projects the schema onto the specified columns
+    pub fn project(
+        &self,
+        indices: impl IntoIterator<Item = usize> + Clone,
+    ) -> Result<RecordBatch> {
+        let projected_schema = self.schema.project(indices.clone())?;

Review comment:
       I see now why you needed to make the iter `Clone`, which is kind of annoying 🤔
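   
   To spell out why (a standalone sketch with made-up names, not the actual PR code): the `indices` argument has to be consumed twice, once for the schema projection and once to pick out the columns, so the bound ends up needing `Clone`:
   
    ```rust
    // Hypothetical helper (not part of arrow) showing the double consumption
    // that forces the `Clone` bound on the indices iterator.
    fn consume_twice<I>(indices: I) -> (Vec<usize>, Vec<usize>)
    where
        I: IntoIterator<Item = usize> + Clone,
    {
        // First pass, analogous to `self.schema.project(indices.clone())`.
        let for_schema: Vec<usize> = indices.clone().into_iter().collect();
        // Second pass, analogous to gathering the projected columns.
        let for_columns: Vec<usize> = indices.into_iter().collect();
        (for_schema, for_columns)
    }

    fn main() {
        let (for_schema, for_columns) = consume_twice(vec![0, 2]);
        assert_eq!(for_schema, for_columns);
    }
    ```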

##########
File path: arrow/src/datatypes/schema.rs
##########
@@ -87,6 +87,24 @@ impl Schema {
         Self { fields, metadata }
     }
 
+    /// Returns a new schema with only the specified columns in the new schema
+    /// This carries metadata from the parent schema over as well
+    pub fn project(&self, indices: impl IntoIterator<Item = usize>) -> Result<Schema> {

Review comment:
       I know I did something different in the ticket, but I think this interface is kind of annoying.
   
   Namely, I couldn't pass in `&vec![1, 2]` 
   
   ```
      --> arrow/src/datatypes/schema.rs:405:40
       |
   405 |         let projected: Schema = schema.project(&vec![0, 2]).unwrap();
        |                                        ^^^^^^^ expected `&{integer}`, found `usize`
   
   ```
   
   What would you think about being less fancy and changing this (and `RecordBatch`) to something like:
   
   ```rust
        pub fn project(&self, indices: &[usize]) -> Result<Schema> {
   ```
   
   Which would then avoid the need for the clone on `RecordBatch::project` as well.
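   
   For illustration, here is a standalone sketch (simplified, hypothetical types rather than the actual arrow ones, so an assumption about the shape of the API, not the real code) of how the slice-based interface would behave: `&Vec<usize>` deref-coerces to `&[usize]`, so `project(&vec![0, 2])` compiles, and the indices can be walked with a plain `iter()` without any cloning:
   
    ```rust
    // Simplified stand-in types (not the arrow definitions) to show the
    // `&[usize]` interface shape.
    #[derive(Debug, PartialEq)]
    struct Schema {
        fields: Vec<String>,
    }

    impl Schema {
        /// Projects the schema onto the specified columns.
        fn project(&self, indices: &[usize]) -> Result<Schema, String> {
            let fields = indices
                .iter()
                .map(|&i| {
                    self.fields
                        .get(i)
                        .cloned()
                        .ok_or_else(|| format!("project index {} out of bounds", i))
                })
                .collect::<Result<Vec<_>, _>>()?;
            Ok(Schema { fields })
        }
    }

    fn main() {
        let schema = Schema {
            fields: vec!["a".to_string(), "b".to_string(), "c".to_string()],
        };
        // `&Vec<usize>` coerces to `&[usize]`, so this now compiles.
        let projected = schema.project(&vec![0, 2]).unwrap();
        assert_eq!(projected.fields, vec!["a".to_string(), "c".to_string()]);
    }
    ```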



