alamb commented on a change in pull request #1033:
URL: https://github.com/apache/arrow-rs/pull/1033#discussion_r768132120
##########
File path: arrow/src/datatypes/schema.rs
##########
@@ -87,6 +87,18 @@ impl Schema {
Self { fields, metadata }
}
+
+ /// Returns a new schema with only the specified columns in the new schema
+ /// This carries metadata from the parent schema over as well
+ pub fn project(&self, indices: impl IntoIterator<Item=usize>) ->
Result<Schema> {
+ let mut new_fields = vec![];
+ for i in indices {
+ let f = self.fields[i].clone();
+ new_fields.push(f);
+ }
Review comment:
I think as written
1. This will `panic!` if an index is not in bounds.
2. It is not "idiomatic rust style" (which to me means avoid `mut`). This is
far less important.
How about something such as (untested):
```suggestion
        let new_fields = indices
            .into_iter()
            .map(|i| {
                // `get` returns `None` instead of panicking on a bad index
                self.fields.get(i).cloned().ok_or_else(|| {
                    ArrowError::SchemaError(format!(
                        "project index {} out of bounds, max field {}",
                        i,
                        self.fields().len()
                    ))
                })
            })
            .collect::<Result<Vec<_>>>()?;
```
Note the use of https://doc.rust-lang.org/std/vec/struct.Vec.html#method.get
to avoid `fields[i]`, and the somewhat confusing use of turbofish
`.collect::<Result<Vec<_>>>()`, which stops at the first `Err` and returns
it. It took me quite a while to get used to that pattern.
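
To make the pattern concrete outside of arrow, here is a minimal standalone sketch. `SchemaError` and `project_fields` are hypothetical names made up for this illustration (they are not part of arrow-rs), and the error type is only a stand-in for something like `ArrowError::SchemaError`:

```rust
/// Hypothetical stand-in for an error type such as `ArrowError::SchemaError`.
#[derive(Debug, PartialEq)]
pub struct SchemaError(pub String);

/// Selects the elements of `fields` at `indices`, returning an error
/// instead of panicking when an index is out of bounds.
pub fn project_fields<T: Clone>(
    fields: &[T],
    indices: impl IntoIterator<Item = usize>,
) -> Result<Vec<T>, SchemaError> {
    indices
        .into_iter()
        .map(|i| {
            // `get` yields `None` for a bad index, where `fields[i]` would panic.
            fields.get(i).cloned().ok_or_else(|| {
                SchemaError(format!(
                    "project index {} out of bounds, number of fields {}",
                    i,
                    fields.len()
                ))
            })
        })
        // An iterator of `Result<T, E>` collects into `Result<Vec<T>, E>`;
        // collection stops at the first `Err`, which becomes the result.
        .collect::<Result<Vec<_>, _>>()
}
```

With three fields, `project_fields(&fields, vec![0, 2])` should give back the first and third field, while `vec![2, 3]` should produce an `Err` rather than a panic.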
##########
File path: arrow/src/datatypes/schema.rs
##########
@@ -369,4 +381,23 @@ mod tests {
assert_eq!(schema, de_schema);
}
+
+ #[test]
+ fn test_project() {
+ let mut metadata = HashMap::new();
+ metadata.insert("meta".to_string(), "data".to_string());
+
+ let schema = Schema::new_with_metadata(vec![
+ Field::new("name", DataType::Utf8, false),
+ Field::new("address", DataType::Utf8, false),
+ Field::new("priority", DataType::UInt8, false),
+ ], metadata);
+
+ let projected: Schema = schema.project(vec![0, 2]).unwrap();
+
+ assert_eq!(projected.fields().len(), 2);
+ assert_eq!(projected.fields()[0].name(), "name");
+ assert_eq!(projected.fields()[1].name(), "priority");
+ assert_eq!(projected.metadata.get("meta").unwrap(), "data")
+ }
Review comment:
Related to the above: I recommend a test for the case where an index is out
of bounds, like `schema.project([2, 3])`
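
A rough sketch of what such a test could look like. To keep it self-contained it uses a hypothetical, heavily simplified `Schema` stand-in (fields are plain strings and the error is a `String`); the real test would use `Schema::new_with_metadata` and `Field` as in the test above:

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for arrow's `Schema`; fields are just names.
#[derive(Debug, Clone, PartialEq)]
pub struct Schema {
    pub fields: Vec<String>,
    pub metadata: HashMap<String, String>,
}

impl Schema {
    /// Returns a schema with only the fields at `indices`, keeping the metadata.
    /// An out-of-bounds index yields `Err` rather than a panic.
    pub fn project(
        &self,
        indices: impl IntoIterator<Item = usize>,
    ) -> Result<Schema, String> {
        let fields = indices
            .into_iter()
            .map(|i| {
                self.fields
                    .get(i)
                    .cloned()
                    .ok_or_else(|| format!("project index {} out of bounds", i))
            })
            .collect::<Result<Vec<_>, _>>()?;
        Ok(Schema { fields, metadata: self.metadata.clone() })
    }
}

#[test]
fn test_project_index_out_of_bounds() {
    let schema = Schema {
        fields: vec!["name".to_string(), "address".to_string(), "priority".to_string()],
        metadata: HashMap::new(),
    };
    // Index 3 is past the end of a three-field schema.
    let result = schema.project(vec![2, 3]);
    assert!(result.is_err());
    assert!(result.unwrap_err().contains("out of bounds"));
}
```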
##########
File path: arrow/src/record_batch.rs
##########
@@ -175,6 +175,12 @@ impl RecordBatch {
self.schema.clone()
}
+
+ /// Projects the schema onto the specified columns
+ pub fn project(&self, indices: impl IntoIterator<Item=usize>) ->
Result<Schema> {
Review comment:
The intent of this method was to project the `RecordBatch` itself rather
than just the schema, so I would expect a signature like this:
```suggestion
pub fn project(&self, indices: impl IntoIterator<Item=usize>) ->
Result<RecordBatch> {
```
(so we would have to project the columns as well as the schema)
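
As a rough illustration of projecting both parts together, here is a sketch using hypothetical, simplified stand-ins for `Schema` and `RecordBatch` (columns are plain `Vec<i64>` and errors are `String`, neither of which matches the real arrow types). The helper `select` is also made up for this example:

```rust
/// Hypothetical, simplified stand-in for arrow's `Schema`.
#[derive(Debug, Clone, PartialEq)]
pub struct Schema {
    pub fields: Vec<String>,
}

/// Hypothetical, simplified stand-in for arrow's `RecordBatch`.
/// Real batches hold reference-counted arrays; plain vectors are used here.
#[derive(Debug, Clone, PartialEq)]
pub struct RecordBatch {
    pub schema: Schema,
    pub columns: Vec<Vec<i64>>,
}

/// Clones the items at `indices`, failing on the first out-of-bounds index.
fn select<T: Clone>(items: &[T], indices: &[usize]) -> Result<Vec<T>, String> {
    indices
        .iter()
        .map(|&i| {
            items
                .get(i)
                .cloned()
                .ok_or_else(|| format!("project index {} out of bounds", i))
        })
        .collect()
}

impl RecordBatch {
    /// Projects both the schema and the columns onto `indices`.
    pub fn project(
        &self,
        indices: impl IntoIterator<Item = usize>,
    ) -> Result<RecordBatch, String> {
        // Materialize the indices once so they can be applied to both
        // the fields and the columns.
        let indices: Vec<usize> = indices.into_iter().collect();
        let fields = select(&self.schema.fields, &indices)?;
        let columns = select(&self.columns, &indices)?;
        Ok(RecordBatch { schema: Schema { fields }, columns })
    }
}
```

The point of the sketch is only that the same index list drives both selections, so the projected schema and the projected columns stay aligned.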