scovich commented on code in PR #7833:
URL: https://github.com/apache/arrow-rs/pull/7833#discussion_r2180382588
##########
parquet-variant/src/builder.rs:
##########
@@ -240,6 +240,18 @@ struct MetadataBuilder {
}
impl MetadataBuilder {
+ /// Pre-populates the list of field names
+ fn from_field_names<'a>(field_name: impl Iterator<Item = &'a str>) -> Self
{
+ Self {
+ field_names: IndexSet::from_iter(field_name.map(|f|
f.to_string())),
+ }
+ }
+
+ /// Checks whether field names by insertion order is lexicographically
sorted
+ fn is_sorted(&self) -> bool {
+ !self.field_names.is_empty() && self.field_names.iter().is_sorted()
Review Comment:
This is an expensive check, especially for a larger dictionary. Since the
builder is append-only (can't change or remove existing entries), we could
incrementally maintain an `is_sorted` flag instead?
Although the total number of operations is the same, it should be cheaper
because the bytes of a freshly inserted field name will still be in CPU cache
(having just been hashed). So the string comparison will be a _lot_ faster than
if it takes a cache miss after not having been accessed for a long time.
<details>
The builder would default to is_sorted=true, but switches to false if any
newly inserted field name breaks ordering:
```rust
fn upsert_field_name(&mut self, field_name: &str) -> u32 {
let (id, inserted) =
self.field_names.insert_full(field_name.to_string());
if inserted {
let n = self.field_names.len();
if n >= 2 && self.field_names[n-2] > self.field_names[n-1] {
self.is_sorted = false;
}
}
id as u32
}
```
And then when initializing from an iterator, just insert each entry manually
in a loop
```rust
let mut new_self = Self::new();
for field_name in field_names {
let _ = new_self.upsert_field_name(field_name);
}
new_self
```
Actually, if we `impl Extend` for `MetadataBuilder`, then the initialization
becomes even simpler:
```rust
let mut new_self = Self::new();
new_self.extend(field_names);
new_self
```
</details>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]