jiacai2050 commented on code in PR #1372:
URL:
https://github.com/apache/incubator-horaedb/pull/1372#discussion_r1432096054
##########
analytic_engine/src/instance/flush_compaction.rs:
##########
@@ -1062,6 +1079,40 @@ impl SpaceStore {
}
}
+/// Collect the column stats from a batch of sst meta data.
+fn collect_column_stats_from_meta_datas(metas: &[SstMetaData]) -> HashMap<String, ColumnStats> {
+ let mut low_cardinality_counts: HashMap<String, usize> = HashMap::new();
+ for meta_data in metas {
+ let SstMetaData::Parquet(meta_data) = meta_data;
+ if let Some(column_values) = &meta_data.column_values {
+ for (col_idx, val_set) in column_values.iter().enumerate() {
+ let low_cardinality = val_set.is_some();
+ if low_cardinality {
+ let col_name = meta_data.schema.column(col_idx).name.clone();
+ low_cardinality_counts
+ .entry(col_name)
+ .and_modify(|v| *v += 1)
+ .or_insert(1);
+ }
+ }
+ }
+ }
+
+ // Only the column whose cardinality is low in all the metas is a
+ // low-cardinality column.
+ let low_cardinality_cols = low_cardinality_counts
Review Comment:
Besides this, we may also need to check whether the column values merged from
all SSTs are still low-cardinality.
In prod this may not be a serious problem, so we can leave a comment here to
do this in the future.
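A rough sketch of what such a follow-up check could look like, for illustration only. The `SstColumnValues` alias, the `low_cardinality_after_merge` function and the `max_distinct` threshold are hypothetical and are not part of the PR or of HoraeDB's API. It assumes each SST exposes, per column, an optional set of distinct values:

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for per-SST metadata: for each column name, the set
/// of distinct values if the column was low-cardinality in that SST, or
/// `None` otherwise.
type SstColumnValues = HashMap<String, Option<HashSet<String>>>;

/// Returns the columns that are low-cardinality in every SST and whose
/// merged distinct-value set still holds at most `max_distinct` values.
fn low_cardinality_after_merge(
    metas: &[SstColumnValues],
    max_distinct: usize,
) -> HashSet<String> {
    // Per column: (number of SSTs reporting it as low-cardinality, union of values).
    let mut merged: HashMap<&str, (usize, HashSet<&str>)> = HashMap::new();
    for meta in metas {
        for (col, values) in meta {
            if let Some(values) = values {
                let entry = merged
                    .entry(col.as_str())
                    .or_insert_with(|| (0, HashSet::new()));
                entry.0 += 1;
                entry.1.extend(values.iter().map(String::as_str));
            }
        }
    }

    merged
        .into_iter()
        // Keep a column only if every SST agreed and the union is still small.
        .filter(|(_, (count, values))| *count == metas.len() && values.len() <= max_distinct)
        .map(|(col, _)| col.to_string())
        .collect()
}
```

With two SSTs that each hold two distinct `host` values and a threshold of 3, `host` would pass the per-SST check but would be dropped here if the four values are all different.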
##########
analytic_engine/src/instance/flush_compaction.rs:
##########
@@ -956,7 +963,10 @@ impl SpaceStore {
.await
.context(ReadSstMeta)?;
- MetaData::merge(sst_metas.into_iter().map(MetaData::from), schema)
+ let column_stats = collect_column_stats_from_meta_datas(&sst_metas);
+
Review Comment:
```suggestion
```
##########
analytic_engine/src/instance/flush_compaction.rs:
##########
@@ -1062,6 +1079,40 @@ impl SpaceStore {
}
}
+/// Collect the column stats from a batch of sst meta data.
+fn collect_column_stats_from_meta_datas(metas: &[SstMetaData]) -> HashMap<String, ColumnStats> {
+ let mut low_cardinality_counts: HashMap<String, usize> = HashMap::new();
Review Comment:
Could we initialize its capacity? We have very wide tables in prod.
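One possible shape for this, as a minimal sketch rather than the actual change. The `count_low_cardinality` function and its simplified `(name, is_low_cardinality)` input are hypothetical. It assumes the column count of the widest SST is a reasonable upper bound for the map size:

```rust
use std::collections::HashMap;

/// Counts, per column name, how many SSTs flag the column as low-cardinality.
/// `columns_per_sst` is a hypothetical simplification of the SST metadata:
/// one `(column_name, is_low_cardinality)` list per SST.
fn count_low_cardinality(columns_per_sst: &[Vec<(String, bool)>]) -> HashMap<String, usize> {
    // Pre-size the map for the widest SST so inserts on wide tables do not
    // trigger repeated rehashing.
    let capacity = columns_per_sst.iter().map(Vec::len).max().unwrap_or(0);
    let mut counts = HashMap::with_capacity(capacity);

    for columns in columns_per_sst {
        for (name, low_cardinality) in columns {
            if *low_cardinality {
                *counts.entry(name.clone()).or_insert(0) += 1;
            }
        }
    }
    counts
}
```

`HashMap::with_capacity` only reserves space up front. The map still grows if more distinct column names show up than estimated.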
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]