raulcd commented on code in PR #46463:
URL: https://github.com/apache/arrow/pull/46463#discussion_r2107169594

##########
cpp/src/parquet/metadata.cc:
##########
@@ -334,8 +335,19 @@ class ColumnChunkMetaData::ColumnChunkMetaDataImpl {
     return possible_geo_stats_ != nullptr && possible_geo_stats_->is_valid();
   }
 
+  inline std::shared_ptr<EncodedStatistics> encoded_statistics() const {
+    return is_stats_set() ? possible_encoded_stats_ : nullptr;
+  }
+
   inline std::shared_ptr<Statistics> statistics() const {
-    return is_stats_set() ? possible_stats_ : nullptr;
+    if (is_stats_set()) {
+      // Because we are modifying possible_stats_ in a const method
+      const std::lock_guard<std::mutex> guard(stats_mutex_);
+      if (possible_stats_ == nullptr) {
+        possible_stats_ = MakeColumnStats(*column_metadata_, descr_);
+      }
+    }
+    return possible_stats_;

Review Comment:
   I see, thanks for the review @mapleFU!

   I've pushed two commits, one for each of the two solutions I can think of. We can either initialize `possible_stats_` in `is_stats_set` to avoid the possible race, see:
   https://github.com/apache/arrow/pull/46463/commits/da8f1cf9c1a82ef770543686937a61442f67d7ec
   or change `stats_mutex_` to a `std::recursive_mutex` and lock recursively in `statistics`, without generating `possible_stats_` in `is_stats_set`, see:
   https://github.com/apache/arrow/pull/46463/commits/0fe2ebc50ab3abe3fac8d9e076d34b02c81617a1

   I don't see any use of `recursive_mutex` in our code base, so I suppose it is something we try to avoid.

   With the first approach, my test case takes ~800 ms (previously ~400 ms), because I was avoiding the generation of `possible_stats_`. With the second approach, my test case takes ~450 ms. The test case is the same one I suggested in a different comment, in case you are curious:
   https://github.com/apache/arrow/pull/46463#discussion_r2092773708

   What are your thoughts? I would appreciate feedback on the preferred approach.
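
   For reference, here is a minimal standalone sketch of the second option (a `std::recursive_mutex` locked in both the check and the lazy initializer). It is not the Arrow code: `LazyStats`, `Stats`, `has_stats_` and `build_count()` are hypothetical stand-ins for `ColumnChunkMetaDataImpl`, `Statistics` and the real `is_stats_set` logic, and the sketch assumes `is_stats_set` takes the same lock that `statistics` already holds.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <mutex>

// Hypothetical stand-in for parquet::Statistics.
struct Stats {
  int value;
};

// Hypothetical stand-in for ColumnChunkMetaData::ColumnChunkMetaDataImpl.
class LazyStats {
 public:
  explicit LazyStats(bool has_stats) : has_stats_(has_stats) {}

  // Cheap check: takes the lock but never builds the statistics object.
  bool is_stats_set() const {
    const std::lock_guard<std::recursive_mutex> guard(mutex_);
    return has_stats_;
  }

  // Builds the statistics on first use. The lock is held across both the
  // check and the initialization. is_stats_set() locks the same mutex again
  // on the same thread, which std::recursive_mutex permits; with a plain
  // std::mutex this nested lock would be undefined behavior.
  std::shared_ptr<Stats> statistics() const {
    const std::lock_guard<std::recursive_mutex> guard(mutex_);
    if (!is_stats_set()) {
      return nullptr;
    }
    if (stats_ == nullptr) {
      ++build_count_;  // counts how often the expensive build runs
      stats_ = std::make_shared<Stats>(Stats{42});
    }
    return stats_;
  }

  int build_count() const { return build_count_.load(); }

 private:
  bool has_stats_;
  // mutable because the cache is filled from const methods.
  mutable std::recursive_mutex mutex_;
  mutable std::shared_ptr<Stats> stats_;
  mutable std::atomic<int> build_count_{0};
};
```

   In this sketch, callers that only use `is_stats_set()` never pay for building the statistics, which is the property the second commit tries to keep. The cost is the recursive lock itself.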