raulcd commented on code in PR #46463:
URL: https://github.com/apache/arrow/pull/46463#discussion_r2107169594


##########
cpp/src/parquet/metadata.cc:
##########
@@ -334,8 +335,19 @@ class ColumnChunkMetaData::ColumnChunkMetaDataImpl {
     return possible_geo_stats_ != nullptr && possible_geo_stats_->is_valid();
   }
 
+  inline std::shared_ptr<EncodedStatistics> encoded_statistics() const {
+    return is_stats_set() ? possible_encoded_stats_ : nullptr;
+  }
+
   inline std::shared_ptr<Statistics> statistics() const {
-    return is_stats_set() ? possible_stats_ : nullptr;
+    if (is_stats_set()) {
+      // Because we are modifying possible_stats_ in a const method
+      const std::lock_guard<std::mutex> guard(stats_mutex_);
+      if (possible_stats_ == nullptr) {
+        possible_stats_ = MakeColumnStats(*column_metadata_, descr_);
+      }
+    }
+    return possible_stats_;

Review Comment:
   I see, thanks for the review @mapleFU! I've pushed two commits, one for each of the two possible solutions I can think of.
   
   I suppose we can either initialize `possible_stats_` in `is_stats_set` to avoid the possible race, see:
   
https://github.com/apache/arrow/pull/46463/commits/da8f1cf9c1a82ef770543686937a61442f67d7ec
   
   or change `stats_mutex_` to a `std::recursive_mutex` so that `statistics` can lock it recursively, without generating `possible_stats_` in `is_stats_set`, see:
   
https://github.com/apache/arrow/pull/46463/commits/0fe2ebc50ab3abe3fac8d9e076d34b02c81617a1
   
   I don't see any use of `recursive_mutex` in our code base, so I suppose this is something we try to avoid.
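   
   To make the trade-off concrete, here is a rough sketch of the two options. `Stats`, `EagerMeta` and `LazyMeta` are hypothetical stand-ins, not the real `ColumnChunkMetaDataImpl`, and where each lock is taken is my assumption for illustration rather than a copy of either commit.

```cpp
#include <memory>
#include <mutex>

// Hypothetical stand-in for the decoded statistics object.
struct Stats {
  int value;
};

// Option 1 (assumed shape): is_stats_set() builds possible_stats_ under a
// plain std::mutex, so statistics() never has to mutate anything.
class EagerMeta {
 public:
  bool is_stats_set() const {
    const std::lock_guard<std::mutex> guard(stats_mutex_);
    if (possible_stats_ == nullptr) {
      possible_stats_ = std::make_shared<Stats>(Stats{42});
      ++builds_;
    }
    return true;
  }

  std::shared_ptr<Stats> statistics() const {
    // possible_stats_ was already published by is_stats_set().
    return is_stats_set() ? possible_stats_ : nullptr;
  }

  int builds() const {
    const std::lock_guard<std::mutex> guard(stats_mutex_);
    return builds_;
  }

 private:
  mutable std::mutex stats_mutex_;
  mutable std::shared_ptr<Stats> possible_stats_;
  mutable int builds_ = 0;
};

// Option 2 (assumed shape): is_stats_set() stays cheap and statistics()
// builds lazily. statistics() calls is_stats_set() while holding the lock,
// which would deadlock with std::mutex but works with std::recursive_mutex.
class LazyMeta {
 public:
  bool is_stats_set() const {
    const std::lock_guard<std::recursive_mutex> guard(stats_mutex_);
    return has_encoded_stats_;  // cheap check, no Stats object is built
  }

  std::shared_ptr<Stats> statistics() const {
    const std::lock_guard<std::recursive_mutex> guard(stats_mutex_);
    if (!is_stats_set()) {  // re-locks on the same thread
      return nullptr;
    }
    if (possible_stats_ == nullptr) {
      possible_stats_ = std::make_shared<Stats>(Stats{42});
      ++builds_;
    }
    return possible_stats_;
  }

  int builds() const {
    const std::lock_guard<std::recursive_mutex> guard(stats_mutex_);
    return builds_;
  }

 private:
  mutable std::recursive_mutex stats_mutex_;
  mutable std::shared_ptr<Stats> possible_stats_;
  mutable int builds_ = 0;
  bool has_encoded_stats_ = true;
};
```

   In this sketch, calling only `is_stats_set()` pays for a build in `EagerMeta` but not in `LazyMeta`, which is the cost difference I am trying to keep.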
   
   With the first approach, my test case takes ~800 ms (previously ~400 ms), because the earlier number relied on not generating `possible_stats_`.
   With the second approach, it takes ~450 ms.
   
   My test case is the same one I suggested in a different comment, in case you are curious:
   https://github.com/apache/arrow/pull/46463#discussion_r2092773708
   
   What are your thoughts? I would appreciate feedback on the preferred approach.
   


