Re: [PR] ORC-2036: [C++] Optimize SortedStringDictionary performance [orc]

via GitHub Thu, 11 Dec 2025 00:14:12 -0800


HuaHuaY commented on PR #2461:
URL: https://github.com/apache/orc/pull/2461#issuecomment-3640749253


   There is one more thing that can be optimized. `SortedStringDictionary` 
frequently hashes `std::string_view` keys, both in insert operations and in 
expansion (rehash) operations.
   ```cpp
   size_t SortedStringDictionary::insert(const char* str, size_t len) {
     size_t index = keyToIndex_.size();
     auto it = keyToIndex_.find(std::string_view{str, len}); // <-- hash once
     if (it != keyToIndex_.end()) {
       return it->second;
     } else {
       auto s = Arena::Create<std::string>(arena_.get(), str, len);
       // <-- hash twice
       keyToIndex_.emplace(std::string_view{s->data(), s->length()}, index);
       totalLength_ += len;
       return index;
     }
   }
   ```
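   The repeated hashing can be observed with a counting hasher. This is only a 
minimal sketch under stated assumptions: `CountingHash`, `CountingDictionary` 
and `g_hashCalls` are hypothetical names, and a `std::deque` stands in for the 
arena. The exact number of hasher calls depends on the standard library 
implementation, so treat the counts as a lower bound.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <string_view>
#include <unordered_map>

// Hypothetical instrumentation: counts how often the hasher is invoked.
inline std::size_t g_hashCalls = 0;

struct CountingHash {
  std::size_t operator()(std::string_view sv) const {
    ++g_hashCalls;
    return std::hash<std::string_view>{}(sv);
  }
};

// Simplified stand-in for SortedStringDictionary. A std::deque replaces the
// arena: it never relocates stored strings, so the string_view keys stay valid.
class CountingDictionary {
 public:
  std::size_t insert(const char* str, std::size_t len) {
    std::size_t index = keyToIndex_.size();
    auto it = keyToIndex_.find(std::string_view{str, len});  // hashes the key
    if (it != keyToIndex_.end()) {
      return it->second;
    }
    const std::string& s = storage_.emplace_back(str, len);
    keyToIndex_.emplace(std::string_view{s}, index);  // hashes the key again
    return index;
  }

 private:
  std::deque<std::string> storage_;
  std::unordered_map<std::string_view, std::size_t, CountingHash> keyToIndex_;
};
```

   Resetting `g_hashCalls` and then inserting a new key into a non-empty 
dictionary should leave the counter at two or more: one call from `find` and 
one from `emplace`, plus any calls made while the table grows.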
   There are at least two ways to avoid the repeated hashing:
   1. Use a map that supports pre-hashing, such as the F14 maps in recent 
versions of folly. First, call `prehash` to get a token. Second, use the token 
for both the find and the insert.
   2. Use a wrapper that stores the hash result. This method may be better than 
the previous one because the expansion operation also benefits. But I think 
it's a little hacky.
   ```cpp
   #include <cstddef>
   #include <functional>
   #include <utility>

   namespace orc {
     template <typename T>
     class HashWrapper {
      public:
       HashWrapper(T val)
           : val_(std::move(val)), hashVal_(std::hash<T>()(val_)) {}
   
       HashWrapper(T val, size_t hashVal)
           : val_(std::move(val)), hashVal_(hashVal) {}
   
       bool operator==(const HashWrapper& other) const {
         return val_ == other.val_;
       }
   
       size_t hash() const {
         return hashVal_;
       }
   
      private:
       T val_;
       size_t hashVal_;
     };
   }  // namespace orc
   
   namespace std {
     template <typename T>
     struct hash<orc::HashWrapper<T>> {
       size_t operator()(const orc::HashWrapper<T>& hw) const {
         return hw.hash();
       }
     };
   }  // namespace std
   ```
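   A rough sketch of how `insert` could use such a wrapper so that the string 
is hashed once per call. This is not the PR's implementation: the `sketch` 
namespace, `Dictionary`, `hashString` and `g_stringHashCalls` are hypothetical 
names, the wrapper is a trimmed copy so the example compiles on its own, and a 
`std::deque` again stands in for the arena.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <string_view>
#include <unordered_map>
#include <utility>

namespace sketch {
// Trimmed copy of the wrapper idea: a value plus its precomputed hash.
template <typename T>
class HashWrapper {
 public:
  HashWrapper(T val, std::size_t hashVal)
      : val_(std::move(val)), hashVal_(hashVal) {}
  bool operator==(const HashWrapper& other) const { return val_ == other.val_; }
  std::size_t hash() const { return hashVal_; }

 private:
  T val_;
  std::size_t hashVal_;
};
}  // namespace sketch

namespace std {
// The map only reads the stored hash; it never re-hashes the string.
template <typename T>
struct hash<sketch::HashWrapper<T>> {
  size_t operator()(const sketch::HashWrapper<T>& hw) const { return hw.hash(); }
};
}  // namespace std

namespace sketch {
// Hypothetical counter to show how often the string itself is hashed.
inline std::size_t g_stringHashCalls = 0;

inline std::size_t hashString(std::string_view sv) {
  ++g_stringHashCalls;
  return std::hash<std::string_view>{}(sv);
}

class Dictionary {
 public:
  std::size_t insert(const char* str, std::size_t len) {
    using Key = HashWrapper<std::string_view>;
    // The only place where the string contents are hashed.
    const std::size_t h = hashString(std::string_view{str, len});
    auto it = keyToIndex_.find(Key{std::string_view{str, len}, h});
    if (it != keyToIndex_.end()) {
      return it->second;
    }
    const std::size_t index = keyToIndex_.size();
    const std::string& s = storage_.emplace_back(str, len);
    // Same hash value, but the key now views the dictionary-owned copy.
    keyToIndex_.emplace(Key{std::string_view{s}, h}, index);
    return index;
  }

 private:
  std::deque<std::string> storage_;  // never relocates stored strings
  std::unordered_map<HashWrapper<std::string_view>, std::size_t> keyToIndex_;
};
}  // namespace sketch
```

   In this sketch `hashString` runs exactly once per `insert` call, whether the 
key is new or already present. Any rehash during growth only reads the stored 
`hashVal_`. The cost is one extra `size_t` per key.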


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

