HuaHuaY commented on PR #2461:
URL: https://github.com/apache/orc/pull/2461#issuecomment-3640749253
There is another thing that can be optimized. `SortedStringDictionary`
hashes the same `std::string_view` contents repeatedly: twice when inserting
a new key, and again during expansion operations.
```cpp
size_t SortedStringDictionary::insert(const char* str, size_t len) {
  size_t index = keyToIndex_.size();
  auto it = keyToIndex_.find(std::string_view{str, len});  // <-- hash once
  if (it != keyToIndex_.end()) {
    return it->second;
  } else {
    auto s = Arena::Create<std::string>(arena_.get(), str, len);
    keyToIndex_.emplace(std::string_view{s->data(), s->length()}, index);  // <-- hash twice
    totalLength_ += len;
    return index;
  }
}
```
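To make the repeated hashing visible, here is a minimal sketch that mimics the find-then-emplace shape of `insert` with a plain `std::unordered_map`. `CountingHash`, `CountingMap`, `findThenEmplace`, and `hashCallsForNewKey` are hypothetical names used only for illustration; they are not part of ORC, and the arena is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string_view>
#include <unordered_map>

// Hypothetical instrumentation: counts how often a key is hashed.
struct CountingHash {
  static inline std::size_t calls = 0;
  std::size_t operator()(std::string_view sv) const {
    ++calls;
    return std::hash<std::string_view>{}(sv);
  }
};

using CountingMap = std::unordered_map<std::string_view, std::size_t, CountingHash>;

// Same find-then-emplace shape as insert() above, without the arena.
// Keys must outlive the map; string literals are used here for simplicity.
std::size_t findThenEmplace(CountingMap& map, std::string_view key) {
  std::size_t index = map.size();
  auto it = map.find(key);  // hashes the key
  if (it != map.end()) {
    return it->second;
  }
  map.emplace(key, index);  // hashes the same bytes again
  return index;
}

// Returns how many times the hasher ran while inserting one new key.
std::size_t hashCallsForNewKey() {
  CountingMap map;
  // Seed one entry first: a lookup in an empty map may skip hashing.
  findThenEmplace(map, "alpha");
  CountingHash::calls = 0;
  findThenEmplace(map, "beta");
  return CountingHash::calls;
}
```

`hashCallsForNewKey()` is expected to report at least two calls for a new key. The exact number depends on the standard library, for example on whether it caches hash codes in its nodes.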
There are at least two ways to optimize this behavior:
1. Use a map that supports pre-hashing, such as recent versions of folly's
F14 maps. First, call `prehash` to get a token. Second, use the token for both
the find and the insert.
2. Use a wrapper that stores the hash result. This may be better than the
first approach because the expansion operation also benefits, but I think
it's a little hacky.
```cpp
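#include <cstddef>
#include <functional>
#include <utility>

namespace orc {

  // Stores a value together with its hash so the hash is computed only once.
  template <typename T>
  class HashWrapper {
   public:
    // Computes the hash from the value. val_ is declared before hashVal_,
    // so it is already initialized when hashVal_ reads it.
    HashWrapper(T val) : val_(std::move(val)), hashVal_(std::hash<T>()(val_)) {}

    // Reuses a hash that the caller has already computed for the same contents.
    HashWrapper(T val, size_t hashVal) : val_(std::move(val)), hashVal_(hashVal) {}

    bool operator==(const HashWrapper& other) const {
      return val_ == other.val_;
    }

    size_t hash() const {
      return hashVal_;
    }

   private:
    T val_;
    size_t hashVal_;
  };

}  // namespace orc

namespace std {

  // Lets unordered containers use the cached hash instead of rehashing val_.
  template <typename T>
  struct hash<orc::HashWrapper<T>> {
    size_t operator()(const orc::HashWrapper<T>& hw) const {
      return hw.hash();
    }
  };

}  // namespace std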
```
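As a rough sketch of how the wrapper might be used in the insert path, the block below hashes the incoming bytes once and reuses that value for both the lookup key and the stored key. Everything in the `sketch` namespace is hypothetical: the wrapper is a compact copy so the block compiles on its own, `CachedHash` replaces the `std::hash` specialization to keep it short, and a `std::deque<std::string>` stands in for the arena.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <string_view>
#include <unordered_map>
#include <utility>

namespace sketch {  // hypothetical namespace, not part of ORC

// Compact copy of the wrapper above so this block compiles on its own.
template <typename T>
class HashWrapper {
 public:
  HashWrapper(T val, std::size_t hashVal) : val_(std::move(val)), hashVal_(hashVal) {}
  bool operator==(const HashWrapper& other) const { return val_ == other.val_; }
  std::size_t hash() const { return hashVal_; }

 private:
  T val_;
  std::size_t hashVal_;
};

// Hasher that only returns the cached value; it never touches the bytes.
struct CachedHash {
  template <typename T>
  std::size_t operator()(const HashWrapper<T>& hw) const {
    return hw.hash();
  }
};

using Key = HashWrapper<std::string_view>;
using KeyMap = std::unordered_map<Key, std::size_t, CachedHash>;

// `storage` stands in for the arena: std::deque keeps references to existing
// elements valid across emplace_back, so stored views do not dangle.
std::size_t insertOnce(KeyMap& map, std::deque<std::string>& storage,
                       std::string_view str) {
  const std::size_t h = std::hash<std::string_view>{}(str);  // the only real hash
  std::size_t index = map.size();
  auto it = map.find(Key{str, h});
  if (it != map.end()) {
    return it->second;
  }
  const std::string& owned = storage.emplace_back(str);
  // Same bytes as `str`, so the cached hash is still valid for the stored key.
  map.emplace(Key{std::string_view{owned}, h}, index);
  return index;
}

}  // namespace sketch
```

Because the map's hasher only reads the cached value, any rehash during expansion should also avoid touching the string bytes, which is the benefit mentioned for this approach.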