wgtmac commented on code in PR #2337:
URL: https://github.com/apache/orc/pull/2337#discussion_r2221335595
##########
c++/src/ColumnWriter.cc:
##########
@@ -939,6 +939,14 @@ namespace orc {
std::unique_ptr<std::string> data;
};
+ struct DictEntryWithIndex {
+ DictEntryWithIndex(const char* str, size_t len, size_t index)
+ : entry(str, len), index(index) {}
+
+ DictEntry entry;
+ size_t index;
+ };
+
SortedStringDictionary() : totalLength_(0) {
Review Comment:
Can we add `src/Dictionary.hh` and `src/Dictionary.cc` to abstract the
dictionary implementation? We can use a macro and the pimpl idiom to avoid
virtual functions. The reason is that adding a new dependency may force
downstream projects (e.g. Apache Arrow) to manage an extra dependency as well,
for marginal benefit. We can add a new CMake option `ORC_ENABLE_SPARSEHASH`
that is `OFF` by default and turn it on in our CI settings.
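To illustrate the suggestion, here is a minimal sketch of a pimpl-based dictionary whose backend is picked by a macro at compile time. Everything in it is an assumption made for illustration: the class name `StringDictionary`, the macro `ORC_SKETCH_USE_HASH_BACKEND` (standing in for whatever definition `ORC_ENABLE_SPARSEHASH` would set), and the use of `std::unordered_map` and `std::map` as stand-ins for the optional and the default backend. It is not the code in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <unordered_map>

// ---- part that would live in a header such as src/Dictionary.hh ----
// Hypothetical class name; the backend type never appears in this part.
class StringDictionary {
 public:
  StringDictionary();
  ~StringDictionary();  // defined below, where Impl is a complete type

  // Inserts the string if it is absent and returns its insertion-order index.
  std::size_t insert(const char* str, std::size_t len);
  std::size_t size() const;

 private:
  struct Impl;                  // pimpl: definition is hidden from callers
  std::unique_ptr<Impl> impl_;  // no virtual functions involved
};

// ---- part that would live in a source file such as src/Dictionary.cc ----
// Hypothetical macro; std containers stand in for the real backends.
#ifdef ORC_SKETCH_USE_HASH_BACKEND
using Backend = std::unordered_map<std::string, std::size_t>;
#else
using Backend = std::map<std::string, std::size_t>;
#endif

struct StringDictionary::Impl {
  Backend entries;
};

StringDictionary::StringDictionary() : impl_(new Impl()) {}
StringDictionary::~StringDictionary() = default;

std::size_t StringDictionary::insert(const char* str, std::size_t len) {
  assert(str != nullptr);
  // The size is read before the insertion happens, so a new entry gets the
  // next free index; an existing entry keeps the index it already has.
  auto result =
      impl_->entries.emplace(std::string(str, len), impl_->entries.size());
  return result.first->second;
}

std::size_t StringDictionary::size() const {
  return impl_->entries.size();
}
```

Because `Impl` is defined only in the source-file part, callers never see the backend type, so a build that leaves the option `OFF` would not need the extra headers.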
##########
c++/src/ColumnWriter.cc:
##########
@@ -962,8 +973,13 @@ namespace orc {
void clear();
private:
+ struct LessThan {
+ bool operator()(const DictEntryWithIndex& l, const DictEntryWithIndex& r) {
+ return l.entry.data < r.entry.data; // use std::string's operator<
Review Comment:
Originally I used `memcmp` to avoid wrong results when a string contains
`\0`. Does `std::string`'s `operator<` handle this case correctly?
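As a small check of the question, the sketch below compares a `memcmp`-based ordering with `operator<` on `std::string` values. It assumes both operands are `std::string` objects built with an explicit length, as `entry(str, len)` in the hunk above suggests; the helper names `lessByMemcmp`, `lessByString`, and `checkEmbeddedNul` are hypothetical. Under that assumption `operator<` compares the full stored length through `std::char_traits<char>::compare`, so an embedded `\0` does not cut the comparison short. If the operands were pointers rather than strings, `<` would compare addresses instead, which this sketch does not cover.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Ordering over (pointer, length) pairs: compare the common prefix byte by
// byte, then let the shorter input sort first when the prefixes are equal.
bool lessByMemcmp(const char* l, std::size_t llen,
                  const char* r, std::size_t rlen) {
  int cmp = std::memcmp(l, r, std::min(llen, rlen));
  return cmp != 0 ? cmp < 0 : llen < rlen;
}

// Ordering via std::string's operator<, which uses size(), not strlen().
bool lessByString(const std::string& l, const std::string& r) {
  return l < r;
}

// Hypothetical self-check for strings that contain an embedded NUL byte.
void checkEmbeddedNul() {
  const std::string a("a\0b", 3);
  const std::string b("a\0c", 3);

  // Both orderings look past the NUL and agree that a sorts before b.
  assert(lessByString(a, b) && !lessByString(b, a));
  assert(lessByMemcmp(a.data(), a.size(), b.data(), b.size()));
  assert(!lessByMemcmp(b.data(), b.size(), a.data(), a.size()));

  // The pitfall is construction without a length: the C-string constructor
  // stops at the first NUL, so the bytes after it are lost before any
  // comparison takes place.
  const std::string truncated("a\0b");
  assert(truncated.size() == 1);
}
```

The point of failure, if any, would therefore be where the string is constructed, not the comparison itself.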
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.