Ryan19929 commented on PR #64667:
URL: https://github.com/apache/doris/pull/64667#issuecomment-4863101175
> Since this is modeled after Lucene Kuromoji, have you checked how the
Doris implementation performs in practice? A small benchmark for indexing
throughput would be helpful.
## [Non-blocking] Viterbi hot path allocates per byte position; +16%
measured with a small change
here is what I measured (Release build, single pipeline task, sql cache off,
50×1MB natural text, `sum(length(TOKENIZE(...)))`, best of 4):
| parser | corpus | 50MB time | throughput |
|----------------------------|--------|-----------|-----------------|
| kuromoji (this PR) | ja | 19.3 s | 2.6 MB/s |
| kuromoji (prototype below) | ja | 16.2 s | 3.1 MB/s (+16%) |
| icu | ja | 11.2 s | 4.5 MB/s |
| ik | zh | 11.0 s | 4.5 MB/s |
| chinese | zh | 6.7 s | 7.5 MB/s |
To be fair, the other rows are not apples-to-apples baselines: icu does much
lighter work on Japanese than a full lattice/Viterbi morphological analysis,
and ik/chinese run on a Chinese corpus, so some gap is expected and inherent to
what kuromoji does. I'm only including them as a rough sense of scale —
kuromoji is the slowest builtin analyzer but stays within the same order of
magnitude, so I don't see this as blocking.
That said, a profile shows a good chunk of the time goes to avoidable heap
allocations, so there are two cheap wins in `KuromojiViterbi::segment()`:
- `std::vector<std::vector<int>> ending_at(n + 1)`
(`kuromoji_viterbi.cpp:124`): one vector object per document byte (~1M
constructions for a 1MB doc) plus one heap allocation per reachable position.
- `matches` declared inside the per-position loop
(`kuromoji_viterbi.cpp:170`): one malloc/free per position;
`common_prefix_search()` already clears it, so it can be hoisted.
Prototype of exactly these two changes: 19.3s → 16.2s, byte-identical
tokenizer output on the 50MB corpus. The `<` → `<=` flip preserves the original
tie-break (chain iterates newest-first, original vector oldest-first).
```diff
--- a/be/src/storage/index/inverted/analyzer/kuromoji/kuromoji_viterbi.cpp
+++ b/be/src/storage/index/inverted/analyzer/kuromoji/kuromoji_viterbi.cpp
@@ -121,18 +121,25 @@ void KuromojiViterbi::segment(std::string_view text,
std::vector<KuromojiMorphem
}
std::vector<VNode> nodes;
- std::vector<std::vector<int>> ending_at(n + 1); // node indices ending
at each byte position
+ // Intrusive per-end-position chain: end_head[e] is the most recent
node index
+ // ending at byte position e, end_next[i] links to the previous one.
This avoids
+ // allocating n+1 std::vector objects per document.
+ std::vector<int32_t> end_head(n + 1, -1);
+ std::vector<int32_t> end_next;
// BOS (index 0): ends at position 0, context id 0, zero cost.
nodes.push_back(VNode {0, 0, 0, 0, 0, false, 0, 0, -1});
- ending_at[0].push_back(0);
+ end_next.push_back(-1);
+ end_head[0] = 0;
// Add a node and relax it against all nodes ending at its start
position.
auto add_node = [&](uint32_t s, uint32_t e, int16_t lid, int16_t rid,
int16_t wcost, bool known,
uint32_t wid) {
int64_t best = KMJ_INF;
int best_prev = -1;
- for (int pe : ending_at[s]) {
+ // Chain is iterated newest-first; "<=" keeps the oldest node on
cost ties,
+ // matching the original insertion-order "<" selection exactly.
+ for (int pe = end_head[s]; pe >= 0; pe = end_next[pe]) {
const VNode& pv = nodes[static_cast<std::size_t>(pe)];
if (pv.total_cost >= KMJ_INF) {
continue;
@@ -140,7 +147,7 @@ void KuromojiViterbi::segment(std::string_view text,
std::vector<KuromojiMorphem
const int64_t c =
pv.total_cost +
_dict.connection_cost(static_cast<uint32_t>(pv.right_id),
static_cast<uint32_t>(lid));
- if (c < best) {
+ if (c <= best) {
best = c;
best_prev = pe;
}
@@ -154,12 +161,15 @@ void KuromojiViterbi::segment(std::string_view text,
std::vector<KuromojiMorphem
const auto idx = static_cast<int>(nodes.size());
nodes.push_back(
VNode {s, e, lid, rid, wcost, known, wid, best + wcost +
penalty, best_prev});
- ending_at[e].push_back(idx);
+ end_next.push_back(end_head[e]);
+ end_head[e] = idx;
};
uint32_t pos = 0;
+ // Reused across positions; common_prefix_search clears it on entry.
+ std::vector<KuromojiDictionary::PrefixMatch> matches;
while (pos < n) {
- if (ending_at[pos].empty()) {
+ if (end_head[pos] < 0) {
pos += decode_utf8(text, pos).len; // unreachable boundary; skip
continue;
}
@@ -167,7 +177,6 @@ void KuromojiViterbi::segment(std::string_view text,
std::vector<KuromojiMorphem
const auto before = nodes.size();
// System-dictionary words (common-prefix search).
- std::vector<KuromojiDictionary::PrefixMatch> matches;
_dict.common_prefix_search(text.data() + pos, n - pos, &matches);
bool any_known = false;
for (const auto& mt : matches) {
@@ -219,14 +228,14 @@ void KuromojiViterbi::segment(std::string_view text,
std::vector<KuromojiMorphem
// EOS: best node ending at n connected to the EOS context (id 0).
int64_t best = KMJ_INF;
int best_prev = -1;
- for (int pe : ending_at[n]) {
+ for (int pe = end_head[n]; pe >= 0; pe = end_next[pe]) {
const VNode& pv = nodes[static_cast<std::size_t>(pe)];
if (pv.total_cost >= KMJ_INF) {
continue;
}
const int64_t c =
pv.total_cost +
_dict.connection_cost(static_cast<uint32_t>(pv.right_id), 0);
- if (c < best) {
+ if (c <= best) {
best = c;
best_prev = pe;
}
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]