airborne12 commented on code in PR #64667:
URL: https://github.com/apache/doris/pull/64667#discussion_r3489990093


##########
be/CMakeLists.txt:
##########
@@ -318,6 +318,12 @@ install(DIRECTORY
     ${BASE_DIR}/dict/pinyin
     DESTINATION ${OUTPUT_DIR}/dict)
 
+# Japanese kuromoji dictionary
+install(DIRECTORY
+    ${BASE_DIR}/dict/kuromoji
+    DESTINATION ${OUTPUT_DIR}/dict
+    OPTIONAL)

Review Comment:
   [Blocking] This install rule does not guarantee that the runtime Kuromoji 
dictionary is present. The generated `dict/kuromoji/*.bin` files are 
ignored/not committed, `kuromoji_build_dict` is `EXCLUDE_FROM_ALL`, and 
`kuromoji_dict` is only a manual target. A default package can therefore ship 
only `dict/kuromoji/README.md`, while the BE later loads 
`${inverted_index_dict_path}/kuromoji`.
   
   Please make package/install depend on dictionary generation and fail if 
`system.bin`, `matrix.bin`, `chardef.bin`, and `unkdict.bin` are missing. 
`OPTIONAL` should not hide a missing required analyzer artifact.



##########
be/src/storage/index/inverted/analyzer/kuromoji/KuromojiAnalyzer.h:
##########
@@ -0,0 +1,73 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <memory>
+#include <string>
+
+#include "common/logging.h"
+#include "storage/index/inverted/analyzer/kuromoji/KuromojiTokenizer.h"
+#include "storage/index/inverted/analyzer/kuromoji/dict/kuromoji_dictionary.h"
+
+namespace doris::segment_v2 {
+
+class KuromojiAnalyzer : public Analyzer {
+public:
+    KuromojiAnalyzer() {
+        _lowercase = true;
+        _ownReader = false;
+    }
+    ~KuromojiAnalyzer() override = default;
+
+    bool isSDocOpt() override { return true; }
+
+    // Loads (once, process-wide) the IPADIC dictionary from `dictPath`. If it 
is
+    // unavailable the tokenizer degrades to a per-codepoint split (logged), 
rather
+    // than failing index/query.
+    void initDict(const std::string& dictPath) override {

Review Comment:
   [Blocking] Missing or corrupt dictionary should not silently fall back for 
the production `kuromoji` parser. If indexing runs with `dict_ == nullptr`, 
segments are written with per-codepoint tokens; after the dictionary is 
installed/reloaded, query analyzers can produce real Kuromoji tokens, so old 
and new segments have different tokenization semantics.
   
   Please fail analyzer creation/index/query with a clear error when the 
Kuromoji dictionary cannot be loaded. If fallback tokenization is desired, 
expose it as an explicit parser/mode persisted in index metadata.



##########
be/src/storage/index/inverted/inverted_index_parser.cpp:
##########
@@ -89,8 +93,13 @@ std::string get_parser_mode_string_from_properties(
     if (parser_it == properties.end()) {
         parser_it = properties.find(INVERTED_INDEX_PARSER_KEY_ALIAS);
     }
-    if (parser_it != properties.end() && parser_it->second == 
INVERTED_INDEX_PARSER_IK) {
-        return INVERTED_INDEX_PARSER_SMART;
+    if (parser_it != properties.end()) {
+        if (parser_it->second == INVERTED_INDEX_PARSER_IK) {
+            return INVERTED_INDEX_PARSER_SMART;
+        }
+        if (parser_it->second == INVERTED_INDEX_PARSER_KUROMOJI) {
+            return INVERTED_INDEX_PARSER_KUROMOJI_SEARCH;

Review Comment:
   [Major] BE now defaults omitted `parser_mode` for `parser=kuromoji` to 
`search`, but FE `InvertedIndexProperties.getInvertedIndexParserMode()` still 
only special-cases IK and otherwise returns `coarse_grained`. Match predicate 
thrift serialization uses the FE helper, so FE and BE disagree on the effective 
default.
   
   Please update the FE default helper to return `search` for Kuromoji and add 
FE coverage for a Kuromoji index/query without an explicit `parser_mode`.



##########
be/src/storage/index/inverted/analyzer/kuromoji/dict/kuromoji_dictionary.cpp:
##########
@@ -0,0 +1,202 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "storage/index/inverted/analyzer/kuromoji/dict/kuromoji_dictionary.h"
+
+#include <fcntl.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include <cstring>
+#include <map>
+#include <mutex>
+
+#include "common/logging.h"
+
+namespace doris::segment_v2::inverted_index::kuromoji {
+
+MappedFile::~MappedFile() {
+    if (_data != nullptr) {
+        ::munmap(const_cast<uint8_t*>(_data), _size);
+        _data = nullptr;
+        _size = 0;
+    }
+}
+
+Status MappedFile::open(const std::string& path) {
+    int fd = ::open(path.c_str(), O_RDONLY);
+    if (fd < 0) {
+        return Status::IOError("kuromoji dict: cannot open {}", path);
+    }
+    struct stat st {};
+    if (::fstat(fd, &st) != 0 || st.st_size <= 0) {
+        ::close(fd);
+        return Status::IOError("kuromoji dict: cannot stat {}", path);
+    }
+    auto bytes = static_cast<std::size_t>(st.st_size);
+    void* m = ::mmap(nullptr, bytes, PROT_READ, MAP_PRIVATE, fd, 0);
+    ::close(fd);
+    if (m == MAP_FAILED) {
+        return Status::IOError("kuromoji dict: mmap failed for {}", path);
+    }
+    _data = static_cast<const uint8_t*>(m);
+    _size = bytes;
+    return Status::OK();
+}
+
+Status KuromojiDictionary::check_header(const uint8_t* p, std::size_t size, 
KmjFileKind kind) {
+    if (size < sizeof(KmjFileHeader)) {
+        return Status::Corruption("kuromoji dict: file too small");
+    }
+    KmjFileHeader h {};
+    std::memcpy(&h, p, sizeof(h));
+    if (std::memcmp(h.magic, KMJ_MAGIC, sizeof(h.magic)) != 0) {
+        return Status::Corruption("kuromoji dict: bad magic");
+    }
+    if (h.format_version != KMJ_FORMAT_VERSION) {
+        return Status::Corruption("kuromoji dict: version {} != {}", 
h.format_version,
+                                  KMJ_FORMAT_VERSION);
+    }
+    if (h.file_kind != static_cast<uint32_t>(kind)) {
+        return Status::Corruption("kuromoji dict: wrong file_kind {}", 
h.file_kind);
+    }
+    if (h.file_size != size) {
+        return Status::Corruption("kuromoji dict: file_size {} != actual {}", 
h.file_size, size);
+    }
+    return Status::OK();
+}
+
+std::string_view KuromojiDictionary::feature_at(const uint8_t* blob, uint64_t 
blob_bytes,
+                                                uint32_t off) {
+    if (off == KMJ_NO_FEATURE || blob == nullptr || static_cast<uint64_t>(off) 
+ 2 > blob_bytes) {
+        return {};
+    }
+    auto len = static_cast<uint16_t>(static_cast<uint16_t>(blob[off]) |
+                                     static_cast<uint16_t>(blob[off + 1] << 
8));
+    if (static_cast<uint64_t>(off) + 2 + len > blob_bytes) {
+        return {};
+    }
+    return {reinterpret_cast<const char*>(blob + off + 2), len};
+}
+
+Status KuromojiDictionary::map_system(const std::string& path) {
+    RETURN_IF_ERROR(_system_map.open(path));
+    const uint8_t* p = _system_map.data();
+    RETURN_IF_ERROR(check_header(p, _system_map.size(), KMJ_KIND_SYSTEM));
+    KmjSystemHeader s {};
+    std::memcpy(&s, p + sizeof(KmjFileHeader), sizeof(s));

Review Comment:
   [Blocking] `check_header()` only validates the common header, but the loader 
then trusts all sub-header offsets/counts and installs mmap pointers from them. 
A header-valid but truncated/corrupt `*.bin` can make these `memcpy`/pointer 
assignments and later accessors read past the mmap.
   
   Please validate each sub-header before exposing pointers: `sizeof(header + 
subheader)`, every `offset + count * sizeof(T)` range, integer overflow, trie 
byte alignment, `class_count == CAT_CLASS_COUNT`, matrix dimensions/cell count, 
run ranges into entry arrays, feature offsets, and word left/right IDs against 
matrix bounds. Return `Status::Corruption` for invalid artifacts.



##########
be/src/storage/index/inverted/analyzer/kuromoji/KuromojiMode.h:
##########
@@ -0,0 +1,41 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <string>
+
+namespace doris::segment_v2 {
+
+// Segmentation mode, mirroring Lucene's JapaneseTokenizer.Mode. Normal returns
+// the minimum-cost segmentation. Search additionally decomposes long compounds
+// into their shorter parts (via a length-based cost penalty) for better search
+// recall. Extended applies the Search penalty and also splits unknown
+// (out-of-vocabulary) words into per-character unigrams.
+enum class KuromojiMode { Normal, Search, Extended };
+
+inline KuromojiMode kuromoji_mode_from_string(const std::string& mode) {
+    if (mode == "normal") {
+        return KuromojiMode::Normal;
+    }
+    if (mode == "extended") {
+        return KuromojiMode::Extended;
+    }
+    return KuromojiMode::Search; // default (matches OpenSearch/Lucene)

Review Comment:
   [Major] This silently maps any unknown Kuromoji mode to `Search`. That makes 
`TOKENIZE(..., '"parser"="kuromoji","parser_mode"="bogus"')` accepted and 
executed as `search`, while index DDL rejects the same value.
   
   Please make mode parsing return an error/status for unknown values, and 
apply the same Kuromoji property validation to `TOKENIZE` as DDL/index creation.



##########
regression-test/suites/inverted_index_p0/analyzer/test_japanese_analyzer.groovy:
##########
@@ -0,0 +1,76 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+suite("test_japanese_analyzer", "p0") {
+    def tableName = "test_japanese_analyzer"
+
+    def backendId_to_backendIP = [:]
+    def backendId_to_backendHttpPort = [:]
+    getBackendIpHttpPort(backendId_to_backendIP, backendId_to_backendHttpPort)
+    def set_be_config = { key, value ->
+        for (String backend_id : backendId_to_backendIP.keySet()) {
+            update_be_config(backendId_to_backendIP.get(backend_id),
+                             backendId_to_backendHttpPort.get(backend_id), 
key, value)
+        }
+    }
+
+    sql "DROP TABLE IF EXISTS ${tableName}"
+    // kuromoji is disabled by default; enable it for this test.
+    set_be_config("enable_kuromoji_analyzer", "true")
+    try {
+        sql """
+          CREATE TABLE ${tableName} (
+            `id` int(11) NULL COMMENT "",
+            `content` text NULL COMMENT "",
+            INDEX content_idx (`content`) USING INVERTED PROPERTIES("parser" = 
"kuromoji", "parser_mode" = "search") COMMENT '',
+          ) ENGINE=OLAP
+          DUPLICATE KEY(`id`)
+          COMMENT "OLAP"
+          DISTRIBUTED BY RANDOM BUCKETS 1
+          PROPERTIES (
+            "replication_allocation" = "tag.location.default: 1"
+          );
+        """
+
+        sql """ INSERT INTO ${tableName} VALUES (1, "東京都に住んでいます"); """
+        sql """ INSERT INTO ${tableName} VALUES (2, "私は寿司が好きです"); """
+        sql """ INSERT INTO ${tableName} VALUES (3, "Apache Doris は高速です"); """
+        sql "sync"
+
+        // The kuromoji dictionary is not shipped in the p0 package, so the

Review Comment:
   [Blocking] This regression is proving the fallback path, not the feature 
added by this PR. Because the p0 package does not ship the dictionary, CI can 
pass while the real Kuromoji/IPADIC morphology path is never exercised.
   
   Please add a required CI/regression path that generates/ships the dictionary 
and asserts real morphology behavior, e.g. search-mode decompound (`東京都` 
matching `東京`), base-form/POS behavior, and extended-mode unknown-word 
splitting. Keep fallback coverage separate if fallback remains explicit.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to