github-actions[bot] commented on code in PR #63633:
URL: https://github.com/apache/doris/pull/63633#discussion_r3360752364


##########
be/src/storage/index/inverted/spimi/query_term_enum.cpp:
##########
@@ -0,0 +1,302 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "storage/index/inverted/spimi/query_term_enum.h"
+
+#include "common/logging.h"
+
+namespace doris::segment_v2::inverted_index::spimi {
+
+namespace {
+
+// Inline byte cursor mirroring the one in `term_dict_reader.cpp`.
+// Kept local rather than extracted to a shared header because each
+// reader has slightly different needs (this one needs to be
+// re-entrant via the enclosing class's `_pos` field, while
+// TermDictReader uses a local stack instance).
+class Cursor {
+public:
+    Cursor(const uint8_t* data, size_t len, size_t& pos) : _data(data), 
_len(len), _pos(pos) {}
+
+    uint8_t ReadByte() {
+        if (_pos >= _len) [[unlikely]] {
+            _CLTHROWA(CL_ERR_IO, "SPIMI .tis read past end of buffer");
+        }
+        return _data[_pos++];
+    }
+    int32_t ReadInt32BE() {
+        const uint32_t b0 = ReadByte();
+        const uint32_t b1 = ReadByte();
+        const uint32_t b2 = ReadByte();
+        const uint32_t b3 = ReadByte();
+        return static_cast<int32_t>((b0 << 24) | (b1 << 16) | (b2 << 8) | b3);
+    }
+    int64_t ReadInt64BE() {
+        const uint64_t hi = static_cast<uint32_t>(ReadInt32BE());
+        const uint64_t lo = static_cast<uint32_t>(ReadInt32BE());
+        return static_cast<int64_t>((hi << 32) | lo);
+    }
+    int32_t ReadVInt() {
+        uint32_t v = 0;
+        uint32_t shift = 0;
+        while (true) {
+            const uint8_t b = ReadByte();
+            v |= static_cast<uint32_t>(b & 0x7FU) << shift;

Review Comment:
   `SpimiQueryTermEnum` is driven by on-disk `.tis` bytes during query-time 
term enumeration, but these VInt/VLong decoders do not bound `shift`. A corrupt 
V4 index containing enough continuation bytes reaches 
`static_cast<uint32_t>(...) << shift` with `shift >= 32` here, or `uint64_t << 
shift` with `shift >= 64` in `ReadVLong()`, which is undefined behavior in 
release builds before `ReadByte()` can report end-of-buffer. The production 
lookup decoder in `TermDictReader::ByteCursor` already has the required 
overflow checks; please apply the same bounds here so malformed V4 files return 
a corruption error instead of executing UB during term enumeration.



##########
be/src/storage/index/inverted/inverted_index_writer.cpp:
##########
@@ -630,12 +996,50 @@ Status InvertedIndexColumnWriter<field_type>::finish() {
                                       "debug point: test throw error in 
fulltext "
                                       "index writer");
                         });
+                if (_spimi_writer != nullptr && _spimi_writer->HasBuffer()) {
+                    // Delegate the entire SPIMI segment emission to the
+                    // SpimiIndexWriter facade. It handles saturated-buffer
+                    // checks, file naming, IndexOutput creation, direct
+                    // emit vs spill-merge path selection, FINALLY_CLOSE,

Review Comment:
   `SpimiIndexWriter::Finish()` now catches `...` and rethrows the original 
exception after closing its own streams. If that original exception is 
`std::bad_alloc` or another non-`CLuceneError`/non-`doris::Exception`, this 
outer `finish()` does not catch it, so execution skips the `FINALLY({ ... 
FINALLY_CLOSE(_dir) ... })` block below and escapes a `Status`-returning write 
path as a C++ exception. This leaves the null-bitmap/directory cleanup path 
inconsistent and can abort the load instead of returning an error `Status`. 
Please catch `std::exception`/`...` here as well (or make 
`SpimiIndexWriter::Finish()` convert unknown exceptions to `doris::Exception`) 
so every SPIMI finish failure still flows through the existing FINALLY cleanup 
and Status conversion.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to