github-actions[bot] commented on code in PR #63363:
URL: https://github.com/apache/doris/pull/63363#discussion_r3360814521


##########
be/src/util/dns_cache.cpp:
##########
@@ -97,26 +151,57 @@ Status DNSCache::_update(const std::string& hostname) {
         cache[hostname] = real_ip;
         LOG(INFO) << "update hostname " << hostname << "'s ip to " << real_ip;
     }
+    // Read failure_count under the same lock we already hold, so _refresh_once
+    // does not need a second lock acquisition to decide on eviction.
+    if (out_failures) {
+        auto fc_it = failure_count.find(hostname);
+        *out_failures = fc_it != failure_count.end() ? fc_it->second : 0;
+    }
     return Status::OK();
 }
 
+void DNSCache::_refresh_once() {
+    std::unordered_set<std::string> keys;
+    {
+        std::shared_lock<std::shared_mutex> lock(mutex);
+        std::transform(cache.begin(), cache.end(), std::inserter(keys, 
keys.end()),
+                       [](const auto& pair) { return pair.first; });
+    }
+    for (auto& key : keys) {
+        uint32_t failures = 0;
+        Status st = _update(key, &failures);
+        if (!st.ok()) {
+            // _update only returns an error when _resolve_hostname returns an
+            // empty string, which happens only if the hostname has never been
+            // successfully resolved (no cached IP to fall back to).  Keys in
+            // the refresh loop come from `cache`, so they all have a prior IP;
+            // during DNS failure _resolve_hostname returns that cached IP and
+            // _update returns OK.  This branch is therefore effectively dead
+            // under normal refresh semantics — the eviction logic below 
handles
+            // the long-running failure case instead.
+            LOG(WARNING) << "Failed to update DNS cache for hostname " << key 
<< ": "
+                         << st.to_string();
+        }
+        // Evict hostnames that have failed to resolve for too long.
+        // This avoids two pathological symptoms after a backend is dropped
+        // from the cluster and its DNS record is removed:
+        //   1) be.WARNING gets flooded with `failed to get ip from host`.
+        //   2) brpc keeps re-using the stale IP from cache, producing
+        //      `Fail to wait EPOLLOUT ... Connection timed out`.
+        int32_t threshold = config::dns_cache_max_consecutive_failures;
+        if (threshold > 0 && failures >= static_cast<uint32_t>(threshold)) {
+            LOG(WARNING) << "Evicting hostname " << key << " from DNS cache 
after " << failures
+                         << " consecutive resolution failures";
+            _erase(key);
+        }

Review Comment:
   Adding `_erase()` makes the final read in `get()` unsafe: a caller can miss 
the cache, `_update()` can resolve and insert the IP, then a refresh cycle with 
`dns_cache_max_consecutive_failures=1` can erase this hostname before `get()` 
reacquires its shared lock and executes `*ip = cache[hostname]`. That 
`operator[]` mutates the unordered_map while only a `shared_lock` is held and 
also reintroduces the hostname with an empty IP while returning `Status::OK`. 
Please avoid `operator[]` on the read path and make `get()` consume the IP 
returned by `_update()` or re-check with `find()` under the proper lock/error 
path after eviction. This is separate from the already-raised stale fallback 
reinsert in `_update()`.



##########
be/test/util/dns_cache_test.cpp:
##########
@@ -0,0 +1,254 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "util/dns_cache.h"
+
+#include <gtest/gtest-message.h>
+#include <gtest/gtest-test-part.h>
+
+#include "common/config.h"
+#include "gtest/gtest_pred_impl.h"
+
+namespace doris {
+
+class DNSCacheTest : public testing::Test {
+public:
+    DNSCacheTest() = default;
+    ~DNSCacheTest() override = default;
+
+protected:
+    void SetUp() override {
+        _saved_threshold = config::dns_cache_max_consecutive_failures;
+        _saved_log_every = config::dns_cache_log_every_n_failures;
+    }
+
+    void TearDown() override {
+        config::dns_cache_max_consecutive_failures = _saved_threshold;
+        config::dns_cache_log_every_n_failures = _saved_log_every;
+    }
+
+    // Build a resolver whose failure behaviour can be flipped at runtime.
+    static DNSCache::Resolver make_resolver(bool* should_fail, const 
std::string& ip = "1.2.3.4") {
+        return [should_fail, ip](const std::string&, std::string& out, bool) 
-> Status {
+            if (*should_fail) {
+                return Status::InternalError("mock failure");
+            }
+            out = ip;
+            return Status::OK();
+        };
+    }
+
+private:
+    int32_t _saved_threshold = 0;
+    int32_t _saved_log_every = 0;
+};
+
+// ── existing tests 
────────────────────────────────────────────────────────────
+
+// Sanity: localhost resolves successfully and is cached.
+TEST_F(DNSCacheTest, resolve_localhost) {
+    DNSCache cache;
+    std::string ip;

Review Comment:
   These tests default-construct `DNSCache`, which starts the production 
refresh thread. The destructor sets `stop_refresh` and then joins, but the 
thread is usually inside `sleep_for(std::chrono::minutes(1))`, so every test 
that uses this constructor can block for up to a minute during teardown. This 
file has four such tests, making the BE UT suite spend several minutes just 
waiting for DNSCache destructors. Please use the injected-resolver constructor 
for unit tests (or make the production thread interruptible) so tests do not 
depend on the 60s background sleep.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to