imply-cheddar commented on code in PR #13074:
URL: https://github.com/apache/druid/pull/13074#discussion_r974863239
##########
core/src/test/java/org/apache/druid/java/util/metrics/StubServiceEmitter.java:
##########
@@ -37,7 +41,18 @@ public StubServiceEmitter(String service, String host)
@Override
public void emit(Event event)
{
- events.add(event);
+ if (event instanceof AlertEvent) {
+ final AlertEvent alertEvent = (AlertEvent) event;
Review Comment:
This seems like it is attempting to use logs to validate that alert events
were fired? What's wrong with having the AlertEvents in the list? Or, maybe,
have 2 lists, one for metrics and one for alerts?
##########
server/src/main/java/org/apache/druid/server/coordinator/DruidCoordinator.java:
##########
@@ -973,6 +973,14 @@ public List<? extends CoordinatorDuty> getDuties()
{
return duties;
}
+
+ @Override
+ public String toString()
+ {
+ return "DutiesRunnable{" +
+ "dutiesRunnableAlias='" + dutiesRunnableAlias + '\'' +
+ '}';
+ }
Review Comment:
I know this comment isn't about your code, but your addition of the
`toString` here made me wonder why `DruidCoordinator's` `toString` reads as
"DutiesRunnable". The class is probably large enough (and already depended
upon, see `@VisibleForTesting` annotation peppering this code) that maybe it's
just time to promote it to its own class.
##########
server/src/test/java/org/apache/druid/server/coordinator/simulate/TestSegmentLoadingHttpClient.java:
##########
@@ -0,0 +1,168 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.server.coordinator.simulate;
+
+import com.fasterxml.jackson.core.type.TypeReference;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import com.google.common.util.concurrent.ListenableFuture;
+import com.google.common.util.concurrent.ListeningScheduledExecutorService;
+import com.google.common.util.concurrent.MoreExecutors;
+import org.apache.druid.java.util.http.client.HttpClient;
+import org.apache.druid.java.util.http.client.Request;
+import org.apache.druid.java.util.http.client.response.HttpResponseHandler;
+import org.apache.druid.server.coordination.DataSegmentChangeCallback;
+import org.apache.druid.server.coordination.DataSegmentChangeHandler;
+import org.apache.druid.server.coordination.DataSegmentChangeRequest;
+import org.apache.druid.server.coordination.SegmentLoadDropHandler;
+import org.jboss.netty.buffer.ChannelBuffers;
+import org.jboss.netty.handler.codec.http.DefaultHttpResponse;
+import org.jboss.netty.handler.codec.http.HttpResponse;
+import org.jboss.netty.handler.codec.http.HttpResponseStatus;
+import org.jboss.netty.handler.codec.http.HttpVersion;
+import org.joda.time.Duration;
+
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.util.List;
+import java.util.concurrent.ScheduledExecutorService;
+import java.util.function.Function;
+import java.util.stream.Collectors;
+
+public class TestSegmentLoadingHttpClient implements HttpClient
+{
+ private static final HttpResponseHandler.TrafficCop NOOP_TRAFFIC_COP =
checkNum -> 0L;
+ private static final DataSegmentChangeCallback NOOP_CALLBACK = () -> {
+ };
+
+ private final ObjectMapper objectMapper;
+ private final Function<String, DataSegmentChangeHandler> hostToHandler;
+
+ private final ListeningScheduledExecutorService executorService;
+
+ public TestSegmentLoadingHttpClient(
+ ObjectMapper objectMapper,
+ Function<String, DataSegmentChangeHandler> hostToHandler,
+ ScheduledExecutorService executorService
+ )
+ {
+ this.objectMapper = objectMapper;
+ this.hostToHandler = hostToHandler;
+ this.executorService = MoreExecutors.listeningDecorator(executorService);
+ }
+
+ @Override
+ public <Intermediate, Final> ListenableFuture<Final> go(
+ Request request,
+ HttpResponseHandler<Intermediate, Final> handler
+ )
+ {
+ throw new UnsupportedOperationException();
Review Comment:
Perhaps overly defensive, but I think that this can be an UOE with a message
about all expected usages going through the 3-argument call. That way, if this
actually does end up getting called at some point in time, the developer will
have an idea for what assumption broke.
##########
server/src/test/java/org/apache/druid/server/coordinator/simulate/README.md:
##########
@@ -0,0 +1,141 @@
+<!--
+ ~ Licensed to the Apache Software Foundation (ASF) under one
+ ~ or more contributor license agreements. See the NOTICE file
+ ~ distributed with this work for additional information
+ ~ regarding copyright ownership. The ASF licenses this file
+ ~ to you under the Apache License, Version 2.0 (the
+ ~ "License"); you may not use this file except in compliance
+ ~ with the License. You may obtain a copy of the License at
+ ~
+ ~ http://www.apache.org/licenses/LICENSE-2.0
+ ~
+ ~ Unless required by applicable law or agreed to in writing,
+ ~ software distributed under the License is distributed on an
+ ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ ~ KIND, either express or implied. See the License for the
+ ~ specific language governing permissions and limitations
+ ~ under the License.
+ -->
+
+# Coordinator simulations
+
+The simulation framework allows developers to recreate arbitrary cluster
setups and verify coordinator behaviour. Tests
+written using the framework can also help identify performance bottlenecks or
potential bugs in the system and even
+compare different balancing strategies.
+
+As opposed to unit tests, simulations are meant to test the coordinator as a
whole and verify the interactions of all
+the underlying parts. In that regard, these simulations resemble integration
tests more closely.
+
+## Test targets
+
+The primary test target is the `DruidCoordinator` itself. The behaviour of the
following entities can also be verified
+using simulations:
+
+- `LoadQueuePeon`, `LoadQueueTaskMaster`
+- All coordinator duties, e.g. `BalanceSegments`, `RunRules`
+- All retention rules
+
+## Capabilities
+
+The framework provides control over the following aspects of the setup:
+
+| Input | Details | Actions |
+|-------|---------|---------|
+|cluster | server name, type, tier, size | add a server, remove a server|
+|segment |datasource, interval, version, partition num, size | add/remove from
server, mark used/unused, publish new segments|
+|rules | type (foreverLoad, drop, etc), replica count per tier | set rules for
a datasource|
+|configs |coordinator period, load queue type, load queue size, max segments
to balance | set or update a config |
+
+The above actions can be performed at any point after building the simulation.
So, you could even recreate scenarios
+where during a coordinator run, a server crashes or the retention rules of a
datasource change, and verify the behaviour
+of the coordinator in these situations.
+
+## Design
+
+1. __Execution__: A tight dependency on time durations such as the period of a
repeating task or the delay before a
+ scheduled task makes it difficult to reliably reproduce a test scenario. As
a result, the tests become flaky. Thus,
+ all the executors required for coordinator operations have been allowed
only two possible modes of execution:
+ - __immediate__: Execute tasks on the calling thread itself.
+ - __blocked__: Keep tasks in a queue until explicitly invoked.
+2. __Internal dependencies__: In order to allow realistic reproductions of the
coordinator behaviour, none of the
+ internal parts of the coordinator have been mocked in the framework and new
tests need not mock anything at all.
+3. __External dependencies__: Since these tests are meant to verify the
behaviour of only the coordinator, the
+ interfaces to communicate with external dependencies have been provided as
simple in-memory implementations:
+ - communication with metadata store: `SegmentMetadataManager`,
`MetadataRuleManager`
+ - communication with historicals: `HttpClient`, `ServerInventoryView`
+4. __Inventory__: The coordinator maintains an inventory view of the cluster
state. Simulations can choose from two
+ modes of inventory update - auto and manual. In auto update mode, any
change made to the cluster is immediately
+ reflected in the inventory view. In manual update mode, the inventory must
be explicitly synchronized with the
+ cluster state.
+
+## Limitations
+
+- The framework does not expose the coordinator HTTP endpoints.
+- It should not be used to verify the absolute values of execution latencies,
e.g. the time taken to compute the
+ balancing cost of a segment. But the relative values can still be a good
indicator while doing comparisons between,
+ say two balancing strategies.
Review Comment:
What's wrong with trying to use it to benchmark the execution latencies of
different balancing strategies?
Or. What's the different between "verify the absolute values of execution
latencies" and "be a good indicator while doing comparisons between, say, two
balancing strategies"?
##########
server/src/main/java/org/apache/druid/server/coordinator/duty/BalanceSegments.java:
##########
@@ -92,6 +92,7 @@ private void balanceTier(
)
{
+ log.info("Balancing segments in tier [%s]", tier);
Review Comment:
Is there any more information that can be added to this? Having just the
fact that the balancing occurred is useful, but if we can like add sizes or
anything else that might be nice to have when trying to understand what
happened, that can make it even more useful.
##########
server/src/test/java/org/apache/druid/server/coordinator/simulate/CoordinatorSimulationBaseTest.java:
##########
@@ -0,0 +1,302 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.server.coordinator.simulate;
+
+import org.apache.druid.client.DruidServer;
+import org.apache.druid.java.util.common.granularity.Granularities;
+import org.apache.druid.java.util.emitter.core.Event;
+import org.apache.druid.server.coordination.ServerType;
+import org.apache.druid.server.coordinator.CoordinatorDynamicConfig;
+import org.apache.druid.server.coordinator.CreateDataSegments;
+import org.apache.druid.server.coordinator.rules.ForeverLoadRule;
+import org.apache.druid.server.coordinator.rules.Rule;
+import org.apache.druid.timeline.DataSegment;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Before;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Base test for coordinator simulations.
+ * <p>
+ * Each test must call {@link #startSimulation(CoordinatorSimulation)} to start
+ * the simulation. {@link CoordinatorSimulation#stop()} should not be called as
+ * the simulation is stopped when cleaning up after the test in {@link
#tearDown()}.
+ * <p>
+ * Tests that verify balancing behaviour should set
+ * {@link CoordinatorDynamicConfig#useBatchedSegmentSampler()} to true.
+ * Otherwise, the segment sampling is random and can produce repeated values
+ * leading to flakiness in the tests. The simulation sets this field to true by
+ * default.
Review Comment:
Part of me wonders if this comment doesn't (also?) belong on
`createDynamicConfig`?
##########
server/src/test/java/org/apache/druid/server/coordinator/simulate/CoordinatorSimulationBaseTest.java:
##########
@@ -0,0 +1,299 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.server.coordinator.simulate;
+
+import org.apache.druid.client.DruidServer;
+import org.apache.druid.java.util.common.granularity.Granularities;
+import org.apache.druid.java.util.emitter.core.Event;
+import org.apache.druid.server.coordination.ServerType;
+import org.apache.druid.server.coordinator.CoordinatorDynamicConfig;
+import org.apache.druid.server.coordinator.CreateDataSegments;
+import org.apache.druid.server.coordinator.rules.ForeverLoadRule;
+import org.apache.druid.server.coordinator.rules.Rule;
+import org.apache.druid.timeline.DataSegment;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Before;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Base test for coordinator simulations.
+ * <p>
+ * Each test must call {@link #startSimulation(CoordinatorSimulation)} to start
+ * the simulation. {@link CoordinatorSimulation#stop()} should not be called as
+ * the simulation is stopped when cleaning up after the test in {@link
#tearDown()}.
+ * <p>
+ * Tests that verify balancing behaviour should set
+ * {@link CoordinatorDynamicConfig#useBatchedSegmentSampler()} to true.
+ * Otherwise, the segment sampling is random and can produce repeated values
+ * leading to flakiness in the tests. The simulation sets this field to true by
+ * default.
+ */
+public abstract class CoordinatorSimulationBaseTest
Review Comment:
You could also have the fixture and then also have a BaseTest implemented
using the fixture. Then you are effectively actually using a composable
fixture for things (and people can fall back to that if need be), but still
don't have to repeat `fixture.` in all of the places.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]