imply-cheddar commented on code in PR #13074:
URL: https://github.com/apache/druid/pull/13074#discussion_r974863239


##########
core/src/test/java/org/apache/druid/java/util/metrics/StubServiceEmitter.java:
##########
@@ -37,7 +41,18 @@ public StubServiceEmitter(String service, String host)
   @Override
   public void emit(Event event)
   {
-    events.add(event);
+    if (event instanceof AlertEvent) {
+      final AlertEvent alertEvent = (AlertEvent) event;

Review Comment:
   This seems like it is attempting to use logs to validate that alert events 
were fired?  What's wrong with having the AlertEvents in the list?  Or, maybe, 
have 2 lists, one for metrics and one for alerts?



##########
server/src/main/java/org/apache/druid/server/coordinator/DruidCoordinator.java:
##########
@@ -973,6 +973,14 @@ public List<? extends CoordinatorDuty> getDuties()
     {
       return duties;
     }
+
+    @Override
+    public String toString()
+    {
+      return "DutiesRunnable{" +
+             "dutiesRunnableAlias='" + dutiesRunnableAlias + '\'' +
+             '}';
+    }

Review Comment:
   I know this comment isn't about your code, but your addition of the 
`toString` here made me wonder why `DruidCoordinator's` `toString` reads as 
"DutiesRunnable".  The class is probably large enough (and already depended 
upon, see `@VisibleForTesting` annotation peppering this code) that maybe it's 
just time to promote it to its own class. 



##########
server/src/test/java/org/apache/druid/server/coordinator/simulate/TestSegmentLoadingHttpClient.java:
##########
@@ -0,0 +1,168 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.server.coordinator.simulate;
+
+import com.fasterxml.jackson.core.type.TypeReference;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import com.google.common.util.concurrent.ListenableFuture;
+import com.google.common.util.concurrent.ListeningScheduledExecutorService;
+import com.google.common.util.concurrent.MoreExecutors;
+import org.apache.druid.java.util.http.client.HttpClient;
+import org.apache.druid.java.util.http.client.Request;
+import org.apache.druid.java.util.http.client.response.HttpResponseHandler;
+import org.apache.druid.server.coordination.DataSegmentChangeCallback;
+import org.apache.druid.server.coordination.DataSegmentChangeHandler;
+import org.apache.druid.server.coordination.DataSegmentChangeRequest;
+import org.apache.druid.server.coordination.SegmentLoadDropHandler;
+import org.jboss.netty.buffer.ChannelBuffers;
+import org.jboss.netty.handler.codec.http.DefaultHttpResponse;
+import org.jboss.netty.handler.codec.http.HttpResponse;
+import org.jboss.netty.handler.codec.http.HttpResponseStatus;
+import org.jboss.netty.handler.codec.http.HttpVersion;
+import org.joda.time.Duration;
+
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.util.List;
+import java.util.concurrent.ScheduledExecutorService;
+import java.util.function.Function;
+import java.util.stream.Collectors;
+
+public class TestSegmentLoadingHttpClient implements HttpClient
+{
+  private static final HttpResponseHandler.TrafficCop NOOP_TRAFFIC_COP = 
checkNum -> 0L;
+  private static final DataSegmentChangeCallback NOOP_CALLBACK = () -> {
+  };
+
+  private final ObjectMapper objectMapper;
+  private final Function<String, DataSegmentChangeHandler> hostToHandler;
+
+  private final ListeningScheduledExecutorService executorService;
+
+  public TestSegmentLoadingHttpClient(
+      ObjectMapper objectMapper,
+      Function<String, DataSegmentChangeHandler> hostToHandler,
+      ScheduledExecutorService executorService
+  )
+  {
+    this.objectMapper = objectMapper;
+    this.hostToHandler = hostToHandler;
+    this.executorService = MoreExecutors.listeningDecorator(executorService);
+  }
+
+  @Override
+  public <Intermediate, Final> ListenableFuture<Final> go(
+      Request request,
+      HttpResponseHandler<Intermediate, Final> handler
+  )
+  {
+    throw new UnsupportedOperationException();

Review Comment:
   Perhaps overly defensive, but I think that this can be an UOE with a message 
about all expected usages going through the 3-argument call.  That way, if this 
actually does end up getting called at some point in time, the developer will 
have an idea for what assumption broke.



##########
server/src/test/java/org/apache/druid/server/coordinator/simulate/README.md:
##########
@@ -0,0 +1,141 @@
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+# Coordinator simulations
+
+The simulation framework allows developers to recreate arbitrary cluster 
setups and verify coordinator behaviour. Tests
+written using the framework can also help identify performance bottlenecks or 
potential bugs in the system and even
+compare different balancing strategies.
+
+As opposed to unit tests, simulations are meant to test the coordinator as a 
whole and verify the interactions of all
+the underlying parts. In that regard, these simulations resemble integration 
tests more closely.
+
+## Test targets
+
+The primary test target is the `DruidCoordinator` itself. The behaviour of the 
following entities can also be verified
+using simulations:
+
+- `LoadQueuePeon`, `LoadQueueTaskMaster`
+- All coordinator duties, e.g. `BalanceSegments`, `RunRules`
+- All retention rules
+
+## Capabilities
+
+The framework provides control over the following aspects of the setup:
+
+| Input | Details | Actions |
+|-------|---------|---------|
+|cluster | server name, type, tier, size | add a server, remove a server|
+|segment |datasource, interval, version, partition num, size | add/remove from 
server, mark used/unused, publish new segments|
+|rules | type (foreverLoad, drop, etc), replica count per tier | set rules for 
a datasource| 
+|configs |coordinator period, load queue type, load queue size, max segments 
to balance | set or update a config |
+
+The above actions can be performed at any point after building the simulation. 
So, you could even recreate scenarios
+where during a coordinator run, a server crashes or the retention rules of a 
datasource change, and verify the behaviour
+of the coordinator in these situations.
+
+## Design
+
+1. __Execution__: A tight dependency on time durations such as the period of a 
repeating task or the delay before a
+   scheduled task makes it difficult to reliably reproduce a test scenario. As 
a result, the tests become flaky. Thus,
+   all the executors required for coordinator operations have been allowed 
only two possible modes of execution:
+    - __immediate__: Execute tasks on the calling thread itself.
+    - __blocked__: Keep tasks in a queue until explicitly invoked.
+2. __Internal dependencies__: In order to allow realistic reproductions of the 
coordinator behaviour, none of the
+   internal parts of the coordinator have been mocked in the framework and new 
tests need not mock anything at all.
+3. __External dependencies__: Since these tests are meant to verify the 
behaviour of only the coordinator, the
+   interfaces to communicate with external dependencies have been provided as 
simple in-memory implementations:
+    - communication with metadata store: `SegmentMetadataManager`, 
`MetadataRuleManager`
+    - communication with historicals: `HttpClient`, `ServerInventoryView`
+4. __Inventory__: The coordinator maintains an inventory view of the cluster 
state. Simulations can choose from two
+   modes of inventory update - auto and manual. In auto update mode, any 
change made to the cluster is immediately
+   reflected in the inventory view. In manual update mode, the inventory must 
be explicitly synchronized with the
+   cluster state.
+
+## Limitations
+
+- The framework does not expose the coordinator HTTP endpoints.
+- It should not be used to verify the absolute values of execution latencies, 
e.g. the time taken to compute the
+  balancing cost of a segment. But the relative values can still be a good 
indicator while doing comparisons between,
+  say two balancing strategies.

Review Comment:
   What's wrong with trying to use it to benchmark the execution latencies of 
different balancing strategies?  
   
   Or.  What's the different between "verify the absolute values of execution 
latencies" and "be a good indicator while doing comparisons between, say, two 
balancing strategies"?



##########
server/src/main/java/org/apache/druid/server/coordinator/duty/BalanceSegments.java:
##########
@@ -92,6 +92,7 @@ private void balanceTier(
   )
   {
 
+    log.info("Balancing segments in tier [%s]", tier);

Review Comment:
   Is there any more information that can be added to this?  Having just the 
fact that the balancing occurred is useful, but if we can like add sizes or 
anything else that might be nice to have when trying to understand what 
happened, that can make it even more useful.



##########
server/src/test/java/org/apache/druid/server/coordinator/simulate/CoordinatorSimulationBaseTest.java:
##########
@@ -0,0 +1,302 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.server.coordinator.simulate;
+
+import org.apache.druid.client.DruidServer;
+import org.apache.druid.java.util.common.granularity.Granularities;
+import org.apache.druid.java.util.emitter.core.Event;
+import org.apache.druid.server.coordination.ServerType;
+import org.apache.druid.server.coordinator.CoordinatorDynamicConfig;
+import org.apache.druid.server.coordinator.CreateDataSegments;
+import org.apache.druid.server.coordinator.rules.ForeverLoadRule;
+import org.apache.druid.server.coordinator.rules.Rule;
+import org.apache.druid.timeline.DataSegment;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Before;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Base test for coordinator simulations.
+ * <p>
+ * Each test must call {@link #startSimulation(CoordinatorSimulation)} to start
+ * the simulation. {@link CoordinatorSimulation#stop()} should not be called as
+ * the simulation is stopped when cleaning up after the test in {@link 
#tearDown()}.
+ * <p>
+ * Tests that verify balancing behaviour should set
+ * {@link CoordinatorDynamicConfig#useBatchedSegmentSampler()} to true.
+ * Otherwise, the segment sampling is random and can produce repeated values
+ * leading to flakiness in the tests. The simulation sets this field to true by
+ * default.

Review Comment:
   Part of me wonders if this comment doesn't (also?) belong on 
`createDynamicConfig`?



##########
server/src/test/java/org/apache/druid/server/coordinator/simulate/CoordinatorSimulationBaseTest.java:
##########
@@ -0,0 +1,299 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.server.coordinator.simulate;
+
+import org.apache.druid.client.DruidServer;
+import org.apache.druid.java.util.common.granularity.Granularities;
+import org.apache.druid.java.util.emitter.core.Event;
+import org.apache.druid.server.coordination.ServerType;
+import org.apache.druid.server.coordinator.CoordinatorDynamicConfig;
+import org.apache.druid.server.coordinator.CreateDataSegments;
+import org.apache.druid.server.coordinator.rules.ForeverLoadRule;
+import org.apache.druid.server.coordinator.rules.Rule;
+import org.apache.druid.timeline.DataSegment;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Before;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Base test for coordinator simulations.
+ * <p>
+ * Each test must call {@link #startSimulation(CoordinatorSimulation)} to start
+ * the simulation. {@link CoordinatorSimulation#stop()} should not be called as
+ * the simulation is stopped when cleaning up after the test in {@link 
#tearDown()}.
+ * <p>
+ * Tests that verify balancing behaviour should set
+ * {@link CoordinatorDynamicConfig#useBatchedSegmentSampler()} to true.
+ * Otherwise, the segment sampling is random and can produce repeated values
+ * leading to flakiness in the tests. The simulation sets this field to true by
+ * default.
+ */
+public abstract class CoordinatorSimulationBaseTest

Review Comment:
   You could also have the fixture and then also have a BaseTest implemented 
using the fixture.  Then you are effectively actually using a composable 
fixture for things (and people can fall back to that if need be), but still 
don't have to repeat `fixture.` in all of the places.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to