[GitHub] [storm] srdo commented on a change in pull request #3092: STORM-3474 Large fragmented cluster scheduling time test

GitBox Thu, 01 Aug 2019 03:06:32 -0700

srdo commented on a change in pull request #3092: STORM-3474 Large fragmented cluster scheduling time test
URL: https://github.com/apache/storm/pull/3092#discussion_r309610927


 ##########
 File path: storm-server/src/test/java/org/apache/storm/scheduler/resource/TestResourceAwareScheduler.java
 ##########
 @@ -1067,6 +1068,208 @@ public void minCpuWorkerSplitFails() {
         assertFalse("Topo-1 unscheduled?", cluster.getAssignmentById(topo1.getId()) != null);
     }
 
+    protected static class TimeBlockResult {
+        long firstBlockTime;
+        long lastBlockTime;
+    }
+
+    private long getMedianBlockTime(TimeBlockResult[] runResults, boolean firstBlock) {
+        final int numRuns = runResults.length;
+        assert(numRuns % 2 == 1);     // number of runs must be odd to compute median as below
+        long[] times = new long[numRuns];
+        for (int i = 0; i < numRuns; ++i) {
+            times[i] = firstBlock ? runResults[i].firstBlockTime : runResults[i].lastBlockTime;
+        }
+        Arrays.sort(times);
+
+        final int medianIndex = (int) Math.floor(numRuns / 2);
+        return times[medianIndex];
+    }
+
+    /**
+     * Check time to schedule a fragmented cluster using different strategies
+     *
+     * Simulate scheduling on a large production cluster. Find the ratio of time to schedule a set of topologies when
+     * the cluster is empty and when the cluster is nearly full. While the cluster has sufficient resources to schedule
+     * all topologies, when nearly full the cluster becomes fragmented and some topologies fail to schedule.
+     */
+    @Test
+    public void TestLargeFragmentedClusterScheduling() {
+        /*
+        Without fragmentation, the cluster would be able to schedule both topologies on each node. Let's call each node
+        with both topologies scheduled as 100% scheduled.
+
+        We schedule the cluster in 3 blocks of topologies, measuring the time to schedule the blocks. The first, middle
+        and last blocks attempt to schedule the following 0-10%, 10%-90%, 90%-100%. The last block has a number of
+        scheduling failures due to cluster fragmentation and its time is dominated by attempting to evict topologies.
+
+        Timing results for scheduling are noisy. As a result, we do multiple runs and use median values for FirstBlock
+        and LastBlock times. (somewhere a statistician is crying). The ratio of LastBlock / FirstBlock remains fairly constant.
+
+
+        TestLargeFragmentedClusterScheduling took 91118 ms
+        DefaultResourceAwareStrategy, FirstBlock 249.0, LastBlock 1734.0 ratio 6.963855421686747
+        GenericResourceAwareStrategy, FirstBlock 215.0, LastBlock 1673.0 ratio 7.78139534883721
+        ConstraintSolverStrategy, FirstBlock 279.0, LastBlock 2200.0 ratio 7.885304659498208
+
+        TestLargeFragmentedClusterScheduling took 98455 ms
+        DefaultResourceAwareStrategy, FirstBlock 266.0, LastBlock 1812.0 ratio 6.81203007518797
+        GenericResourceAwareStrategy, FirstBlock 235.0, LastBlock 1802.0 ratio 7.6680851063829785
+        ConstraintSolverStrategy, FirstBlock 304.0, LastBlock 2320.0 ratio 7.631578947368421
+
+        TestLargeFragmentedClusterScheduling took 97268 ms
+        DefaultResourceAwareStrategy, FirstBlock 251.0, LastBlock 1826.0 ratio 7.274900398406374
+        GenericResourceAwareStrategy, FirstBlock 220.0, LastBlock 1719.0 ratio 7.8136363636363635
+        ConstraintSolverStrategy, FirstBlock 296.0, LastBlock 2469.0 ratio 8.341216216216216
+
+        TestLargeFragmentedClusterScheduling took 97963 ms
+        DefaultResourceAwareStrategy, FirstBlock 249.0, LastBlock 1788.0 ratio 7.180722891566265
+        GenericResourceAwareStrategy, FirstBlock 240.0, LastBlock 1796.0 ratio 7.483333333333333
+        ConstraintSolverStrategy, FirstBlock 328.0, LastBlock 2544.0 ratio 7.7560975609756095
+
+        TestLargeFragmentedClusterScheduling took 93106 ms
+        DefaultResourceAwareStrategy, FirstBlock 258.0, LastBlock 1714.0 ratio 6.6434108527131785
+        GenericResourceAwareStrategy, FirstBlock 215.0, LastBlock 1692.0 ratio 7.869767441860465
+        ConstraintSolverStrategy, FirstBlock 309.0, LastBlock 2342.0 ratio 7.5792880258899675
+
+        Choose the median value of the values above
+        DefaultResourceAwareStrategy    6.96
+        GenericResourceAwareStrategy    7.78
+        ConstraintSolverStrategy        7.75
+        */
+
+        final int numNodes = 500;
+        final String[] strategies = new String[]{
+                DefaultResourceAwareStrategy.class.getName(),
+                GenericResourceAwareStrategy.class.getName(),
+                ConstraintSolverStrategy.class.getName()
+        };
+
+        final int numStrategies = strategies.length;
+        final int numRuns = 5;
+        TimeBlockResult testResults[][] = new TimeBlockResult[numStrategies][numRuns];
+
+        // Get first and last block times for multiple runs and strategies
+        long startTime = Time.currentTimeMillis();
+        for (int strategyIdx = 0; strategyIdx < numStrategies; ++strategyIdx) {
+            String strategy = strategies[strategyIdx];
+
+            for (int run = 0; run < numRuns; ++run) {
+                testResults[strategyIdx][run] = testLargeClusterSchedulingTiming(numNodes, strategy);
+            }
+        }
+
+        // Log median ratios for different strategies
+        LOG.info("TestLargeFragmentedClusterScheduling took {} ms", Time.currentTimeMillis() - startTime);
+        for (int strategyIdx = 0; strategyIdx < numStrategies; ++strategyIdx) {
+            double medianFirstBlockTime = getMedianBlockTime(testResults[strategyIdx], true);
+            double medianLastBlockTime = getMedianBlockTime(testResults[strategyIdx], false);
+            double ratio = medianLastBlockTime / medianFirstBlockTime;
+            LOG.info("{}, FirstBlock {}, LastBlock {} ratio {}", strategies[strategyIdx], medianFirstBlockTime, medianLastBlockTime, ratio);
+        }
+
+        // Check last block scheduling time does not get significantly slower
+        final double[] acceptedStrategyTimeRatios = {6.96, 7.78, 7.75};
+        for (int strategyIdx = 0; strategyIdx < numStrategies; ++strategyIdx) {
+            double medianFirstBlockTime = getMedianBlockTime(testResults[strategyIdx], true);
+            double medianLastBlockTime = getMedianBlockTime(testResults[strategyIdx], false);
+            double ratio = medianLastBlockTime / medianFirstBlockTime;
+
+            double slowSchedulingThreshold = 1.5;
+            assert(ratio < slowSchedulingThreshold * acceptedStrategyTimeRatios[strategyIdx]);
 
 Review comment:
   I think you should use JUnit/Hamcrest assertThat here and include an error message that contains the strategy name. Otherwise, when this fails, you don't know which strategy is slow or what numbers caused the failure.
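
   A minimal sketch of the kind of failure message meant here. It is written in plain Java so it runs without any test library. The class and method names (SchedulingRatioCheck, failureMessage, checkRatio) are hypothetical and not part of the PR. In the actual test, the same message string could be passed as the reason argument of Hamcrest's assertThat(reason, actual, matcher), for example with a lessThan matcher on the ratio.

```java
import java.util.Locale;

public class SchedulingRatioCheck {

    // Hypothetical helper: builds a message that names the strategy and
    // reports the numbers that led to the failure.
    static String failureMessage(String strategy, double firstBlockMs, double lastBlockMs,
                                 double ratio, double limit) {
        return String.format(Locale.ROOT,
            "Strategy %s scheduling is too slow: FirstBlock %.1f ms, LastBlock %.1f ms, "
                + "ratio %.2f, limit %.2f",
            strategy, firstBlockMs, lastBlockMs, ratio, limit);
    }

    // Hypothetical stand-in for the assertion in the test loop. It throws an
    // AssertionError carrying the descriptive message when the ratio is not
    // below slowSchedulingThreshold * acceptedRatio.
    static void checkRatio(String strategy, double firstBlockMs, double lastBlockMs,
                           double acceptedRatio, double slowSchedulingThreshold) {
        double ratio = lastBlockMs / firstBlockMs;
        double limit = slowSchedulingThreshold * acceptedRatio;
        if (!(ratio < limit)) {
            throw new AssertionError(
                failureMessage(strategy, firstBlockMs, lastBlockMs, ratio, limit));
        }
    }

    public static void main(String[] args) {
        // Values taken from the first sample run quoted in the test comment.
        checkRatio("DefaultResourceAwareStrategy", 249.0, 1734.0, 6.96, 1.5);
        System.out.println("ratio within limit");
    }
}
```

   With a message like this, a failing run reports the strategy, both block times, the computed ratio and the limit, instead of a bare AssertionError.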

