tarun11Mavani commented on code in PR #17126:
URL: https://github.com/apache/pinot/pull/17126#discussion_r2489640817
##########
pinot-controller/src/main/java/org/apache/pinot/controller/helix/core/minion/generator/BaseTaskGenerator.java:
##########
@@ -217,4 +220,52 @@ public Map<String, String> getBaseTaskConfigs(TableConfig
tableConfig, List<Stri
MinionConstants.SEGMENT_NAME_SEPARATOR));
return baseConfigs;
}
+
+ /**
+ * Selects random items from a sorted list using reservoir sampling for efficiency.
+ * This method is useful for segment selection with randomization to avoid contention.
+ *
+ * @param sortedItems List of items paired with their priority values (e.g., invalid record count)
+ * @param maxItems Maximum number of items to select
+ * @param randomizationFactor Factor to expand candidate pool (e.g., 2.0 = select from top 2x items)
+ * @param <T> Type of items to select
+ * @return List of randomly selected items
+ */
+ public static <T> List<T> selectRandomItems(List<Pair<T, Long>> sortedItems,
+ int maxItems, double randomizationFactor) {
+ if (randomizationFactor <= 1.0 || maxItems <= 0 || sortedItems.isEmpty()) {
+ // No randomization, return top items
+ return sortedItems.stream().limit(maxItems).map(Pair::getKey).collect(Collectors.toList());
+ }
+
+ // Calculate expanded candidate pool size
+ int candidatePoolSize = Math.min((int) Math.ceil(maxItems * randomizationFactor), sortedItems.size());
+
+ // Get top candidates based on the expanded pool size
+ List<Pair<T, Long>> candidates = sortedItems.subList(0, candidatePoolSize);
+
+ // Use reservoir sampling to efficiently select random items
+ List<Pair<T, Long>> selectedCandidates = new ArrayList<>(maxItems);
+ Random random = new Random();
Review Comment:
A pseudo-random generator is fine here, since the table state and the valid doc
distribution are constantly changing.
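
For context, here is a minimal sketch of reservoir sampling over an expanded candidate pool, as described in the Javadoc above. It is not the PR's implementation: the hunk is cut off after `new Random()`, so the class name `ReservoirSamplingSketch`, the method `sampleFromTop`, and the injected `Random` parameter are all hypothetical, and the sketch works on plain items rather than `Pair<T, Long>` to stay self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class ReservoirSamplingSketch {

  // Hypothetical helper (not the PR's code): picks up to maxItems elements uniformly at
  // random from the first ceil(maxItems * randomizationFactor) entries of a sorted list.
  public static <T> List<T> sampleFromTop(List<T> sortedItems, int maxItems,
      double randomizationFactor, Random random) {
    if (maxItems <= 0 || sortedItems.isEmpty()) {
      return new ArrayList<>();
    }
    if (randomizationFactor <= 1.0) {
      // No randomization: return the top items in their original order.
      return new ArrayList<>(sortedItems.subList(0, Math.min(maxItems, sortedItems.size())));
    }
    int poolSize = Math.min((int) Math.ceil(maxItems * randomizationFactor), sortedItems.size());

    // Algorithm R: fill the reservoir with the first maxItems candidates, then let each
    // later candidate i replace a random slot with probability maxItems / (i + 1).
    List<T> reservoir = new ArrayList<>(Math.min(maxItems, poolSize));
    for (int i = 0; i < poolSize; i++) {
      T item = sortedItems.get(i);
      if (i < maxItems) {
        reservoir.add(item);
      } else {
        int j = random.nextInt(i + 1);
        if (j < maxItems) {
          reservoir.set(j, item);
        }
      }
    }
    return reservoir;
  }

  public static void main(String[] args) {
    // Illustrative segment names only; sorted by priority, highest first.
    List<String> segments = List.of("seg_0", "seg_1", "seg_2", "seg_3", "seg_4", "seg_5");
    // Selects 2 segments at random from the top 4 candidates.
    System.out.println(sampleFromTop(segments, 2, 2.0, new Random()));
  }
}
```

Passing the `Random` in as a parameter is a choice made for this sketch so that a test can seed it. An unseeded `java.util.Random` behaves the same way for the purpose discussed here, which is spreading selection across candidates rather than producing unpredictable values.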