clintropolis commented on code in PR #18939:
URL: https://github.com/apache/druid/pull/18939#discussion_r2795858453


##########
indexing-service/src/main/java/org/apache/druid/indexing/common/task/NativeCompactionRunner.java:
##########
@@ -89,6 +89,14 @@ public CompactionConfigValidationResult 
validateCompactionTask(
       Map<Interval, DataSchema> intervalDataSchemaMap
   )
   {
+    // Virtual columns in filter rules are not supported by native compaction
+    if (compactionTask.getTransformSpec() != null
+        && compactionTask.getTransformSpec().getVirtualColumns() != null
+        && 
compactionTask.getTransformSpec().getVirtualColumns().getVirtualColumns().length
 > 0) {

Review Comment:
   could also consider using `isEmpty()`



##########
processing/src/main/java/org/apache/druid/segment/transform/CompactionTransformSpec.java:
##########
@@ -46,17 +47,20 @@ public static CompactionTransformSpec of(@Nullable 
TransformSpec transformSpec)
       return null;
     }
 
-    return new CompactionTransformSpec(transformSpec.getFilter());
+    return new CompactionTransformSpec(transformSpec.getFilter(), null);
   }
 
   @Nullable private final DimFilter filter;
+  @Nullable private final VirtualColumns virtualColumns;
 
   @JsonCreator
   public CompactionTransformSpec(
-      @JsonProperty("filter") final DimFilter filter
+      @JsonProperty("filter") final DimFilter filter,
+      @JsonProperty("virtualColumns") final VirtualColumns virtualColumns

Review Comment:
   this feels like a weird place for this to me but maybe i keep getting hung 
up on this just because of how stuff is laid out internally where transform 
spec filters and transforms happen at a layer below where virtual columns 
operate.
   
   i think really my main beef is really with DataSourceCompactionConfig, i'd 
like to change it eventually with something that looks more like the v10 
layout, where the base table schema is also defined as a projection, so that 
base table schema would be where the filter and virtual column live in that 
world and the stuff would be the set of projections to build and then the 
context/segment granularity/ioconfig/tuningconfig. 
   
   All that said, this is fine for now i think...



##########
server/src/main/java/org/apache/druid/server/compaction/InlineReindexingRuleProvider.java:
##########
@@ -0,0 +1,421 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.server.compaction;
+
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import org.apache.druid.common.config.Configs;
+import org.joda.time.DateTime;
+import org.joda.time.Interval;
+
+import javax.annotation.Nullable;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.Comparator;
+import java.util.List;
+import java.util.Objects;
+
+/**
+ * Rule provider that returns a static list of rules defined inline in the 
configuration.
+ * <p>
+ * This is the simplest provider implementation, suitable for testing and use 
cases where the number of rules is
+ * relatively small and can be defined directly in the supervisor spec.
+ * <p>
+ * When filtering rules by interval, this provider only returns rules where 
{@link ReindexingRule#appliesTo(Interval, DateTime)}
+ * returns {@link ReindexingRule.AppliesToMode#FULL}. Rules with partial or no 
overlap are excluded.
+ * <p>
+ * For non-additive rule types, when multiple rules fully match an interval, 
only the rule with the oldest threshold
+ * (largest period) is returned. For example, if both a P30D and P90D 
granularity rule match an interval, the P90D
+ * rule is selected because it has the oldest threshold (now - 90 days is 
older than now - 30 days).
+ * <p>
+ * Example usage:
+ * <pre>{@code
+ * {
+ *   "type": "inline",
+ *   "reindexingDeletionRules": [
+ *     {
+ *       "id": "remove-bots-90d",
+ *       "olderThan": "P90D",
+ *       "deleteWhere": {
+ *         "type": "not",
+ *         "field": {
+ *           "type": "equals",
+ *           "column": "is_bot",
+ *           "matchValueType": "STRING"
+ *           "matchValue": "true"
+ *         }
+ *       },
+ *       "description": "Remove bot traffic from segments older than 90 days"
+ *     },
+ *     {
+ *       "id": "remove-low-priority-180d",
+ *       "olderThan": "P180D",
+ *       "deleteWhere": {
+ *         "type": "not",
+ *         "field": {
+ *           {
+ *             "type": "inType",
+ *             "column": "priority",
+ *             "matchValueType": "STRING",
+ *             "sortedValues": ["low", "spam"]
+ *           }
+ *         }
+ *       },
+ *       "description": "Remove low-priority data from segments older than 180 
days"
+ *     }
+ *   ]
+ * }
+ * }</pre>
+ */
+public class InlineReindexingRuleProvider implements ReindexingRuleProvider
+{
+  public static final String TYPE = "inline";
+
+  private final List<ReindexingDeletionRule> reindexingDeletionRules;
+  private final List<ReindexingMetricsRule> reindexingMetricsRules;
+  private final List<ReindexingDimensionsRule> reindexingDimensionsRules;
+  private final List<ReindexingIOConfigRule> reindexingIOConfigRules;
+  private final List<ReindexingProjectionRule> reindexingProjectionRules;
+  private final List<ReindexingSegmentGranularityRule> 
reindexingSegmentGranularityRules;
+  private final List<ReindexingQueryGranularityRule> 
reindexingQueryGranularityRules;
+  private final List<ReindexingTuningConfigRule> reindexingTuningConfigRules;
+
+
+  @JsonCreator
+  public InlineReindexingRuleProvider(
+      @JsonProperty("reindexingDeletionRules") @Nullable 
List<ReindexingDeletionRule> reindexingDeletionRules,
+      @JsonProperty("reindexingMetricsRules") @Nullable 
List<ReindexingMetricsRule> reindexingMetricsRules,
+      @JsonProperty("reindexingDimensionsRules") @Nullable 
List<ReindexingDimensionsRule> reindexingDimensionsRules,
+      @JsonProperty("reindexingIOConfigRules") @Nullable 
List<ReindexingIOConfigRule> reindexingIOConfigRules,
+      @JsonProperty("reindexingProjectionRules") @Nullable 
List<ReindexingProjectionRule> reindexingProjectionRules,
+      @JsonProperty("reindexingSegmentGranularityRules") @Nullable 
List<ReindexingSegmentGranularityRule> reindexingSegmentGranularityRules,
+      @JsonProperty("reindexingQueryGranularityRules") @Nullable 
List<ReindexingQueryGranularityRule> reindexingQueryGranularityRules,
+      @JsonProperty("reindexingTuningConfigRules") @Nullable 
List<ReindexingTuningConfigRule> reindexingTuningConfigRules
+  )

Review Comment:
   also, should we just drop the `reindexing` prefix from all these names?



##########
server/src/main/java/org/apache/druid/server/compaction/InlineReindexingRuleProvider.java:
##########
@@ -0,0 +1,421 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.server.compaction;
+
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import org.apache.druid.common.config.Configs;
+import org.joda.time.DateTime;
+import org.joda.time.Interval;
+
+import javax.annotation.Nullable;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.Comparator;
+import java.util.List;
+import java.util.Objects;
+
+/**
+ * Rule provider that returns a static list of rules defined inline in the 
configuration.
+ * <p>
+ * This is the simplest provider implementation, suitable for testing and use 
cases where the number of rules is
+ * relatively small and can be defined directly in the supervisor spec.
+ * <p>
+ * When filtering rules by interval, this provider only returns rules where 
{@link ReindexingRule#appliesTo(Interval, DateTime)}
+ * returns {@link ReindexingRule.AppliesToMode#FULL}. Rules with partial or no 
overlap are excluded.
+ * <p>
+ * For non-additive rule types, when multiple rules fully match an interval, 
only the rule with the oldest threshold
+ * (largest period) is returned. For example, if both a P30D and P90D 
granularity rule match an interval, the P90D
+ * rule is selected because it has the oldest threshold (now - 90 days is 
older than now - 30 days).
+ * <p>
+ * Example usage:
+ * <pre>{@code
+ * {
+ *   "type": "inline",
+ *   "reindexingDeletionRules": [
+ *     {
+ *       "id": "remove-bots-90d",
+ *       "olderThan": "P90D",
+ *       "deleteWhere": {
+ *         "type": "not",
+ *         "field": {
+ *           "type": "equals",
+ *           "column": "is_bot",
+ *           "matchValueType": "STRING"
+ *           "matchValue": "true"
+ *         }
+ *       },
+ *       "description": "Remove bot traffic from segments older than 90 days"
+ *     },
+ *     {
+ *       "id": "remove-low-priority-180d",
+ *       "olderThan": "P180D",
+ *       "deleteWhere": {
+ *         "type": "not",
+ *         "field": {
+ *           {
+ *             "type": "inType",
+ *             "column": "priority",
+ *             "matchValueType": "STRING",
+ *             "sortedValues": ["low", "spam"]
+ *           }
+ *         }
+ *       },
+ *       "description": "Remove low-priority data from segments older than 180 
days"
+ *     }
+ *   ]
+ * }
+ * }</pre>
+ */
+public class InlineReindexingRuleProvider implements ReindexingRuleProvider
+{
+  public static final String TYPE = "inline";
+
+  private final List<ReindexingDeletionRule> reindexingDeletionRules;
+  private final List<ReindexingMetricsRule> reindexingMetricsRules;
+  private final List<ReindexingDimensionsRule> reindexingDimensionsRules;
+  private final List<ReindexingIOConfigRule> reindexingIOConfigRules;
+  private final List<ReindexingProjectionRule> reindexingProjectionRules;
+  private final List<ReindexingSegmentGranularityRule> 
reindexingSegmentGranularityRules;
+  private final List<ReindexingQueryGranularityRule> 
reindexingQueryGranularityRules;
+  private final List<ReindexingTuningConfigRule> reindexingTuningConfigRules;
+
+
+  @JsonCreator
+  public InlineReindexingRuleProvider(
+      @JsonProperty("reindexingDeletionRules") @Nullable 
List<ReindexingDeletionRule> reindexingDeletionRules,
+      @JsonProperty("reindexingMetricsRules") @Nullable 
List<ReindexingMetricsRule> reindexingMetricsRules,
+      @JsonProperty("reindexingDimensionsRules") @Nullable 
List<ReindexingDimensionsRule> reindexingDimensionsRules,
+      @JsonProperty("reindexingIOConfigRules") @Nullable 
List<ReindexingIOConfigRule> reindexingIOConfigRules,
+      @JsonProperty("reindexingProjectionRules") @Nullable 
List<ReindexingProjectionRule> reindexingProjectionRules,
+      @JsonProperty("reindexingSegmentGranularityRules") @Nullable 
List<ReindexingSegmentGranularityRule> reindexingSegmentGranularityRules,
+      @JsonProperty("reindexingQueryGranularityRules") @Nullable 
List<ReindexingQueryGranularityRule> reindexingQueryGranularityRules,
+      @JsonProperty("reindexingTuningConfigRules") @Nullable 
List<ReindexingTuningConfigRule> reindexingTuningConfigRules

Review Comment:
   while very flexible, i think this might be a bit overly expressive to the 
point that things might be hard to reason about. How about consolidating 
`ReindexingDimensionsRule`, `ReindexingMetricsRule`, 
`ReindexingProjectionRule`, and `ReindexingQueryGranularityRule` into a single 
`ReindexingSchemaRule` that has the option to set all of these things in one 
shot. I think that will make it a lot easier for operators to keep on top of 
what is actually going to happen during reindexing, and still should be 
adequately expressive.



##########
indexing-service/src/main/java/org/apache/druid/indexing/compact/ReindexingDeletionRuleOptimizer.java:
##########
@@ -0,0 +1,190 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.indexing.compact;
+
+import org.apache.druid.java.util.common.logger.Logger;
+import org.apache.druid.query.filter.DimFilter;
+import org.apache.druid.query.filter.NotDimFilter;
+import org.apache.druid.query.filter.OrDimFilter;
+import org.apache.druid.segment.VirtualColumn;
+import org.apache.druid.segment.VirtualColumns;
+import org.apache.druid.segment.metadata.IndexingStateFingerprintMapper;
+import org.apache.druid.server.compaction.CompactionCandidate;
+import org.apache.druid.server.compaction.ReindexingDeletionRule;
+import org.apache.druid.timeline.CompactionState;
+import org.apache.druid.timeline.DataSegment;
+
+import javax.annotation.Nullable;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+/**
+ * Optimization utilities for applying {@link ReindexingDeletionRule}s during 
reindexing
+ * <p>
+ * When reindexing with {@link ReindexingDeletionRule}s, it is possible that 
candidate
+ * segments have already applied some or all of the deletion rules in previous 
reindexing runs. Reapplying such rules would
+ * be wasteful and redundant. This class provides funcionality to optimize the 
set of rules to be applied by
+ * any given reindexing task.
+ */
+public class ReindexingDeletionRuleOptimizer

Review Comment:
   the naming of this makes it more sound like a 'thing' instead of utility 
methods. I wonder should this just be reworked into an implementation of 
`ReindexingConfigFinalizer` instead of a separate class? 



##########
indexing-service/src/main/java/org/apache/druid/indexing/compact/ReindexingConfigFinalizer.java:
##########
@@ -0,0 +1,55 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.indexing.compact;
+
+import org.apache.druid.server.compaction.CompactionCandidate;
+import org.apache.druid.server.coordinator.DataSourceCompactionConfig;
+
+/**
+ * Functional interface for customizing a {@link DataSourceCompactionConfig} 
for a specific
+ * {@link CompactionCandidate} before creating a reindexing job. This allows 
template-specific
+ * logic to be injected without hardcoding behavior in {@link 
CompactionConfigBasedJobTemplate}.
+ * <p>
+ * For example, cascading reindexing templates can use this to optimize filter 
rules based on
+ * the candidate's indexing state, while simpler templates can use the 
identity finalizer.
+ */
+@FunctionalInterface
+public interface ReindexingConfigFinalizer

Review Comment:
   (naming is the worst) i wonder if 'specialize' is a better way to refer to 
this than 'finalizer'. 
   
   Also, I can't help but wonder if this should be part of the 
`ReindexingRuleProvider` since it also handles doing transforms to compaction 
configs, and it would allow implementations the ability to do per candidate 
transforms that are beyond what is built-in to the cascading template.
   
   That said, if we think that we will always want to do the same set of rules 
for cascading template, this is probably fine too



##########
server/src/main/java/org/apache/druid/server/compaction/ReindexingConfigBuilder.java:
##########
@@ -0,0 +1,205 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.server.compaction;
+
+import org.apache.druid.java.util.common.granularity.Granularity;
+import org.apache.druid.java.util.common.logger.Logger;
+import org.apache.druid.query.filter.DimFilter;
+import org.apache.druid.query.filter.NotDimFilter;
+import org.apache.druid.query.filter.OrDimFilter;
+import org.apache.druid.segment.VirtualColumn;
+import org.apache.druid.segment.VirtualColumns;
+import org.apache.druid.segment.transform.CompactionTransformSpec;
+import 
org.apache.druid.server.coordinator.InlineSchemaDataSourceCompactionConfig;
+import org.apache.druid.server.coordinator.UserCompactionTaskGranularityConfig;
+import org.joda.time.DateTime;
+import org.joda.time.Interval;
+
+import javax.annotation.Nullable;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.function.Consumer;
+import java.util.function.Function;
+
+/**
+ * Builds compaction configs by applying reindexing rules.
+ * Encapsulates the logic for combining additive rules and applying all rule 
types.
+ */
+public class ReindexingConfigBuilder

Review Comment:
   I sort of wonder if we need this, or if it should just be built into a apply 
method of ReindexingRuleProvider so that it can fully decide how to transform 
the InlineSchemaDataSourceCompactionConfig.Builder. Then again, maybe it is 
fine if we imagine that all rule providers will always do the transform in the 
same way?



##########
server/src/main/java/org/apache/druid/server/compaction/ReindexingRule.java:
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.server.compaction;
+
+import org.joda.time.DateTime;
+import org.joda.time.Interval;
+import org.joda.time.Period;
+
+import javax.annotation.Nullable;
+
+/**
+ * Defines a reindexing configuration that applies to data based on age 
thresholds.
+ * <p>
+ * Rules encapsulate specific aspects of reindexing (granularity, filters, 
tuning, etc.)
+ * and specify when they should apply via a {@link Period} which defines the 
age threshold for applicability.
+ * <p>
+ * Rules conditionally apply to data "older than" the rules threshold relative 
to the time of rule evaluation.
+ */
+public interface ReindexingRule
+{
+  /**
+   * Indicates how a rule applies to a given time interval based on the rule's 
period threshold.
+   * <ul>
+   * <li>PARTIAL: The rule applies to part of the interval.</li>
+   * <li>FULL: The rule applies to the entire interval.</li>
+   * <li>NONE: The rule does not apply to the interval at all.</li>
+   * </ul>
+   */
+  enum AppliesToMode
+  {
+    PARTIAL,
+    FULL,
+    NONE
+  }
+
+  String getId();
+
+  String getDescription();
+
+  Period getOlderThan();

Review Comment:
   nit: javadocs, at least for `getOlderThan`



##########
server/src/main/java/org/apache/druid/server/compaction/ReindexingRule.java:
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.server.compaction;
+
+import org.joda.time.DateTime;
+import org.joda.time.Interval;
+import org.joda.time.Period;
+
+import javax.annotation.Nullable;
+
+/**
+ * Defines a reindexing configuration that applies to data based on age 
thresholds.
+ * <p>
+ * Rules encapsulate specific aspects of reindexing (granularity, filters, 
tuning, etc.)
+ * and specify when they should apply via a {@link Period} which defines the 
age threshold for applicability.
+ * <p>
+ * Rules conditionally apply to data "older than" the rules threshold relative 
to the time of rule evaluation.
+ */
+public interface ReindexingRule

Review Comment:
   gut reaction here was is any reason for 'apply' to not live on this 
interface to do the decoration of the config builder directly? On further 
thought, I guess it doesn't provide a lot of benefit since we would still need 
to like, group them by rule type so that we can choose which one(s) of each 
type to apply, and also probably encode whether or not they are additive to 
decide how to apply each type



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to