clintropolis commented on code in PR #18939:
URL: https://github.com/apache/druid/pull/18939#discussion_r2795858453
##########
indexing-service/src/main/java/org/apache/druid/indexing/common/task/NativeCompactionRunner.java:
##########
@@ -89,6 +89,14 @@ public CompactionConfigValidationResult
validateCompactionTask(
Map<Interval, DataSchema> intervalDataSchemaMap
)
{
+ // Virtual columns in filter rules are not supported by native compaction
+ if (compactionTask.getTransformSpec() != null
+ && compactionTask.getTransformSpec().getVirtualColumns() != null
+ &&
compactionTask.getTransformSpec().getVirtualColumns().getVirtualColumns().length
> 0) {
Review Comment:
could also consider using `isEmpty()`
##########
processing/src/main/java/org/apache/druid/segment/transform/CompactionTransformSpec.java:
##########
@@ -46,17 +47,20 @@ public static CompactionTransformSpec of(@Nullable
TransformSpec transformSpec)
return null;
}
- return new CompactionTransformSpec(transformSpec.getFilter());
+ return new CompactionTransformSpec(transformSpec.getFilter(), null);
}
@Nullable private final DimFilter filter;
+ @Nullable private final VirtualColumns virtualColumns;
@JsonCreator
public CompactionTransformSpec(
- @JsonProperty("filter") final DimFilter filter
+ @JsonProperty("filter") final DimFilter filter,
+ @JsonProperty("virtualColumns") final VirtualColumns virtualColumns
Review Comment:
this feels like a weird place for this to me but maybe i keep getting hung
up on this just because of how stuff is laid out internally where transform
spec filters and transforms happen at a layer below where virtual columns
operate.
i think really my main beef is really with DataSourceCompactionConfig, i'd
like to change it eventually with something that looks more like the v10
layout, where the base table schema is also defined as a projection, so that
base table schema would be where the filter and virtual column live in that
world and the stuff would be the set of projections to build and then the
context/segment granularity/ioconfig/tuningconfig.
All that said, this is fine for now i think...
##########
server/src/main/java/org/apache/druid/server/compaction/InlineReindexingRuleProvider.java:
##########
@@ -0,0 +1,421 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.server.compaction;
+
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import org.apache.druid.common.config.Configs;
+import org.joda.time.DateTime;
+import org.joda.time.Interval;
+
+import javax.annotation.Nullable;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.Comparator;
+import java.util.List;
+import java.util.Objects;
+
+/**
+ * Rule provider that returns a static list of rules defined inline in the
configuration.
+ * <p>
+ * This is the simplest provider implementation, suitable for testing and use
cases where the number of rules is
+ * relatively small and can be defined directly in the supervisor spec.
+ * <p>
+ * When filtering rules by interval, this provider only returns rules where
{@link ReindexingRule#appliesTo(Interval, DateTime)}
+ * returns {@link ReindexingRule.AppliesToMode#FULL}. Rules with partial or no
overlap are excluded.
+ * <p>
+ * For non-additive rule types, when multiple rules fully match an interval,
only the rule with the oldest threshold
+ * (largest period) is returned. For example, if both a P30D and P90D
granularity rule match an interval, the P90D
+ * rule is selected because it has the oldest threshold (now - 90 days is
older than now - 30 days).
+ * <p>
+ * Example usage:
+ * <pre>{@code
+ * {
+ * "type": "inline",
+ * "reindexingDeletionRules": [
+ * {
+ * "id": "remove-bots-90d",
+ * "olderThan": "P90D",
+ * "deleteWhere": {
+ * "type": "not",
+ * "field": {
+ * "type": "equals",
+ * "column": "is_bot",
+ * "matchValueType": "STRING"
+ * "matchValue": "true"
+ * }
+ * },
+ * "description": "Remove bot traffic from segments older than 90 days"
+ * },
+ * {
+ * "id": "remove-low-priority-180d",
+ * "olderThan": "P180D",
+ * "deleteWhere": {
+ * "type": "not",
+ * "field": {
+ * {
+ * "type": "inType",
+ * "column": "priority",
+ * "matchValueType": "STRING",
+ * "sortedValues": ["low", "spam"]
+ * }
+ * }
+ * },
+ * "description": "Remove low-priority data from segments older than 180
days"
+ * }
+ * ]
+ * }
+ * }</pre>
+ */
+public class InlineReindexingRuleProvider implements ReindexingRuleProvider
+{
+ public static final String TYPE = "inline";
+
+ private final List<ReindexingDeletionRule> reindexingDeletionRules;
+ private final List<ReindexingMetricsRule> reindexingMetricsRules;
+ private final List<ReindexingDimensionsRule> reindexingDimensionsRules;
+ private final List<ReindexingIOConfigRule> reindexingIOConfigRules;
+ private final List<ReindexingProjectionRule> reindexingProjectionRules;
+ private final List<ReindexingSegmentGranularityRule>
reindexingSegmentGranularityRules;
+ private final List<ReindexingQueryGranularityRule>
reindexingQueryGranularityRules;
+ private final List<ReindexingTuningConfigRule> reindexingTuningConfigRules;
+
+
+ @JsonCreator
+ public InlineReindexingRuleProvider(
+ @JsonProperty("reindexingDeletionRules") @Nullable
List<ReindexingDeletionRule> reindexingDeletionRules,
+ @JsonProperty("reindexingMetricsRules") @Nullable
List<ReindexingMetricsRule> reindexingMetricsRules,
+ @JsonProperty("reindexingDimensionsRules") @Nullable
List<ReindexingDimensionsRule> reindexingDimensionsRules,
+ @JsonProperty("reindexingIOConfigRules") @Nullable
List<ReindexingIOConfigRule> reindexingIOConfigRules,
+ @JsonProperty("reindexingProjectionRules") @Nullable
List<ReindexingProjectionRule> reindexingProjectionRules,
+ @JsonProperty("reindexingSegmentGranularityRules") @Nullable
List<ReindexingSegmentGranularityRule> reindexingSegmentGranularityRules,
+ @JsonProperty("reindexingQueryGranularityRules") @Nullable
List<ReindexingQueryGranularityRule> reindexingQueryGranularityRules,
+ @JsonProperty("reindexingTuningConfigRules") @Nullable
List<ReindexingTuningConfigRule> reindexingTuningConfigRules
+ )
Review Comment:
also, should we just drop the `reindexing` prefix from all these names?
##########
server/src/main/java/org/apache/druid/server/compaction/InlineReindexingRuleProvider.java:
##########
@@ -0,0 +1,421 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.server.compaction;
+
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import org.apache.druid.common.config.Configs;
+import org.joda.time.DateTime;
+import org.joda.time.Interval;
+
+import javax.annotation.Nullable;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.Comparator;
+import java.util.List;
+import java.util.Objects;
+
+/**
+ * Rule provider that returns a static list of rules defined inline in the
configuration.
+ * <p>
+ * This is the simplest provider implementation, suitable for testing and use
cases where the number of rules is
+ * relatively small and can be defined directly in the supervisor spec.
+ * <p>
+ * When filtering rules by interval, this provider only returns rules where
{@link ReindexingRule#appliesTo(Interval, DateTime)}
+ * returns {@link ReindexingRule.AppliesToMode#FULL}. Rules with partial or no
overlap are excluded.
+ * <p>
+ * For non-additive rule types, when multiple rules fully match an interval,
only the rule with the oldest threshold
+ * (largest period) is returned. For example, if both a P30D and P90D
granularity rule match an interval, the P90D
+ * rule is selected because it has the oldest threshold (now - 90 days is
older than now - 30 days).
+ * <p>
+ * Example usage:
+ * <pre>{@code
+ * {
+ * "type": "inline",
+ * "reindexingDeletionRules": [
+ * {
+ * "id": "remove-bots-90d",
+ * "olderThan": "P90D",
+ * "deleteWhere": {
+ * "type": "not",
+ * "field": {
+ * "type": "equals",
+ * "column": "is_bot",
+ * "matchValueType": "STRING"
+ * "matchValue": "true"
+ * }
+ * },
+ * "description": "Remove bot traffic from segments older than 90 days"
+ * },
+ * {
+ * "id": "remove-low-priority-180d",
+ * "olderThan": "P180D",
+ * "deleteWhere": {
+ * "type": "not",
+ * "field": {
+ * {
+ * "type": "inType",
+ * "column": "priority",
+ * "matchValueType": "STRING",
+ * "sortedValues": ["low", "spam"]
+ * }
+ * }
+ * },
+ * "description": "Remove low-priority data from segments older than 180
days"
+ * }
+ * ]
+ * }
+ * }</pre>
+ */
+public class InlineReindexingRuleProvider implements ReindexingRuleProvider
+{
+ public static final String TYPE = "inline";
+
+ private final List<ReindexingDeletionRule> reindexingDeletionRules;
+ private final List<ReindexingMetricsRule> reindexingMetricsRules;
+ private final List<ReindexingDimensionsRule> reindexingDimensionsRules;
+ private final List<ReindexingIOConfigRule> reindexingIOConfigRules;
+ private final List<ReindexingProjectionRule> reindexingProjectionRules;
+ private final List<ReindexingSegmentGranularityRule>
reindexingSegmentGranularityRules;
+ private final List<ReindexingQueryGranularityRule>
reindexingQueryGranularityRules;
+ private final List<ReindexingTuningConfigRule> reindexingTuningConfigRules;
+
+
+ @JsonCreator
+ public InlineReindexingRuleProvider(
+ @JsonProperty("reindexingDeletionRules") @Nullable
List<ReindexingDeletionRule> reindexingDeletionRules,
+ @JsonProperty("reindexingMetricsRules") @Nullable
List<ReindexingMetricsRule> reindexingMetricsRules,
+ @JsonProperty("reindexingDimensionsRules") @Nullable
List<ReindexingDimensionsRule> reindexingDimensionsRules,
+ @JsonProperty("reindexingIOConfigRules") @Nullable
List<ReindexingIOConfigRule> reindexingIOConfigRules,
+ @JsonProperty("reindexingProjectionRules") @Nullable
List<ReindexingProjectionRule> reindexingProjectionRules,
+ @JsonProperty("reindexingSegmentGranularityRules") @Nullable
List<ReindexingSegmentGranularityRule> reindexingSegmentGranularityRules,
+ @JsonProperty("reindexingQueryGranularityRules") @Nullable
List<ReindexingQueryGranularityRule> reindexingQueryGranularityRules,
+ @JsonProperty("reindexingTuningConfigRules") @Nullable
List<ReindexingTuningConfigRule> reindexingTuningConfigRules
Review Comment:
while very flexible, i think this might be a bit overly expressive to the
point that things might be hard to reason about. How about consolidating
`ReindexingDimensionsRule`, `ReindexingMetricsRule`,
`ReindexingProjectionRule`, and `ReindexingQueryGranularityRule` into a single
`ReindexingSchemaRule` that has the option to set all of these things in one
shot. I think that will make it a lot easier for operators to keep on top of
what is actually going to happen during reindexing, and still should be
adequately expressive.
##########
indexing-service/src/main/java/org/apache/druid/indexing/compact/ReindexingDeletionRuleOptimizer.java:
##########
@@ -0,0 +1,190 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.indexing.compact;
+
+import org.apache.druid.java.util.common.logger.Logger;
+import org.apache.druid.query.filter.DimFilter;
+import org.apache.druid.query.filter.NotDimFilter;
+import org.apache.druid.query.filter.OrDimFilter;
+import org.apache.druid.segment.VirtualColumn;
+import org.apache.druid.segment.VirtualColumns;
+import org.apache.druid.segment.metadata.IndexingStateFingerprintMapper;
+import org.apache.druid.server.compaction.CompactionCandidate;
+import org.apache.druid.server.compaction.ReindexingDeletionRule;
+import org.apache.druid.timeline.CompactionState;
+import org.apache.druid.timeline.DataSegment;
+
+import javax.annotation.Nullable;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+/**
+ * Optimization utilities for applying {@link ReindexingDeletionRule}s during
reindexing
+ * <p>
+ * When reindexing with {@link ReindexingDeletionRule}s, it is possible that
candidate
+ * segments have already applied some or all of the deletion rules in previous
reindexing runs. Reapplying such rules would
+ * be wasteful and redundant. This class provides funcionality to optimize the
set of rules to be applied by
+ * any given reindexing task.
+ */
+public class ReindexingDeletionRuleOptimizer
Review Comment:
the naming of this makes it more sound like a 'thing' instead of utility
methods. I wonder should this just be reworked into an implementation of
`ReindexingConfigFinalizer` instead of a separate class?
##########
indexing-service/src/main/java/org/apache/druid/indexing/compact/ReindexingConfigFinalizer.java:
##########
@@ -0,0 +1,55 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.indexing.compact;
+
+import org.apache.druid.server.compaction.CompactionCandidate;
+import org.apache.druid.server.coordinator.DataSourceCompactionConfig;
+
+/**
+ * Functional interface for customizing a {@link DataSourceCompactionConfig}
for a specific
+ * {@link CompactionCandidate} before creating a reindexing job. This allows
template-specific
+ * logic to be injected without hardcoding behavior in {@link
CompactionConfigBasedJobTemplate}.
+ * <p>
+ * For example, cascading reindexing templates can use this to optimize filter
rules based on
+ * the candidate's indexing state, while simpler templates can use the
identity finalizer.
+ */
+@FunctionalInterface
+public interface ReindexingConfigFinalizer
Review Comment:
(naming is the worst) i wonder if 'specialize' is a better way to refer to
this than 'finalizer'.
Also, I can't help but wonder if this should be part of the
`ReindexingRuleProvider` since it also handles doing transforms to compaction
configs, and it would allow implementations the ability to do per candidate
transforms that are beyond what is built-in to the cascading template.
That said, if we think that we will always want to do the same set of rules
for cascading template, this is probably fine too
##########
server/src/main/java/org/apache/druid/server/compaction/ReindexingConfigBuilder.java:
##########
@@ -0,0 +1,205 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.server.compaction;
+
+import org.apache.druid.java.util.common.granularity.Granularity;
+import org.apache.druid.java.util.common.logger.Logger;
+import org.apache.druid.query.filter.DimFilter;
+import org.apache.druid.query.filter.NotDimFilter;
+import org.apache.druid.query.filter.OrDimFilter;
+import org.apache.druid.segment.VirtualColumn;
+import org.apache.druid.segment.VirtualColumns;
+import org.apache.druid.segment.transform.CompactionTransformSpec;
+import
org.apache.druid.server.coordinator.InlineSchemaDataSourceCompactionConfig;
+import org.apache.druid.server.coordinator.UserCompactionTaskGranularityConfig;
+import org.joda.time.DateTime;
+import org.joda.time.Interval;
+
+import javax.annotation.Nullable;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.function.Consumer;
+import java.util.function.Function;
+
+/**
+ * Builds compaction configs by applying reindexing rules.
+ * Encapsulates the logic for combining additive rules and applying all rule
types.
+ */
+public class ReindexingConfigBuilder
Review Comment:
I sort of wonder if we need this, or if it should just be built into a apply
method of ReindexingRuleProvider so that it can fully decide how to transform
the InlineSchemaDataSourceCompactionConfig.Builder. Then again, maybe it is
fine if we imagine that all rule providers will always do the transform in the
same way?
##########
server/src/main/java/org/apache/druid/server/compaction/ReindexingRule.java:
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.server.compaction;
+
+import org.joda.time.DateTime;
+import org.joda.time.Interval;
+import org.joda.time.Period;
+
+import javax.annotation.Nullable;
+
+/**
+ * Defines a reindexing configuration that applies to data based on age
thresholds.
+ * <p>
+ * Rules encapsulate specific aspects of reindexing (granularity, filters,
tuning, etc.)
+ * and specify when they should apply via a {@link Period} which defines the
age threshold for applicability.
+ * <p>
+ * Rules conditionally apply to data "older than" the rules threshold relative
to the time of rule evaluation.
+ */
+public interface ReindexingRule
+{
+ /**
+ * Indicates how a rule applies to a given time interval based on the rule's
period threshold.
+ * <ul>
+ * <li>PARTIAL: The rule applies to part of the interval.</li>
+ * <li>FULL: The rule applies to the entire interval.</li>
+ * <li>NONE: The rule does not apply to the interval at all.</li>
+ * </ul>
+ */
+ enum AppliesToMode
+ {
+ PARTIAL,
+ FULL,
+ NONE
+ }
+
+ String getId();
+
+ String getDescription();
+
+ Period getOlderThan();
Review Comment:
nit: javadocs, at least for `getOlderThan`
##########
server/src/main/java/org/apache/druid/server/compaction/ReindexingRule.java:
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.server.compaction;
+
+import org.joda.time.DateTime;
+import org.joda.time.Interval;
+import org.joda.time.Period;
+
+import javax.annotation.Nullable;
+
+/**
+ * Defines a reindexing configuration that applies to data based on age
thresholds.
+ * <p>
+ * Rules encapsulate specific aspects of reindexing (granularity, filters,
tuning, etc.)
+ * and specify when they should apply via a {@link Period} which defines the
age threshold for applicability.
+ * <p>
+ * Rules conditionally apply to data "older than" the rules threshold relative
to the time of rule evaluation.
+ */
+public interface ReindexingRule
Review Comment:
gut reaction here was is any reason for 'apply' to not live on this
interface to do the decoration of the config builder directly? On further
thought, I guess it doesn't provide a lot of benefit since we would still need
to like, group them by rule type so that we can choose which one(s) of each
type to apply, and also probably encode whether or not they are additive to
decide how to apply each type
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]