[ 
https://issues.apache.org/jira/browse/GOBBLIN-1809?focusedWorklogId=856242&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-856242
 ]

ASF GitHub Bot logged work on GOBBLIN-1809:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 11/Apr/23 20:32
            Start Date: 11/Apr/23 20:32
    Worklog Time Spent: 10m 
      Work Description: homatthew commented on code in PR #3670:
URL: https://github.com/apache/gobblin/pull/3670#discussion_r1163256521


##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/version/finder/LookbackDateTimeDatasetVersionFinder.java:
##########
@@ -0,0 +1,98 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.gobblin.data.management.version.finder;
+
+import java.io.IOException;
+import java.util.Collection;
+import java.util.HashSet;
+import java.util.Set;
+
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.joda.time.DateTime;
+import org.joda.time.Duration;
+import org.joda.time.Instant;
+import org.joda.time.Period;
+import org.joda.time.format.PeriodFormatter;
+import org.joda.time.format.PeriodFormatterBuilder;
+
+import com.google.common.annotations.VisibleForTesting;
+import com.google.common.base.Preconditions;
+import com.typesafe.config.Config;
+
+import org.apache.gobblin.data.management.version.FileSystemDatasetVersion;
+import org.apache.gobblin.data.management.version.TimestampedDatasetVersion;
+import org.apache.gobblin.dataset.Dataset;
+import org.apache.gobblin.dataset.FileSystemDataset;
+import org.apache.gobblin.util.ConfigUtils;
+
+
+/**
+ * {@link DatasetVersionFinder} that constructs {@link TimestampedDatasetVersion}s without actually checking for existence
+ * of the version path. The version path is constructed by appending the version partition pattern to the dataset root.
+ * The versions are found by looking back a specific period of time and finding unique date partitions between that
+ * time and the current time.
+ */
+public class LookbackDateTimeDatasetVersionFinder extends DateTimeDatasetVersionFinder {
+  public static final String VERSION_PATH_PREFIX = "version.path.prefix";
+  public static final String VERSION_LOOKBACK_PERIOD = "version.lookback.period";
+
+  private final Duration stepDuration;
+  private final Period lookbackPeriod;
+  private final String pathPrefix;
+  private final Instant endTime;
+
+  public LookbackDateTimeDatasetVersionFinder(FileSystem fs, Config config) {
+    this(fs, config, Instant.now());
+  }
+
+  @VisibleForTesting
+  public LookbackDateTimeDatasetVersionFinder(FileSystem fs, Config config, Instant endTime) {
+    super(fs, config);
+    Preconditions.checkArgument(config.hasPath(VERSION_LOOKBACK_PERIOD) , "Missing required property " + VERSION_LOOKBACK_PERIOD);
+    PeriodFormatter periodFormatter =
+        new PeriodFormatterBuilder().appendYears().appendSuffix("y").appendMonths().appendSuffix("M").appendDays()
+            .appendSuffix("d").appendHours().appendSuffix("h").appendMinutes().appendSuffix("m").toFormatter();
+    this.stepDuration = Duration.standardMinutes(1);
+    this.pathPrefix = ConfigUtils.getString(config, VERSION_PATH_PREFIX, "");
+    this.lookbackPeriod = periodFormatter.parsePeriod(config.getString(VERSION_LOOKBACK_PERIOD));
+    this.endTime = endTime;
+  }
+
+  @Override
+  public Class<? extends FileSystemDatasetVersion> versionClass() {

Review Comment:
   We are extending `DateTimeDatasetVersionFinder`, which extends `AbstractDatasetVersionFinder<TimestampedDatasetVersion>` and already has the method
   
   ```
     @Override
     public Class<? extends FileSystemDatasetVersion> versionClass() {
       return TimestampedDatasetVersion.class;
     }
   ```
   
   Is this method still required for compiling?
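   
   A minimal sketch of why the override is probably redundant. The classes below are hypothetical stand-ins, not the real Gobblin types; they only mirror the inheritance chain described above, where the parent already returns the timestamped version class.
   
   ```java
   // Hypothetical stand-ins for the Gobblin version types; names are illustrative only.
   class BaseVersion {}
   class TimestampedVersion extends BaseVersion {}

   abstract class AbstractFinder {
     abstract Class<? extends BaseVersion> versionClass();
   }

   // Plays the role of the parent finder, which already implements versionClass().
   class DateTimeFinder extends AbstractFinder {
     @Override
     Class<? extends BaseVersion> versionClass() {
       return TimestampedVersion.class;
     }
   }

   // No versionClass() override here: the parent's implementation is inherited.
   class LookbackFinder extends DateTimeFinder {}

   class InheritedOverrideSketch {
     public static void main(String[] args) {
       // Prints "TimestampedVersion"
       System.out.println(new LookbackFinder().versionClass().getSimpleName());
     }
   }
   ```
   
   If the real parent method has the same signature and return value, dropping the override in the subclass should not change behavior.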



##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/version/finder/LookbackDateTimeDatasetVersionFinder.java:
##########
@@ -0,0 +1,98 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.gobblin.data.management.version.finder;
+
+import java.io.IOException;
+import java.util.Collection;
+import java.util.HashSet;
+import java.util.Set;
+
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.joda.time.DateTime;
+import org.joda.time.Duration;
+import org.joda.time.Instant;
+import org.joda.time.Period;
+import org.joda.time.format.PeriodFormatter;
+import org.joda.time.format.PeriodFormatterBuilder;
+
+import com.google.common.annotations.VisibleForTesting;
+import com.google.common.base.Preconditions;
+import com.typesafe.config.Config;
+
+import org.apache.gobblin.data.management.version.FileSystemDatasetVersion;
+import org.apache.gobblin.data.management.version.TimestampedDatasetVersion;
+import org.apache.gobblin.dataset.Dataset;
+import org.apache.gobblin.dataset.FileSystemDataset;
+import org.apache.gobblin.util.ConfigUtils;
+
+
+/**
+ * {@link DatasetVersionFinder} that constructs {@link TimestampedDatasetVersion}s without actually checking for existence
+ * of the version path. The version path is constructed by appending the version partition pattern to the dataset root.
+ * The versions are found by looking back a specific period of time and finding unique date partitions between that
+ * time and the current time.
+ */
+public class LookbackDateTimeDatasetVersionFinder extends DateTimeDatasetVersionFinder {
+  public static final String VERSION_PATH_PREFIX = "version.path.prefix";
+  public static final String VERSION_LOOKBACK_PERIOD = "version.lookback.period";
+
+  private final Duration stepDuration;
+  private final Period lookbackPeriod;
+  private final String pathPrefix;
+  private final Instant endTime;
+
+  public LookbackDateTimeDatasetVersionFinder(FileSystem fs, Config config) {
+    this(fs, config, Instant.now());
+  }
+
+  @VisibleForTesting
+  public LookbackDateTimeDatasetVersionFinder(FileSystem fs, Config config, Instant endTime) {
+    super(fs, config);
+    Preconditions.checkArgument(config.hasPath(VERSION_LOOKBACK_PERIOD) , "Missing required property " + VERSION_LOOKBACK_PERIOD);
+    PeriodFormatter periodFormatter =
+        new PeriodFormatterBuilder().appendYears().appendSuffix("y").appendMonths().appendSuffix("M").appendDays()
+            .appendSuffix("d").appendHours().appendSuffix("h").appendMinutes().appendSuffix("m").toFormatter();
+    this.stepDuration = Duration.standardMinutes(1);

Review Comment:
   Do we step through every minute because we support lookbacks down to the minute? And is that because the parent class supports minute granularity?



##########
gobblin-data-management/src/test/java/org/apache/gobblin/data/management/version/finder/LookbackDateTimeDatasetVersionFinderTest.java:
##########
@@ -0,0 +1,101 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.gobblin.data.management.version.finder;
+
+import java.util.Collection;
+import java.util.List;
+import java.util.Properties;
+import java.util.stream.Collectors;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.joda.time.DateTimeZone;
+import org.joda.time.Instant;
+import org.joda.time.format.DateTimeFormat;
+import org.joda.time.format.DateTimeFormatter;
+import org.testng.Assert;
+import org.testng.annotations.Test;
+
+import org.apache.gobblin.configuration.ConfigurationKeys;
+import org.apache.gobblin.data.management.version.TimestampedDatasetVersion;
+import org.apache.gobblin.dataset.Dataset;
+import org.apache.gobblin.dataset.FileSystemDataset;
+import org.apache.gobblin.util.ConfigUtils;
+
+
+@Test(groups = { "gobblin.data.management.version" })
+public class LookbackDateTimeDatasetVersionFinderTest {
+
+  private FileSystem fs;
+  private DateTimeFormatter formatter = DateTimeFormat.forPattern("yyyy/MM/dd/HH").withZone(DateTimeZone.forID(ConfigurationKeys.PST_TIMEZONE_NAME));
+  private final Instant fixedTime = Instant.parse("2023-04-10T17:30:00.000-07:00");

Review Comment:
   Nit: 4-10-2023 is a bit of an arbitrary date for future readers and could be jarring. The start of a month or year makes more sense to me, but it's NBD either way.



##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/version/finder/LookbackDateTimeDatasetVersionFinder.java:
##########
@@ -0,0 +1,98 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.gobblin.data.management.version.finder;
+
+import java.io.IOException;
+import java.util.Collection;
+import java.util.HashSet;
+import java.util.Set;
+
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.joda.time.DateTime;
+import org.joda.time.Duration;
+import org.joda.time.Instant;
+import org.joda.time.Period;
+import org.joda.time.format.PeriodFormatter;
+import org.joda.time.format.PeriodFormatterBuilder;
+
+import com.google.common.annotations.VisibleForTesting;
+import com.google.common.base.Preconditions;
+import com.typesafe.config.Config;
+
+import org.apache.gobblin.data.management.version.FileSystemDatasetVersion;
+import org.apache.gobblin.data.management.version.TimestampedDatasetVersion;
+import org.apache.gobblin.dataset.Dataset;
+import org.apache.gobblin.dataset.FileSystemDataset;
+import org.apache.gobblin.util.ConfigUtils;
+
+
+/**
+ * {@link DatasetVersionFinder} that constructs {@link TimestampedDatasetVersion}s without actually checking for existence
+ * of the version path. The version path is constructed by appending the version partition pattern to the dataset root.
+ * The versions are found by looking back a specific period of time and finding unique date partitions between that
+ * time and the current time.
+ */
+public class LookbackDateTimeDatasetVersionFinder extends DateTimeDatasetVersionFinder {
+  public static final String VERSION_PATH_PREFIX = "version.path.prefix";
+  public static final String VERSION_LOOKBACK_PERIOD = "version.lookback.period";
+
+  private final Duration stepDuration;
+  private final Period lookbackPeriod;
+  private final String pathPrefix;
+  private final Instant endTime;
+
+  public LookbackDateTimeDatasetVersionFinder(FileSystem fs, Config config) {
+    this(fs, config, Instant.now());
+  }
+
+  @VisibleForTesting
+  public LookbackDateTimeDatasetVersionFinder(FileSystem fs, Config config, Instant endTime) {
+    super(fs, config);
+    Preconditions.checkArgument(config.hasPath(VERSION_LOOKBACK_PERIOD) , "Missing required property " + VERSION_LOOKBACK_PERIOD);
+    PeriodFormatter periodFormatter =
+        new PeriodFormatterBuilder().appendYears().appendSuffix("y").appendMonths().appendSuffix("M").appendDays()
+            .appendSuffix("d").appendHours().appendSuffix("h").appendMinutes().appendSuffix("m").toFormatter();
+    this.stepDuration = Duration.standardMinutes(1);
+    this.pathPrefix = ConfigUtils.getString(config, VERSION_PATH_PREFIX, "");
+    this.lookbackPeriod = periodFormatter.parsePeriod(config.getString(VERSION_LOOKBACK_PERIOD));
+    this.endTime = endTime;
+  }
+
+  @Override
+  public Class<? extends FileSystemDatasetVersion> versionClass() {
+    return TimestampedDatasetVersion.class;
+  }
+
+  @Override
+  public Collection<TimestampedDatasetVersion> findDatasetVersions(Dataset dataset) throws IOException {

Review Comment:
   Who is consuming these dataset versions? How do they handle paths that do not exist?



##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/retention/DatasetCleaner.java:
##########
@@ -176,7 +176,7 @@ public void onSuccess(Void arg0) {
   @Override
   public void close() throws IOException {
     try {
-      if (this.finishCleanSignal.isPresent()) {
+      if (this.finishCleanSignal != null && this.finishCleanSignal.isPresent()) {

Review Comment:
   The null case happens when we close without ever having cleaned, right?
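   
   A small sketch of the scenario being asked about, assuming the signal field is only assigned inside the clean step. `CleanerSketch`, `finishSignal`, and `signalsFired` are hypothetical names, and `java.util.Optional` stands in for whatever Optional type the real class uses.
   
   ```java
   import java.util.Optional;

   // Hypothetical cleaner: the signal field is assigned only inside clean().
   class CleanerSketch implements AutoCloseable {
     private Optional<Runnable> finishSignal; // stays null until clean() runs
     int signalsFired = 0;

     void clean() {
       this.finishSignal = Optional.of(() -> signalsFired++);
     }

     @Override
     public void close() {
       // Without the null check, close() before clean() would throw a NullPointerException.
       if (this.finishSignal != null && this.finishSignal.isPresent()) {
         this.finishSignal.get().run();
       }
     }

     public static void main(String[] args) {
       CleanerSketch neverCleaned = new CleanerSketch();
       neverCleaned.close(); // safe: the field is still null

       CleanerSketch cleaned = new CleanerSketch();
       cleaned.clean();
       cleaned.close(); // fires the signal once

       // Prints "0 1"
       System.out.println(neverCleaned.signalsFired + " " + cleaned.signalsFired);
     }
   }
   ```
   
   An alternative would be to initialize the field to an empty Optional at declaration, which would make the null check unnecessary.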



##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/version/finder/LookbackDateTimeDatasetVersionFinder.java:
##########
@@ -0,0 +1,98 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.gobblin.data.management.version.finder;
+
+import java.io.IOException;
+import java.util.Collection;
+import java.util.HashSet;
+import java.util.Set;
+
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.joda.time.DateTime;
+import org.joda.time.Duration;
+import org.joda.time.Instant;
+import org.joda.time.Period;
+import org.joda.time.format.PeriodFormatter;
+import org.joda.time.format.PeriodFormatterBuilder;
+
+import com.google.common.annotations.VisibleForTesting;
+import com.google.common.base.Preconditions;
+import com.typesafe.config.Config;
+
+import org.apache.gobblin.data.management.version.FileSystemDatasetVersion;
+import org.apache.gobblin.data.management.version.TimestampedDatasetVersion;
+import org.apache.gobblin.dataset.Dataset;
+import org.apache.gobblin.dataset.FileSystemDataset;
+import org.apache.gobblin.util.ConfigUtils;
+
+
+/**
+ * {@link DatasetVersionFinder} that constructs {@link TimestampedDatasetVersion}s without actually checking for existence
+ * of the version path. The version path is constructed by appending the version partition pattern to the dataset root.
+ * The versions are found by looking back a specific period of time and finding unique date partitions between that
+ * time and the current time.
+ */
+public class LookbackDateTimeDatasetVersionFinder extends DateTimeDatasetVersionFinder {
+  public static final String VERSION_PATH_PREFIX = "version.path.prefix";
+  public static final String VERSION_LOOKBACK_PERIOD = "version.lookback.period";
+
+  private final Duration stepDuration;
+  private final Period lookbackPeriod;
+  private final String pathPrefix;
+  private final Instant endTime;
+
+  public LookbackDateTimeDatasetVersionFinder(FileSystem fs, Config config) {
+    this(fs, config, Instant.now());
+  }
+
+  @VisibleForTesting
+  public LookbackDateTimeDatasetVersionFinder(FileSystem fs, Config config, Instant endTime) {
+    super(fs, config);
+    Preconditions.checkArgument(config.hasPath(VERSION_LOOKBACK_PERIOD) , "Missing required property " + VERSION_LOOKBACK_PERIOD);
+    PeriodFormatter periodFormatter =
+        new PeriodFormatterBuilder().appendYears().appendSuffix("y").appendMonths().appendSuffix("M").appendDays()
+            .appendSuffix("d").appendHours().appendSuffix("h").appendMinutes().appendSuffix("m").toFormatter();
+    this.stepDuration = Duration.standardMinutes(1);

Review Comment:
   It's a bit wasteful, but since we dedupe using a set, it should be pretty harmless.
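   
   A rough sketch of the step-and-dedupe idea, not the PR's actual implementation. It uses `java.time` instead of Joda-Time so it runs on the JDK alone; `LookbackStepSketch`, `partitions`, the UTC zone, and the inclusive end bound are all assumptions made for illustration.
   
   ```java
   import java.time.Duration;
   import java.time.Instant;
   import java.time.ZoneOffset;
   import java.time.format.DateTimeFormatter;
   import java.util.LinkedHashSet;
   import java.util.Set;

   class LookbackStepSketch {
     // Walks [end - lookback, end] one minute at a time and keeps each
     // distinct formatted partition string once, in encounter order.
     static Set<String> partitions(Instant end, Duration lookback, String pattern) {
       DateTimeFormatter formatter = DateTimeFormatter.ofPattern(pattern).withZone(ZoneOffset.UTC);
       Set<String> seen = new LinkedHashSet<>();
       for (Instant t = end.minus(lookback); !t.isAfter(end); t = t.plus(Duration.ofMinutes(1))) {
         seen.add(formatter.format(t)); // duplicates within the same hour are dropped by the set
       }
       return seen;
     }

     public static void main(String[] args) {
       Instant end = Instant.parse("2023-01-01T02:30:00Z");
       // 121 one-minute steps collapse to three hourly partitions.
       // Prints "[2023/01/01/00, 2023/01/01/01, 2023/01/01/02]"
       System.out.println(partitions(end, Duration.ofHours(2), "yyyy/MM/dd/HH"));
     }
   }
   ```
   
   The cost is one format call per minute of lookback, so a long lookback with a coarse pattern does many redundant iterations, but the resulting set stays small.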





Issue Time Tracking
-------------------

    Worklog Id:     (was: 856242)
    Time Spent: 40m  (was: 0.5h)

> Add new lookback version finder for use with iceberg retention
> --------------------------------------------------------------
>
>                 Key: GOBBLIN-1809
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1809
>             Project: Apache Gobblin
>          Issue Type: Improvement
>            Reporter: Jack Moseley
>            Priority: Major
>          Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
