phet commented on code in PR #4044:
URL: https://github.com/apache/gobblin/pull/4044#discussion_r1776029345


##########
gobblin-core/src/main/java/org/apache/gobblin/source/DatePartitionedNestedRetriever.java:
##########
@@ -129,8 +134,15 @@ public List<FileInfo> getFilesToProcess(long minWatermark, 
int maxFilesToReturn)
               new FileInfo(fileStatus.getPath().toString(), 
fileStatus.getLen(), date.getMillis(), partitionPath));
         }
       }
+
+      if (growthTracker.isAnotherMilestone(iteration++)) {
+        LOG.info("~{}~ collected files to process", filesToProcess.size());

Review Comment:
   wouldn't this log the same number of files every time there's a milestone?  
I believe you intend to log how many files were processed thus far when 
reaching each milestone... correct?
   
   ideally you might put some description in the first substitution (`~{}~`) to 
contextualize the processing, and add the size *in addition*.
   
   also, collapse adjacent logging calls, rather than keeping separate.  e.g.
   ```
   LOG.info("~{}~ collected {} (of {}) files to process; most-recent source 
path: '{}'",
      description, iteration, files.size(), sourcePath);
   ```



##########
gobblin-core/src/main/java/org/apache/gobblin/source/DatePartitionedNestedRetriever.java:
##########
@@ -129,8 +134,15 @@ public List<FileInfo> getFilesToProcess(long minWatermark, 
int maxFilesToReturn)
               new FileInfo(fileStatus.getPath().toString(), 
fileStatus.getLen(), date.getMillis(), partitionPath));
         }
       }
+
+      if (growthTracker.isAnotherMilestone(iteration++)) {
+        LOG.info("~{}~ collected files to process", filesToProcess.size());
+        LOG.info("Last Source Path processed : ~{}~", sourcePath);
+      }
     }
 
+    LOG.info("Finished processing files");

Review Comment:
   same here.  also, let's provide the count - `filesToProcess.size()`



##########
gobblin-api/src/main/java/org/apache/gobblin/configuration/ConfigurationKeys.java:
##########
@@ -347,6 +347,11 @@ public class ConfigurationKeys {
    */
   public static final String WATERMARK_INTERVAL_VALUE_KEY = 
"watermark.interval.value";
 
+  /**
+   * DEFAULT LOOKBACK TIME KEY property
+   */
+  public static final String DEFAULT_COPY_LOOKBACK_TIME_KEY = 
"copy.lookbackTime";

Review Comment:
   not 100% clear here...
   
   is the issue that:
   * the same prop name is already in use elsewhere so we should avoid 
conflicting? 
   * that a config w/ similar semantics already exists, and we might 
intentionally borrow it here for uniformity?
   * or something else...?



##########
gobblin-core/src/main/java/org/apache/gobblin/source/PartitionAwareFileRetrieverUtils.java:
##########
@@ -52,4 +60,29 @@ public static Duration getLeadTimeDurationFromConfig(State 
state) {
 
     return new Duration(leadTime * leadTimeGranularity.getUnitMilliseconds());
   }
+
+  /**
+   * Calculates the lookback time duration based on the provided lookback time 
string.
+   *
+   * @param lookBackTime the lookback time string, which should include a 
numeric value followed by a time unit character.
+   *                     For example, "5d" for 5 days or "10h" for 10 hours.
+   * @return an {@link Optional} containing the {@link Duration} if the 
lookback time is valid, or
+   *         an empty {@link Optional} if the lookback time is invalid or 
cannot be parsed.
+   */
+  public static Optional<Duration> getLookbackTimeDuration(String 
lookBackTime) {

Review Comment:
   although I am a big proponent of `Optional`, in this case it loses the 
specific errors, putting them in the log rather than "returning" them to the 
caller, where they might be presented in a meaningful end-user error message 
(rather than logs, which are for developers).
   
   since string parsing like this regularly requires error handling, it seems 
reasonable and basically expected to throw an exception.  you can avow in 
javadoc that the `Duration` return value is never `null`



##########
gobblin-core/src/main/java/org/apache/gobblin/source/PartitionedFileSourceBase.java:
##########
@@ -370,4 +394,11 @@ private long getLowWaterMark(Iterable<WorkUnitState> 
previousStates, String lowW
   protected PartitionAwareFileRetriever getRetriever() {
     return retriever;
   }
+
+  private String getLookBackTimeProp(SourceState state) {

Review Comment:
   could be `static`



##########
gobblin-core/src/main/java/org/apache/gobblin/source/DatePartitionedNestedRetriever.java:
##########
@@ -114,6 +115,10 @@ public List<FileInfo> getFilesToProcess(long minWatermark, 
int maxFilesToReturn)
       throw new IOException("Error initializing FileSystem", e);
     }
 
+    GrowthMilestoneTracker growthTracker = new GrowthMilestoneTracker();
+    Long iteration = 0L;
+    LOG.info("Starting processing files from {} to {}", lowWaterMarkDate, 
currentDay);

Review Comment:
   please add context about which partition of which dataset



##########
gobblin-core/src/main/java/org/apache/gobblin/source/PartitionAwareFileRetrieverUtils.java:
##########
@@ -32,6 +39,7 @@
  * objects.
  */
 public class PartitionAwareFileRetrieverUtils {
+  private static final Logger LOG = 
LoggerFactory.getLogger(PartitionAwareFileRetrieverUtils.class);

Review Comment:
   can we use the lombok annotation `@Slf4j`?



##########
gobblin-core/src/main/java/org/apache/gobblin/source/PartitionAwareFileRetrieverUtils.java:
##########
@@ -52,4 +60,29 @@ public static Duration getLeadTimeDurationFromConfig(State 
state) {
 
     return new Duration(leadTime * leadTimeGranularity.getUnitMilliseconds());
   }
+
+  /**
+   * Calculates the lookback time duration based on the provided lookback time 
string.
+   *
+   * @param lookBackTime the lookback time string, which should include a 
numeric value followed by a time unit character.
+   *                     For example, "5d" for 5 days or "10h" for 10 hours.

Review Comment:
   I like the flexibility here (often we've asked for two props--one for the 
number and the other for the time unit).
   
   in any event, please link to the spec that documents the suffixes and what 
they mean (e.g. 'd', 'h', 'M', etc.)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@gobblin.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to