[https://issues.apache.org/jira/browse/GOBBLIN-2026?focusedWorklogId=913139&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-913139]
ASF GitHub Bot logged work on GOBBLIN-2026:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 05/Apr/24 05:12
Start Date: 05/Apr/24 05:12
Worklog Time Spent: 10m
Work Description: arpit09 commented on code in PR #3913:
URL: https://github.com/apache/gobblin/pull/3913#discussion_r1552932613
##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/retention/DatasetCleaner.java:
##########
@@ -155,6 +155,9 @@ public Void call() throws Exception {
public void onFailure(Throwable throwable) {
DatasetCleaner.this.finishCleanSignal.get().countDown();
LOG.warn("Exception caught when cleaning " + dataset.datasetURN() +
".", throwable);
+ if (throwable instanceof OutOfMemoryError) {
Review Comment:
The actual issue was a race condition with the CountDownLatch, which has now
been fixed. The RCA has been added to the description.
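To illustrate the race being discussed: if the callback counts down the latch before the failure is recorded, the thread awaiting the latch can observe "all cleanups finished" while the fatal error is still invisible to it. A minimal sketch of the safe ordering, using illustrative names (`CleanupCallback`, `getFatalError`) that are not the actual Gobblin code:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: record any fatal Error *before* releasing the latch,
// so the awaiting thread can never see the latch at zero while the error
// is still unrecorded.
public class CleanupCallback {
    private final CountDownLatch finishCleanSignal;
    private final AtomicReference<Throwable> fatalError = new AtomicReference<>();

    public CleanupCallback(CountDownLatch finishCleanSignal) {
        this.finishCleanSignal = finishCleanSignal;
    }

    public void onFailure(Throwable throwable) {
        if (throwable instanceof OutOfMemoryError) {
            // Keep only the first fatal error seen.
            fatalError.compareAndSet(null, throwable);
        }
        // Release the latch only after the error has been recorded.
        finishCleanSignal.countDown();
    }

    public Throwable getFatalError() {
        return fatalError.get();
    }
}
```

`CountDownLatch.await` establishes a happens-before relationship with `countDown`, so writes made before `countDown` (here, the `AtomicReference` update) are visible to the awaiting thread.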
Issue Time Tracking
-------------------
Worklog Id: (was: 913139)
Time Spent: 1h 50m (was: 1h 40m)
> Retention Job should fail on OOM
> --------------------------------
>
> Key: GOBBLIN-2026
> URL: https://issues.apache.org/jira/browse/GOBBLIN-2026
> Project: Apache Gobblin
> Issue Type: Improvement
> Components: misc
> Reporter: Arpit Varshney
> Priority: Major
> Time Spent: 1h 50m
> Remaining Estimate: 0h
>
> Currently, while cleaning log files, the Retention job runs out of memory and
> silently fails when the number of log files is too large. Even after this
> failure, the workflow execution reports Success.
> {code:java}
> 21-03-2024 01:00:03 PDT jobs-kafkaetl-gobblin-streaming-logs-cleaner INFO -
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>     at java.util.Arrays.copyOf(Arrays.java:3332)
>     at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
>     at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
>     at java.lang.StringBuffer.append(StringBuffer.java:270)
>     at java.net.URI.appendSchemeSpecificPart(URI.java:1911)
>     at java.net.URI.toString(URI.java:1941)
>     at java.net.URI.<init>(URI.java:742)
>     at org.apache.hadoop.fs.Path.makeQualified(Path.java:562)
>     at org.apache.hadoop.hdfs.protocol.HdfsFileStatus.makeQualified(HdfsFileStatus.java:271)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:997)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:121)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1050)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1047)
>     at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1057)
>     at org.apache.hadoop.fs.InstrumentedFileSystem.lambda$listStatus$17(InstrumentedFileSystem.java:379)
>     at org.apache.hadoop.fs.InstrumentedFileSystem$$Lambda$69/231154485.get(Unknown Source)
>     at com.linkedin.hadoop.metrics.fs.PerformanceTrackingFileSystem.process(PerformanceTrackingFileSystem.java:412)
>     at org.apache.hadoop.fs.InstrumentedFileSystem.process(InstrumentedFileSystem.java:100)
>     at org.apache.hadoop.fs.InstrumentedFileSystem.listStatus(InstrumentedFileSystem.java:379)
>     at org.apache.hadoop.fs.PerformanceTrackingDistributedFileSystem.listStatus(PerformanceTrackingDistributedFileSystem.java:296)
>     at org.apache.hadoop.fs.FilterFileSystem.listStatus(FilterFileSystem.java:258)
>     at org.apache.hadoop.fs.viewfs.ChRootedFileSystem.listStatus(ChRootedFileSystem.java:253)
>     at org.apache.hadoop.fs.viewfs.ViewFileSystem.listStatus(ViewFileSystem.java:528)
>     at org.apache.hadoop.fs.GridFilesystem.lambda$listStatus$4(GridFilesystem.java:491)
>     at org.apache.hadoop.fs.GridFilesystem$$Lambda$68/2109027988.doCall(Unknown Source) {code}
> Because the job fails silently, the user is never explicitly notified of the
> failure. Hence, on an OOM, the retention job should fail explicitly if it
> cannot proceed further.
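One way to achieve the explicit failure described above is to have each per-dataset callback record fatal errors, and have the driver rethrow after all cleanups have signalled completion. The following is a hedged sketch with illustrative names (`RetentionJobDriver`, `awaitCompletion`), not the actual Gobblin implementation:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical driver-side sketch: await all per-dataset cleanups, then
// rethrow any recorded fatal error so the workflow reports failure
// instead of Success.
public class RetentionJobDriver {
    private final CountDownLatch finishCleanSignal;
    private final AtomicReference<Throwable> fatalError = new AtomicReference<>();

    public RetentionJobDriver(int datasetCount) {
        this.finishCleanSignal = new CountDownLatch(datasetCount);
    }

    public void onDatasetSuccess() {
        finishCleanSignal.countDown();
    }

    public void onDatasetFailure(Throwable t) {
        if (t instanceof OutOfMemoryError) {
            fatalError.compareAndSet(null, t); // keep the first fatal error
        }
        finishCleanSignal.countDown();
    }

    // Blocks until every dataset has reported, then fails the job
    // explicitly if any cleanup hit a fatal error.
    public void awaitCompletion() throws InterruptedException {
        finishCleanSignal.await();
        Throwable t = fatalError.get();
        if (t != null) {
            throw new RuntimeException("Retention job failed on fatal error", t);
        }
    }
}
```

With this shape, an OOM in any dataset cleanup propagates out of `awaitCompletion` as an exception, so the workflow exits non-success rather than silently reporting Success.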
--
This message was sent by Atlassian Jira
(v8.20.10#820010)