[https://issues.apache.org/jira/browse/GOBBLIN-2026?focusedWorklogId=913139&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-913139]
ASF GitHub Bot logged work on GOBBLIN-2026:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 05/Apr/24 05:12
Start Date: 05/Apr/24 05:12
Worklog Time Spent: 10m
Work Description: arpit09 commented on code in PR #3913:
URL: https://github.com/apache/gobblin/pull/3913#discussion_r1552932613
##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/retention/DatasetCleaner.java:
##########
@@ -155,6 +155,9 @@ public Void call() throws Exception {
public void onFailure(Throwable throwable) {
DatasetCleaner.this.finishCleanSignal.get().countDown();
LOG.warn("Exception caught when cleaning " + dataset.datasetURN() +
".", throwable);
+ if (throwable instanceof OutOfMemoryError) {
Review Comment:
The actual issue was a race condition with the CountDownLatch, which has now
been fixed. The RCA has been added to the description.
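To illustrate the race being discussed: if the callback counts down the latch before the failure is recorded, the thread awaiting the latch can observe "all cleanups finished" while the fatal error is still invisible to it. A minimal sketch of the safe ordering, using illustrative names (`CleanupCallback`, `getFatalError`) that are not the actual Gobblin code:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: record any fatal Error *before* releasing the latch,
// so the awaiting thread can never see the latch at zero while the error
// is still unrecorded.
public class CleanupCallback {
    private final CountDownLatch finishCleanSignal;
    private final AtomicReference<Throwable> fatalError = new AtomicReference<>();

    public CleanupCallback(CountDownLatch finishCleanSignal) {
        this.finishCleanSignal = finishCleanSignal;
    }

    public void onFailure(Throwable throwable) {
        if (throwable instanceof OutOfMemoryError) {
            // Keep only the first fatal error seen.
            fatalError.compareAndSet(null, throwable);
        }
        // Release the latch only after the error has been recorded.
        finishCleanSignal.countDown();
    }

    public Throwable getFatalError() {
        return fatalError.get();
    }
}
```

`CountDownLatch.await` establishes a happens-before relationship with `countDown`, so writes made before `countDown` (here, the `AtomicReference` update) are visible to the awaiting thread.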
Issue Time Tracking
-------------------
Worklog Id: (was: 913139)
Time Spent: 1h 50m (was: 1h 40m)
> Retention Job should fail on OOM
> --------------------------------
>
> Key: GOBBLIN-2026
> URL: https://issues.apache.org/jira/browse/GOBBLIN-2026
> Project: Apache Gobblin
> Issue Type: Improvement
> Components: misc
> Reporter: Arpit Varshney
> Priority: Major
> Time Spent: 1h 50m
> Remaining Estimate: 0h
>
> Currently, while cleaning log files, the Retention job runs out of memory and
> silently fails when the number of log files is too large. Even after this
> failure, the workflow execution reports Success.
> {code:java}
> 21-03-2024 01:00:03 PDT jobs-kafkaetl-gobblin-streaming-logs-cleaner INFO -
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>     at java.util.Arrays.copyOf(Arrays.java:3332)
>     at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
>     at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
>     at java.lang.StringBuffer.append(StringBuffer.java:270)
>     at java.net.URI.appendSchemeSpecificPart(URI.java:1911)
>     at java.net.URI.toString(URI.java:1941)
>     at java.net.URI.<init>(URI.java:742)
>     at org.apache.hadoop.fs.Path.makeQualified(Path.java:562)
>     at org.apache.hadoop.hdfs.protocol.HdfsFileStatus.makeQualified(HdfsFileStatus.java:271)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:997)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:121)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1050)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1047)
>     at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1057)
>     at org.apache.hadoop.fs.InstrumentedFileSystem.lambda$listStatus$17(InstrumentedFileSystem.java:379)
>     at org.apache.hadoop.fs.InstrumentedFileSystem$$Lambda$69/231154485.get(Unknown Source)
>     at com.linkedin.hadoop.metrics.fs.PerformanceTrackingFileSystem.process(PerformanceTrackingFileSystem.java:412)
>     at org.apache.hadoop.fs.InstrumentedFileSystem.process(InstrumentedFileSystem.java:100)
>     at org.apache.hadoop.fs.InstrumentedFileSystem.listStatus(InstrumentedFileSystem.java:379)
>     at org.apache.hadoop.fs.PerformanceTrackingDistributedFileSystem.listStatus(PerformanceTrackingDistributedFileSystem.java:296)
>     at org.apache.hadoop.fs.FilterFileSystem.listStatus(FilterFileSystem.java:258)
>     at org.apache.hadoop.fs.viewfs.ChRootedFileSystem.listStatus(ChRootedFileSystem.java:253)
>     at org.apache.hadoop.fs.viewfs.ViewFileSystem.listStatus(ViewFileSystem.java:528)
>     at org.apache.hadoop.fs.GridFilesystem.lambda$listStatus$4(GridFilesystem.java:491)
>     at org.apache.hadoop.fs.GridFilesystem$$Lambda$68/2109027988.doCall(Unknown Source) {code}
> Because the job fails silently, the user is never explicitly notified of the
> failure. Hence, on an OOM, the retention job should fail explicitly if it
> cannot proceed further.
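One way to achieve the explicit failure described above is to have each per-dataset callback record fatal errors, and have the driver rethrow after all cleanups have signalled completion. The following is a hedged sketch with illustrative names (`RetentionJobDriver`, `awaitCompletion`), not the actual Gobblin implementation:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical driver-side sketch: await all per-dataset cleanups, then
// rethrow any recorded fatal error so the workflow reports failure
// instead of Success.
public class RetentionJobDriver {
    private final CountDownLatch finishCleanSignal;
    private final AtomicReference<Throwable> fatalError = new AtomicReference<>();

    public RetentionJobDriver(int datasetCount) {
        this.finishCleanSignal = new CountDownLatch(datasetCount);
    }

    public void onDatasetSuccess() {
        finishCleanSignal.countDown();
    }

    public void onDatasetFailure(Throwable t) {
        if (t instanceof OutOfMemoryError) {
            fatalError.compareAndSet(null, t); // keep the first fatal error
        }
        finishCleanSignal.countDown();
    }

    // Blocks until every dataset has reported, then fails the job
    // explicitly if any cleanup hit a fatal error.
    public void awaitCompletion() throws InterruptedException {
        finishCleanSignal.await();
        Throwable t = fatalError.get();
        if (t != null) {
            throw new RuntimeException("Retention job failed on fatal error", t);
        }
    }
}
```

With this shape, an OOM in any dataset cleanup propagates out of `awaitCompletion` as an exception, so the workflow exits non-success rather than silently reporting Success.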
--
This message was sent by Atlassian Jira
(v8.20.10#820010)