zy-jordan opened a new pull request, #1889:
URL: https://github.com/apache/incubator-celeborn/pull/1889

   …newly app dirs
   
   
   ### What changes were proposed in this pull request?
   
   The worker throws a FileNotFoundException while fetching a chunk:
   ```
   java.io.FileNotFoundException: /xxx/celeborn-worker/shuffle_data/subdir_0/application_1693206141914_540726_1/1/871-0-0 (No such file or directory)
   ```
   Before the shuffle files are committed, they are deleted by the storage-scheduler thread:
   ```
   2023-09-07 19:38:16,506 [INFO] [dispatcher-event-loop-44] - org.apache.celeborn.service.deploy.worker.storage.StorageManager -Logging.scala(51) -Create file /xxx/celeborn-worker/shuffle_data/subdir_0/application_1693206141914_540726_1/1/986-0-0 success
   2023-09-07 19:38:16,506 [INFO] [dispatcher-event-loop-44] - org.apache.celeborn.service.deploy.worker.Controller -Logging.scala(51) -Reserved 29 primary location and 0 replica location for application_1693206141914_540726_1-1
   2023-09-07 19:38:16,537 [INFO] [storage-scheduler] - org.apache.celeborn.service.deploy.worker.storage.StorageManager -Logging.scala(51) -Delete expired app dir /xxx/celeborn-worker/shuffle_data/subdir_0/application_1693206141914_540726_1.
   2023-09-07 19:38:16,580 [INFO] [storage-scheduler] - org.apache.celeborn.service.deploy.worker.storage.StorageManager -Logging.scala(51) -Delete expired app dir /xxx/celeborn-worker/shuffle_data/subdir_0/application_1693206141914_540726_1.
   2023-09-07 19:38:16,629 [INFO] [storage-scheduler] - org.apache.celeborn.service.deploy.worker.storage.StorageManager -Logging.scala(51) -Delete expired app dir /xxx/celeborn-worker/shuffle_data/subdir_0/application_1693206141914_540726_1.
   2023-09-07 19:38:16,661 [INFO] [storage-scheduler] - org.apache.celeborn.service.deploy.worker.storage.StorageManager -Logging.scala(51) -Delete expired app dir /xxx/celeborn-worker/shuffle_data/subdir_0/application_1693206141914_540726_1.
   2023-09-07 19:38:16,681 [INFO] [storage-scheduler] - org.apache.celeborn.service.deploy.worker.storage.StorageManager -Logging.scala(51) -Delete expired app dir /xxx/celeborn-worker/shuffle_data/subdir_0/application_1693206141914_540726_1.
   2023-09-07 19:38:17,355 [INFO] [dispatcher-event-loop-12] - org.apache.celeborn.service.deploy.worker.Controller -Logging.scala(51) -Start commitFiles for application_1693206141914_540726_1-1
   2023-09-07 19:38:17,362 [INFO] [async-reply] - org.apache.celeborn.service.deploy.worker.Controller -Logging.scala(51) -CommitFiles for application_1693206141914_540726_1-1 success with 29 committed primary partitions, 0 empty primary partitions, 0 failed primary partitions, 0 committed replica partitions, 0 empty replica partitions, 0 failed replica partitions.
   java.io.FileNotFoundException: /xxx/celeborn-worker/shuffle_data/subdir_0/application_1693206141914_540726_1/1/976-0-0 (No such file or directory)
   java.io.FileNotFoundException: /xxx/celeborn-worker/shuffle_data/subdir_0/application_1693206141914_540726_1/1/482-0-0 (No such file or directory)
   java.io.FileNotFoundException: /xxx/celeborn-worker/shuffle_data/subdir_0/application_1693206141914_540726_1/1/658-0-0 (No such file or directory)
   ```
   There may be a concurrency problem in this method:
   ``` scala
   private def cleanupExpiredAppDirs(): Unit = {
     val appIds = shuffleKeySet().asScala.map(key => Utils.splitShuffleKey(key)._1)
     disksSnapshot().filter(_.status != DiskStatus.IO_HANG).foreach { diskInfo =>
       diskInfo.dirs.foreach {
         case workingDir if workingDir.exists() =>
           workingDir.listFiles().foreach { appDir =>
             // Don't delete shuffleKey's data that exist correct shuffle file info.
             if (!appIds.contains(appDir.getName)) {
               val threadPool = diskOperators.get(diskInfo.mountPoint)
               deleteDirectory(appDir, threadPool)
               logInfo(s"Delete expired app dir $appDir.")
             }
           }
         // workingDir not exist when initializing worker on new disk
         case _ => // do nothing
       }
     }
   }
   ```
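   Maybe we should list all app directories first, and only then read the active shuffle keys. With that order, an app directory created after the listing is never considered for deletion in that pass.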
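
The ordering issue can be illustrated with a small, self-contained simulation. This is only a sketch and not the worker code: `WorkerState`, `registerApp`, `expiredKeysFirst`, `expiredDirsFirst`, and the `concurrentEvent` hook are hypothetical names introduced here, and the sketch assumes that an app's shuffle key is registered no later than its directory becomes visible on disk.

```scala
import scala.collection.mutable

object CleanupOrderSketch {
  // Hypothetical stand-in for worker state: app dirs on disk and the
  // app ids that currently own at least one shuffle key.
  final class WorkerState {
    val dirsOnDisk: mutable.Set[String] = mutable.Set.empty
    val activeAppIds: mutable.Set[String] = mutable.Set.empty

    // Simulates reserving slots for a new app: the key is registered
    // and the app directory is created.
    def registerApp(appId: String): Unit = {
      activeAppIds += appId
      dirsOnDisk += appId
    }
  }

  // Current order: snapshot active app ids, then list directories.
  // `concurrentEvent` runs between the two steps so the race is deterministic.
  def expiredKeysFirst(state: WorkerState, concurrentEvent: () => Unit): Set[String] = {
    val active = state.activeAppIds.toSet
    concurrentEvent()
    val dirs = state.dirsOnDisk.toSet
    dirs.filterNot(active.contains)
  }

  // Proposed order: list directories, then snapshot active app ids.
  def expiredDirsFirst(state: WorkerState, concurrentEvent: () => Unit): Set[String] = {
    val dirs = state.dirsOnDisk.toSet
    concurrentEvent()
    val active = state.activeAppIds.toSet
    dirs.filterNot(active.contains)
  }

  // A state with one leftover directory that has no active shuffle key.
  def freshState(): WorkerState = {
    val s = new WorkerState
    s.dirsOnDisk += "app_old"
    s
  }
}
```

Passing `() => state.registerApp("app_new")` as the concurrent event makes `expiredKeysFirst` report both `app_old` and `app_new` as expired, while `expiredDirsFirst` reports only `app_old`. Reordering narrows the window but does not make the check atomic, so a directory whose last shuffle key was just removed could still be deleted while a new shuffle for the same app is being registered.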
   
   https://issues.apache.org/jira/browse/CELEBORN-881
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   

