xy2953396112 commented on PR #3469: URL: https://github.com/apache/celeborn/pull/3469#issuecomment-3426372082
> Hi, I would add to the topic as I think this change addresses one of the bugs I'm experiencing.
>
> **TLDR: If I mix the SSD and S3 tiers on a single worker instance, LocalTierWriter is not able to evict or create partitions on S3. Mixed tiers work if I use separate workers for the S3 and SSD tiers, but in that case eviction from a worker that is full cannot happen.**
>
> Observed issue: the S3 eviction path uses LocalTierWriter/NIO → NoSuchFileException, so the SSD-to-S3 offload never happens and large shuffles fail.
>
> Build & env: Celeborn 0.6.0-SNAPSHOT (git [b537798](https://github.com/apache/celeborn/commit/b537798e37be1e5d7e905af6c7a2df905f1b0da5)), Scala 2.13, Hadoop 3.3.6 s3a, AWS SDK 1.12.x. S3 tier enabled and initialized:
>
> > StorageManager: Initialize S3 support with path s3a:///celeborn/
>
> Key config:
>
> > celeborn.storage.availableTypes=MEMORY,SSD,S3
> > celeborn.worker.storage.storagePolicy.createFilePolicy=MEMORY,SSD,S3
> > celeborn.worker.storage.storagePolicy.evictPolicy=MEMORY,SSD,S3
> > celeborn.storage.s3.dir=s3a:///celeborn/
> > celeborn.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
> > celeborn.worker.storage.disk.reserve.size=5G
>
> Capacity is set to ~3.0 TiB; the SSD fills during a big shuffle.
>
> Symptom: when the SSD reaches the high-usage threshold, eviction to S3 does not start. The worker flips to HIGH_DISK_USAGE, and subsequent creates/writes go through LocalTierWriter using NIO against an s3a: URI, leading to NoSuchFileException. Example stack (many similar lines):
>
> > ERROR PushDataHandler: Exception encountered when write.
> > java.nio.file.NoSuchFileException: s3a://celeborn/.../application_.../0/173-58-1
> > at java.nio.channels.FileChannel.open(FileChannel.java:298)
> > at org.apache.celeborn.common.util.FileChannelUtils.createWritableFileChannel(FileChannelUtils.java:28)
> > at org.apache.celeborn.service.deploy.worker.storage.LocalTierWriter.channel(TierWriter.scala:399)
> > at org.apache.celeborn.service.deploy.worker.storage.LocalTierWriter.genFlushTask(TierWriter.scala:410)
> > at org.apache.celeborn.service.deploy.worker.storage.TierWriterBase.flush(TierWriter.scala:195)
> > at org.apache.celeborn.service.deploy.worker.storage.LocalTierWriter.writeInternal(TierWriter.scala:419)
> > ...
>
> The disk monitor shows the SSD is full from Celeborn's perspective:
>
> > WARN DeviceMonitor: /mnt/celeborn usage is above threshold...
> > ... usage(Report by Celeborn): { total:2.9 TiB, free:0.0 B }
> > DEBUG StorageManager: ... usableSpace:0 ... status: HIGH_DISK_USAGE
>
> The commit phase fails with many partitions not committed due to the same NIO/S3 path mismatch:
>
> > ERROR Controller: Commit file ... failed.
> > java.nio.file.NoSuchFileException: s3a://celeborn/.../165-20-1
> > ...
> > WARN Controller: CommitFiles ... 291 committed primary, 47 failed primary, 563 committed replica, 118 failed replica.
>
> The cleaner thread also hits a DFS handle issue (likely a side effect of the wrong writer path):
>
> > ERROR worker-expired-shuffle-cleaner ...
> > java.lang.NullPointerException: ... FileSystem.delete(...) because "dfsFs" is null
>
> Inconsistent S3 exposure in heartbeats: before a restart, one worker sometimes advertises huge S3 available slots while the SSD is HIGH_DISK_USAGE:
>
> > "S3": "DiskInfo(maxSlots: 0, availableSlots: 137438953471, ... storageType: S3) status: HEALTHY"
> > "/mnt/celeborn": "... usableSpace: 0.0 B ... status: HIGH_DISK_USAGE"
>
> After a restart, S3 availableSlots drops to 0 on that same host, and the SSD shows space again.
> The behavior suggests the tier choice is not consistently honored at file creation time.
>
> What I expected: with createFilePolicy/evictPolicy = MEMORY,SSD,S3, once the SSD approaches its limit, new/evicted partitions should be written via the S3/DFS writer, not through LocalTierWriter/NIO.
>
> Why this PR looks relevant: my stack shows S3 is selected by policy, but the worker still constructs a local "disk file" writer and then tries to open an s3a: path with NIO (FileChannel.open), which cannot work.
>
> In the commit https://github.com/apache/celeborn/pull/3469/files#diff-332230b33db740720657fd9c90e4f4eb0bce18b43f4749fddfe37463cc11a9b1, the change seems to make StorageManager.createPartition(...) receive and honor the intended storageType, routing S3 creates through the DFS/S3 writer path instead of LocalTierWriter. That seems to directly address this mismatch.
>
> Could you confirm this could help with the issue I'm experiencing? If it has a chance of solving the issue, I will rebuild Celeborn with your patch and run some saturation tests.

Thank you for your comment, it seems highly relevant.

1. When the local disk is full or in a `HIGH_DISK_USAGE` state, `StorageManager#createDiskFile` will create a DFS-based DiskFileInfo, which in turn creates a `DfsTierWriter`.
2. In our internal implementation, we first create a LocalTierWriter. When the write threshold for a `PartitionLocation` is reached, it returns `SOFT_SPLIT` or `HARD_SPLIT`. The LifecycleManager counts the amount of data written to each partition, and when it exceeds the threshold, it changes the StorageType in the `PartitionLocation` and writes the data to DFS storage.
3. In the current code, when a `PartitionLocation` specifies a DFS StorageType, the worker can directly create `HDFS`, `OSS`, or `S3` storage based on that StorageType, instead of first creating a local disk file (rough sketches of points 2 and 3 follow after this list).

You can use this code for verification, and we hope it will resolve your issue. If you encounter any problems, feel free to discuss them.
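To make point 2 a bit more concrete, here is a minimal, self-contained sketch of the split-driven escalation. The names and thresholds below are made up for illustration and are not Celeborn's actual API: a local writer signals `SOFT_SPLIT`/`HARD_SPLIT` once a partition grows past a threshold, and a LifecycleManager-like component reacts by switching that partition's storage type to a DFS tier for subsequent writes.

```scala
// Illustrative only: made-up names and thresholds, not Celeborn's real classes.
object SplitDrivenEscalation {

  sealed trait WriteStatus
  case object Ok        extends WriteStatus
  case object SoftSplit extends WriteStatus // partition is getting large, plan a new location
  case object HardSplit extends WriteStatus // stop writing here, a new location is required

  sealed trait StorageTier
  case object LocalSsd extends StorageTier
  case object Dfs      extends StorageTier  // HDFS / OSS / S3 in the real system

  final class PartitionState(var tier: StorageTier, var bytesWritten: Long = 0L)

  // Arbitrary example thresholds.
  val SoftLimit: Long = 256L * 1024 * 1024
  val HardLimit: Long = 512L * 1024 * 1024

  // The writer only reports how large the partition has grown.
  def write(p: PartitionState, size: Long): WriteStatus = {
    p.bytesWritten += size
    if (p.bytesWritten >= HardLimit) HardSplit
    else if (p.bytesWritten >= SoftLimit) SoftSplit
    else Ok
  }

  // LifecycleManager-like reaction: once a split is signalled, escalate the
  // partition to the DFS tier so subsequent data goes to DFS storage.
  def onStatus(p: PartitionState, status: WriteStatus): Unit = status match {
    case SoftSplit | HardSplit if p.tier == LocalSsd => p.tier = Dfs
    case _                                           => ()
  }

  def main(args: Array[String]): Unit = {
    val p = new PartitionState(LocalSsd)
    (1 to 10).foreach { _ =>
      val status = write(p, 64L * 1024 * 1024) // 64 MiB per push, for the example
      onStatus(p, status)
      println(s"written=${p.bytesWritten} status=$status tier=${p.tier}")
    }
  }
}
```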
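And a similar sketch for point 3, the routing at file-creation time that this PR is about. Again the names (and the example bucket path) are hypothetical rather than the real `StorageManager.createPartition(...)` signature: the storage type carried by the `PartitionLocation` chooses the writer up front, so a DFS location such as an `s3a://` URI is never handed to a NIO `FileChannel.open`, which is exactly the failure in the stack trace above.

```scala
// Illustrative only: a toy version of the routing described in point 3,
// with hypothetical names rather than Celeborn's actual API.
object StorageTypeRouting {

  sealed trait StorageType
  case object Memory extends StorageType
  case object Ssd    extends StorageType
  case object Hdfs   extends StorageType
  case object Oss    extends StorageType
  case object S3     extends StorageType

  sealed trait Writer
  final case class LocalNioWriter(path: String) extends Writer // FileChannel-backed in the real worker
  final case class DfsWriter(uri: String)       extends Writer // Hadoop FileSystem-backed

  private def isDfs(t: StorageType): Boolean = t match {
    case Hdfs | Oss | S3 => true
    case _               => false
  }

  // Pick the writer from the storage type requested for the partition, so a
  // DFS location never reaches the NIO path (and FileChannel.open never sees
  // an s3a:// URI).
  def createWriter(requested: StorageType, localDir: String, dfsDir: String, file: String): Writer =
    if (isDfs(requested)) DfsWriter(s"$dfsDir/$file")
    else LocalNioWriter(s"$localDir/$file")

  def main(args: Array[String]): Unit = {
    println(createWriter(S3, "/mnt/celeborn", "s3a://bucket/celeborn", "173-58-1"))  // -> DfsWriter
    println(createWriter(Ssd, "/mnt/celeborn", "s3a://bucket/celeborn", "173-58-1")) // -> LocalNioWriter
  }
}
```

In the real worker, the DFS branch would correspond to the `DfsTierWriter` mentioned in point 1, while the local branch corresponds to `LocalTierWriter`.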
