xy2953396112 commented on PR #3469:
URL: https://github.com/apache/celeborn/pull/3469#issuecomment-3426372082

   > Hi, I'd like to add to this topic, as I think this change addresses one of the bugs I'm experiencing.
   > 
   > **TL;DR: If I mix the SSD and S3 tiers on a single worker instance, LocalTierWriter cannot evict or create partitions on S3. Mixed tiers work if I use separate workers for the S3 and SSD tiers, but in that case eviction from a full worker cannot happen.**
   > 
   > Observed issue: the S3 eviction path uses LocalTierWriter/NIO → NoSuchFileException, so SSD offload never happens and large shuffles fail.
   > 
   > Build & env: Celeborn 0.6.0-SNAPSHOT (git [b537798](https://github.com/apache/celeborn/commit/b537798e37be1e5d7e905af6c7a2df905f1b0da5)), Scala 2.13, Hadoop 3.3.6 s3a, AWS SDK 1.12.x. The S3 tier is enabled and initialized:
   > 
   > > StorageManager: Initialize S3 support with path s3a:///celeborn/
   > 
   > Key config:
   > 
   > > celeborn.storage.availableTypes=MEMORY,SSD,S3
   > > celeborn.worker.storage.storagePolicy.createFilePolicy=MEMORY,SSD,S3
   > > celeborn.worker.storage.storagePolicy.evictPolicy=MEMORY,SSD,S3
   > > celeborn.storage.s3.dir=s3a:///celeborn/
   > > celeborn.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
   > > celeborn.worker.storage.disk.reserve.size=5G
   > > capacity set to ~3.0 TiB; SSD fills during big shuffle
   > 
   > Symptom: When the SSD reaches the high-usage threshold, eviction to S3 does not start. The worker flips to HIGH_DISK_USAGE, and subsequent creates/writes go through LocalTierWriter using NIO against an s3a: URI, leading to NoSuchFileException. Example stack (many similar lines):
   > 
   > > ERROR PushDataHandler: Exception encountered when write.
   > > java.nio.file.NoSuchFileException: s3a://celeborn/.../application_.../0/173-58-1
   > > at java.nio.channels.FileChannel.open(FileChannel.java:298)
   > > at org.apache.celeborn.common.util.FileChannelUtils.createWritableFileChannel(FileChannelUtils.java:28)
   > > at org.apache.celeborn.service.deploy.worker.storage.LocalTierWriter.channel(TierWriter.scala:399)
   > > at org.apache.celeborn.service.deploy.worker.storage.LocalTierWriter.genFlushTask(TierWriter.scala:410)
   > > at org.apache.celeborn.service.deploy.worker.storage.TierWriterBase.flush(TierWriter.scala:195)
   > > at org.apache.celeborn.service.deploy.worker.storage.LocalTierWriter.writeInternal(TierWriter.scala:419)
   > > ...
   > 
   > Disk monitor shows the SSD is full from Celeborn’s perspective:
   > 
   > > WARN DeviceMonitor: /mnt/celeborn usage is above threshold...
   > > ... usage(Report by Celeborn): { total:2.9 TiB, free:0.0 B }
   > > DEBUG StorageManager: ... usableSpace:0 ... status: HIGH_DISK_USAGE
   > 
   > The commit phase fails with many partitions not committed due to the same NIO/S3 path mismatch:
   > 
   > > ERROR Controller: Commit file ... failed.
   > > java.nio.file.NoSuchFileException: s3a://celeborn/.../165-20-1
   > > ...
   > > WARN Controller: CommitFiles ... 291 committed primary, 47 failed primary, 563 committed replica, 118 failed replica.
   > 
   > The cleaner thread also hits a DFS handle issue (likely a side effect of the wrong writer path):
   > 
   > > ERROR worker-expired-shuffle-cleaner ...
   > > java.lang.NullPointerException: ... FileSystem.delete(...) because "dfsFs" is null
   > 
   > Inconsistent S3 exposure in heartbeats: Before a restart, one worker sometimes advertises a huge number of available S3 slots while the SSD is in HIGH_DISK_USAGE:
   > 
   > > "S3": "DiskInfo(maxSlots: 0, availableSlots: 137438953471, ... storageType: S3) status: HEALTHY"
   > > "/mnt/celeborn": "... usableSpace: 0.0 B ... status: HIGH_DISK_USAGE"
   > 
   > After a restart, S3 availableSlots drops to 0 on that same host, and the SSD shows space again. This behavior suggests the tier choice is not consistently honored at file creation time.
   > 
   > What I expected: With createFilePolicy/evictPolicy = MEMORY,SSD,S3, once the SSD approaches its limit, new and evicted partitions should be written via the S3/DFS writer, not through LocalTierWriter/NIO.
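   > 
   > As a hypothetical sketch (not Celeborn's actual API; the names below are made up), the behavior I expect from that policy string is simply "try the tiers in the configured order and use the first one that still has capacity":
   > 
   > ```scala
   > object EvictPolicySketch {
   >   // Hypothetical illustration of the expected evictPolicy ordering;
   >   // these names do not exist in Celeborn.
   >   val policy: Seq[String] = "MEMORY,SSD,S3".split(",").toSeq
   > 
   >   // Pick the first tier in policy order that still has capacity.
   >   def chooseTier(hasCapacity: String => Boolean): Option[String] =
   >     policy.find(hasCapacity)
   > 
   >   // With memory exhausted and the SSD at HIGH_DISK_USAGE, S3 should be chosen,
   >   // i.e. chooseTier(_ == "S3") == Some("S3"), and the S3/DFS writer used.
   > }
   > ```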
   > 
   > Why this PR looks relevant: My stack trace shows S3 is selected by policy, but the worker still constructs a local “disk file” writer and then tries to open an s3a: path with NIO (FileChannel.open), which cannot work.
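   > 
   > A minimal standalone repro of that failure mode (not Celeborn code; the path string below is made up): handed to java.nio, an s3a: location is just a relative path on the local filesystem whose parent directory does not exist, so FileChannel.open can only fail.
   > 
   > ```scala
   > import java.nio.channels.FileChannel
   > import java.nio.file.{Paths, StandardOpenOption}
   > 
   > object S3aNioRepro {
   >   def main(args: Array[String]): Unit = {
   >     // The default NIO filesystem treats the s3a: URI as a plain relative path,
   >     // so opening it with CREATE/WRITE fails because its parent does not exist.
   >     val path = Paths.get("s3a://celeborn/app/0/173-58-1") // illustrative path
   >     FileChannel.open(path, StandardOpenOption.CREATE, StandardOpenOption.WRITE)
   >     // => java.nio.file.NoSuchFileException, matching the FileChannel.open
   >     //    frame at the top of the stack above
   >   }
   > }
   > ```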
   > 
   > In the commit https://github.com/apache/celeborn/pull/3469/files#diff-332230b33db740720657fd9c90e4f4eb0bce18b43f4749fddfe37463cc11a9b1, the change seems to make StorageManager.createPartition(...) receive and honor the intended storageType, routing S3 creates through the DFS/S3 writer path instead of LocalTierWriter. That seems to directly address this mismatch.
   > 
   > Could you confirm this could help with the issue I'm experiencing? If it has a chance of solving the issue, I will rebuild Celeborn with your patch and run some saturation tests.
   
   Thank you for your comment; it seems highly relevant.
   
   1. When the local disk is full or in a `HIGH_DISK_USAGE` state, 
`StorageManager#createDiskFile` will create a DFS-based DiskFileInfo, which in 
turn creates a `DfsTierWriter`.
   2. In our internal implementation, we first create a LocalTierWriter. When 
the write threshold for a `PartitionLocation` is reached, it will return 
`SOFT_SPLIT` or `HARD_SPLIT`. The LifecycleManager will count the amount of 
data written to each Partition, and when it exceeds the threshold, it will 
change the StorageType in `PartitionLocation` and write the data to `DFS 
storage`.
   3. In the current code, when a PartitionLocation specifies a `DFS StorageType`, it can directly create `HDFS, OSS, or S3` storage based on that StorageType, instead of first creating a local disk file, as sketched below.
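
   To make point 3 concrete, here is a simplified, self-contained sketch of that routing (the type and method names are stand-ins, not the real signatures in the PR; please refer to the diff for the actual code):

   ```scala
   object TierRoutingSketch {
     // Stand-ins for Celeborn's StorageType values and TierWriter implementations.
     sealed trait StorageType
     case object MEMORY extends StorageType
     case object SSD extends StorageType
     case object HDFS extends StorageType
     case object OSS extends StorageType
     case object S3 extends StorageType

     sealed trait Writer
     case class LocalWriter(path: String) extends Writer // NIO-based local disk writer
     case class DfsWriter(uri: String) extends Writer    // Hadoop FileSystem based writer
     case class MemoryWriter(id: String) extends Writer

     // The writer is chosen from the StorageType requested for the partition, so
     // DFS-backed tiers (HDFS/OSS/S3) never go through the local NIO path.
     def createWriter(storageType: StorageType, location: String): Writer =
       storageType match {
         case HDFS | OSS | S3 => DfsWriter(location)
         case MEMORY          => MemoryWriter(location)
         case SSD             => LocalWriter(location)
       }
   }
   ```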
   
   You can use the code in this PR for verification, and we hope it will resolve your issue. If you encounter any problems, feel free to discuss them.

