Kalvin2077 commented on code in PR #3353:
URL: https://github.com/apache/celeborn/pull/3353#discussion_r2234787782
##########
worker/src/main/scala/org/apache/celeborn/service/deploy/worker/storage/StorageManager.scala:
##########
@@ -165,6 +157,7 @@ final private[worker] class StorageManager(conf: CelebornConf, workerSource: Abs
(flushers, totalThread)
}
+ // TODO
Review Comment:
I have tested it with HDFS as the only storage type, and Celeborn shuffles normally.
The following is the test configuration:
```
...
...
# celeborn.worker.storage.dirs /mnt/disk1/celeborn:disktype=SSD,/mnt/disk2/celeborn:disktype=SSD,/mnt/disk3/celeborn:disktype=SSD,/mnt/disk4/celeborn:disktype=SSD
celeborn.storage.availableTypes HDFS
celeborn.storage.hdfs.dir hdfs://master-1-1:9000/user/emr-user/celeborn
celeborn.logConf.enabled true
```
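For reference, here is a minimal Scala sketch of what the fragment above leaves in effect. It assumes the usual `key value` line layout with `#` comment lines used by `celeborn-defaults.conf`. `ConfFragment.parse` is a hypothetical helper written for illustration, not a Celeborn API, and the elided `...` lines are left out. It only shows that, with `celeborn.worker.storage.dirs` commented out, HDFS is the sole configured storage type.

```scala
// Hypothetical helper for illustration only; not part of Celeborn.
// Assumes "key<whitespace>value" lines, with '#' marking a comment line.
object ConfFragment {
  def parse(text: String): Map[String, String] =
    text.linesIterator
      .map(_.trim)
      .filter(line => line.nonEmpty && !line.startsWith("#")) // drop blanks and comments
      .map { line =>
        // Split once on the first whitespace run: key, then the rest as the value.
        val parts = line.split("\\s+", 2)
        parts(0) -> (if (parts.length > 1) parts(1).trim else "")
      }
      .toMap
}

// The test configuration from this comment (the elided "..." lines omitted).
val testConf = ConfFragment.parse(
  """# celeborn.worker.storage.dirs /mnt/disk1/celeborn:disktype=SSD,/mnt/disk2/celeborn:disktype=SSD,/mnt/disk3/celeborn:disktype=SSD,/mnt/disk4/celeborn:disktype=SSD
    |celeborn.storage.availableTypes HDFS
    |celeborn.storage.hdfs.dir hdfs://master-1-1:9000/user/emr-user/celeborn
    |celeborn.logConf.enabled true
    |""".stripMargin)

// With the local dirs commented out, HDFS is the only storage type left.
val storageTypes = testConf("celeborn.storage.availableTypes").split(",").map(_.trim).toSet
val hasLocalDirs = testConf.contains("celeborn.worker.storage.dirs")
println(s"storage types: $storageTypes, local dirs configured: $hasLocalDirs")
```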
The following is an excerpt of the stderr log from one executor:
```
Log Type: stderr
Log Upload Time: Mon Jul 28 12:39:53 +0800 2025
Log Length: 19805
Showing 4096 bytes of 19805 total.
age 0.0 (TID 12)] Executor: Finished task 12.0 in stage 0.0 (TID 12). 2498 bytes result sent to driver
25/07/28 12:39:41 INFO [dispatcher-Executor] YarnCoarseGrainedExecutorBackend: Got assigned task 14
25/07/28 12:39:41 INFO [Executor task launch worker for task 14.0 in stage 0.0 (TID 14)] Executor: Running task 14.0 in stage 0.0 (TID 14)
25/07/28 12:39:41 INFO [Executor task launch worker for task 14.0 in stage 0.0 (TID 14)] HadoopRDD: Input split: hdfs://master-1-1.c-53a05cc9683197a2.cn-hangzhou.emr.aliyuncs.com:9000/user/hive/warehouse/mock.db/orders/part-00001-ec695775-417c-43d1-aeb9-89ea5766571b-c000:20394822+3399137
25/07/28 12:39:41 INFO [Executor task launch worker for task 14.0 in stage 0.0 (TID 14)] SortBasedShuffleWriter: Memory used 40.0 MiB
25/07/28 12:39:41 INFO [Executor task launch worker for task 14.0 in stage 0.0 (TID 14)] Executor: Finished task 14.0 in stage 0.0 (TID 14). 2498 bytes result sent to driver
25/07/28 12:39:41 INFO [dispatcher-Executor] YarnCoarseGrainedExecutorBackend: Got assigned task 17
25/07/28 12:39:41 INFO [Executor task launch worker for task 1.0 in stage 2.0 (TID 17)] Executor: Running task 1.0 in stage 2.0 (TID 17)
25/07/28 12:39:41 INFO [Executor task launch worker for task 1.0 in stage 2.0 (TID 17)] MapOutputTrackerWorker: Updating epoch to 1 and clearing cache
25/07/28 12:39:41 INFO [Executor task launch worker for task 1.0 in stage 2.0 (TID 17)] TorrentBroadcast: Started reading broadcast variable 2 with 1 pieces (estimated total size 4.0 MiB)
25/07/28 12:39:41 INFO [Executor task launch worker for task 1.0 in stage 2.0 (TID 17)] MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 19.9 KiB, free 925.7 MiB)
25/07/28 12:39:41 INFO [Executor task launch worker for task 1.0 in stage 2.0 (TID 17)] TorrentBroadcast: Reading broadcast variable 2 took 9 ms
25/07/28 12:39:41 INFO [Executor task launch worker for task 1.0 in stage 2.0 (TID 17)] MemoryStore: Block broadcast_2 stored as values in memory (estimated size 42.2 KiB, free 925.7 MiB)
25/07/28 12:39:44 INFO [Executor task launch worker for task 1.0 in stage 2.0 (TID 17)] ShuffleClientImpl: Shuffle 0 request reducer file group success using 2486 ms, result partition size 200.
25/07/28 12:39:44 INFO [Executor task launch worker for task 1.0 in stage 2.0 (TID 17)] CelebornShuffleReader: BatchOpenStream for 101 cost 33ms
25/07/28 12:39:44 INFO [celeborn-create-stream-thread-8] CelebornHadoopUtils: Celeborn overrides some HDFS settings defined in Hadoop configuration files, including 'fs.hdfs.impl.disable.cache=false' and 'dfs.replication=2'. It can be overridden again in Celeborn configuration with the additional prefix 'celeborn.hadoop.', e.g. 'celeborn.hadoop.dfs.replication=3'
25/07/28 12:39:44 INFO [Executor task launch worker for task 1.0 in stage 2.0 (TID 17)] CodeGenerator: Code generated in 41.724212 ms
25/07/28 12:39:44 INFO [Executor task launch worker for task 1.0 in stage 2.0 (TID 17)] CelebornShuffleReader: inputStream for partition: 99 is null, sleeping 5ms
25/07/28 12:39:44 INFO [Executor task launch worker for task 1.0 in stage 2.0 (TID 17)] CelebornShuffleReader: inputStream for partition: 99 is not null, sleep 2 times for 10 ms
25/07/28 12:39:44 INFO [Executor task launch worker for task 1.0 in stage 2.0 (TID 17)] Executor: Finished task 1.0 in stage 2.0 (TID 17). 838094 bytes result sent to driver
25/07/28 12:39:51 INFO [dispatcher-Executor] YarnCoarseGrainedExecutorBackend: Driver commanded a shutdown
25/07/28 12:39:51 INFO [CoarseGrainedExecutorBackend-stop-executor] RpcMetricsTracker: RPC statistics for endpoint-verifier (time unit: ns)
current queue size = 1
max queue length = 1
```
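The `CelebornHadoopUtils` line in the log describes a three-level precedence for HDFS settings. Hadoop configuration files come first, Celeborn's own overrides replace them, and keys prefixed with `celeborn.hadoop.` replace those in turn. A rough Scala model of that precedence using plain `Map`s might look like the following. `HadoopOverrideSketch` and `effectiveHadoopConf` are hypothetical names, the sample Hadoop-file values are made up, and this is not Celeborn's actual implementation.

```scala
// Hypothetical model of the precedence stated in the CelebornHadoopUtils log
// message; NOT Celeborn's real code. Later maps win on key collisions.
object HadoopOverrideSketch {
  val CelebornHadoopPrefix = "celeborn.hadoop."

  // Settings the log message says Celeborn forces over Hadoop config files.
  val celebornOverrides: Map[String, String] = Map(
    "fs.hdfs.impl.disable.cache" -> "false",
    "dfs.replication" -> "2")

  def effectiveHadoopConf(
      hadoopFiles: Map[String, String],
      celebornConf: Map[String, String]): Map[String, String] = {
    // Keep only "celeborn.hadoop.*" keys and strip the prefix from them.
    val userOverrides = celebornConf.collect {
      case (k, v) if k.startsWith(CelebornHadoopPrefix) =>
        k.stripPrefix(CelebornHadoopPrefix) -> v
    }
    // Hadoop files < Celeborn's built-in overrides < user-prefixed keys.
    hadoopFiles ++ celebornOverrides ++ userOverrides
  }
}

// Sample inputs for illustration; the override mirrors the log's example.
val merged = HadoopOverrideSketch.effectiveHadoopConf(
  hadoopFiles = Map("dfs.replication" -> "3", "dfs.blocksize" -> "134217728"),
  celebornConf = Map("celeborn.hadoop.dfs.replication" -> "3"))
println(merged)
```

Without the `celeborn.hadoop.dfs.replication` key, this model falls back to Celeborn's `2`, even when the Hadoop files specify a different value.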
PTAL @RexXiong @SteNicholas
--
This is an automated message from the Apache Git Service.