Kalvin2077 commented on code in PR #3353:
URL: https://github.com/apache/celeborn/pull/3353#discussion_r2234787782


##########
worker/src/main/scala/org/apache/celeborn/service/deploy/worker/storage/StorageManager.scala:
##########
@@ -165,6 +157,7 @@ final private[worker] class StorageManager(conf: CelebornConf, workerSource: Abs
     (flushers, totalThread)
   }
 
+  // TODO

Review Comment:
   I have tested it with only HDFS enabled, and Celeborn shuffles normally.
   
   The following is the test configuration:
   
   ```
   ...
   ...
   # celeborn.worker.storage.dirs /mnt/disk1/celeborn:disktype=SSD,/mnt/disk2/celeborn:disktype=SSD,/mnt/disk3/celeborn:disktype=SSD,/mnt/disk4/celeborn:disktype=SSD
   celeborn.storage.availableTypes HDFS
   celeborn.storage.hdfs.dir hdfs://master-1-1:9000/user/emr-user/celeborn
   
   celeborn.logConf.enabled true
   ```
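
As a rough sanity check of a configuration like the one above, the sketch below parses the whitespace-separated `key value` lines and confirms that HDFS is the only enabled storage type and that no local storage dirs are set. The helper names `parse_conf` and `is_hdfs_only` are hypothetical and not part of Celeborn, the sample is abbreviated to one commented-out disk, and it assumes `celeborn.storage.availableTypes` takes a comma-separated list.

```python
def parse_conf(text):
    """Parse 'key value' lines; skip blanks, '...' placeholders and '#' comments."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or line == "...":
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            conf[parts[0]] = parts[1].strip()
    return conf


def is_hdfs_only(conf):
    """True if HDFS is the only storage type and no local dirs are configured.

    Assumption: availableTypes is a comma-separated list of storage types.
    """
    raw_types = conf.get("celeborn.storage.availableTypes", "")
    types = [t.strip() for t in raw_types.split(",") if t.strip()]
    return types == ["HDFS"] and "celeborn.worker.storage.dirs" not in conf


# Abbreviated copy of the test configuration (local dirs commented out).
SAMPLE = """
...
# celeborn.worker.storage.dirs /mnt/disk1/celeborn:disktype=SSD
celeborn.storage.availableTypes HDFS
celeborn.storage.hdfs.dir hdfs://master-1-1:9000/user/emr-user/celeborn

celeborn.logConf.enabled true
"""

conf = parse_conf(SAMPLE)
print(is_hdfs_only(conf))
```

Because the `celeborn.worker.storage.dirs` line is commented out, it never reaches the parsed map, which is what makes this an HDFS-only setup in the sketch.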
   
   The following is the debug log from one executor:
   
   ```
   Log Type: stderr
   
   Log Upload Time: Mon Jul 28 12:39:53 +0800 2025
   
   Log Length: 19805
   
   Showing 4096 bytes of 19805 total. Click [here](http://master-1-1.c-53a05cc9683197a2.cn-hangzhou.emr.aliyuncs.com:19888/jobhistory/logs/core-1-2.c-53a05cc9683197a2.cn-hangzhou.emr.aliyuncs.com:8041/container_1753667665029_0009_01_000002/container_1753667665029_0009_01_000002/emr-user/stderr/?start=0&start.time=0&end.time=9223372036854775807) for the full log.
   
   age 0.0 (TID 12)] Executor: Finished task 12.0 in stage 0.0 (TID 12). 2498 bytes result sent to driver
   25/07/28 12:39:41 INFO [dispatcher-Executor] YarnCoarseGrainedExecutorBackend: Got assigned task 14
   25/07/28 12:39:41 INFO [Executor task launch worker for task 14.0 in stage 0.0 (TID 14)] Executor: Running task 14.0 in stage 0.0 (TID 14)
   25/07/28 12:39:41 INFO [Executor task launch worker for task 14.0 in stage 0.0 (TID 14)] HadoopRDD: Input split: hdfs://master-1-1.c-53a05cc9683197a2.cn-hangzhou.emr.aliyuncs.com:9000/user/hive/warehouse/mock.db/orders/part-00001-ec695775-417c-43d1-aeb9-89ea5766571b-c000:20394822+3399137
   25/07/28 12:39:41 INFO [Executor task launch worker for task 14.0 in stage 0.0 (TID 14)] SortBasedShuffleWriter: Memory used 40.0 MiB
   25/07/28 12:39:41 INFO [Executor task launch worker for task 14.0 in stage 0.0 (TID 14)] Executor: Finished task 14.0 in stage 0.0 (TID 14). 2498 bytes result sent to driver
   25/07/28 12:39:41 INFO [dispatcher-Executor] YarnCoarseGrainedExecutorBackend: Got assigned task 17
   25/07/28 12:39:41 INFO [Executor task launch worker for task 1.0 in stage 2.0 (TID 17)] Executor: Running task 1.0 in stage 2.0 (TID 17)
   25/07/28 12:39:41 INFO [Executor task launch worker for task 1.0 in stage 2.0 (TID 17)] MapOutputTrackerWorker: Updating epoch to 1 and clearing cache
   25/07/28 12:39:41 INFO [Executor task launch worker for task 1.0 in stage 2.0 (TID 17)] TorrentBroadcast: Started reading broadcast variable 2 with 1 pieces (estimated total size 4.0 MiB)
   25/07/28 12:39:41 INFO [Executor task launch worker for task 1.0 in stage 2.0 (TID 17)] MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 19.9 KiB, free 925.7 MiB)
   25/07/28 12:39:41 INFO [Executor task launch worker for task 1.0 in stage 2.0 (TID 17)] TorrentBroadcast: Reading broadcast variable 2 took 9 ms
   25/07/28 12:39:41 INFO [Executor task launch worker for task 1.0 in stage 2.0 (TID 17)] MemoryStore: Block broadcast_2 stored as values in memory (estimated size 42.2 KiB, free 925.7 MiB)
   25/07/28 12:39:44 INFO [Executor task launch worker for task 1.0 in stage 2.0 (TID 17)] ShuffleClientImpl: Shuffle 0 request reducer file group success using 2486 ms, result partition size 200.
   25/07/28 12:39:44 INFO [Executor task launch worker for task 1.0 in stage 2.0 (TID 17)] CelebornShuffleReader: BatchOpenStream for 101 cost 33ms
   25/07/28 12:39:44 INFO [celeborn-create-stream-thread-8] CelebornHadoopUtils: Celeborn overrides some HDFS settings defined in Hadoop configuration files, including 'fs.hdfs.impl.disable.cache=false' and 'dfs.replication=2'. It can be overridden again in Celeborn configuration with the additional prefix 'celeborn.hadoop.', e.g. 'celeborn.hadoop.dfs.replication=3'
   25/07/28 12:39:44 INFO [Executor task launch worker for task 1.0 in stage 2.0 (TID 17)] CodeGenerator: Code generated in 41.724212 ms
   25/07/28 12:39:44 INFO [Executor task launch worker for task 1.0 in stage 2.0 (TID 17)] CelebornShuffleReader: inputStream for partition: 99 is null, sleeping 5ms
   25/07/28 12:39:44 INFO [Executor task launch worker for task 1.0 in stage 2.0 (TID 17)] CelebornShuffleReader: inputStream for partition: 99 is not null, sleep 2 times for 10 ms
   25/07/28 12:39:44 INFO [Executor task launch worker for task 1.0 in stage 2.0 (TID 17)] Executor: Finished task 1.0 in stage 2.0 (TID 17). 838094 bytes result sent to driver
   25/07/28 12:39:51 INFO [dispatcher-Executor] YarnCoarseGrainedExecutorBackend: Driver commanded a shutdown
   25/07/28 12:39:51 INFO [CoarseGrainedExecutorBackend-stop-executor] RpcMetricsTracker: RPC statistics for endpoint-verifier (time unit: ns)
   current queue size = 1
   max queue length = 1
   ```
   
   PTAL @RexXiong @SteNicholas 
   


