GitHub user Susmit07 added a comment to the discussion: Pekko Cluster Sharding - Race condition
What a superb response. I have just completed my POC, so let me share my findings.

> Yes: in a healthy cluster, you can be confident there will be a maximum of
> one actor running for each entity id. As such cluster sharding is indeed
> sufficient to make sure "a file being processed once at a given time across
> all the pods in the cluster"

You are correct, and my POC confirmed it. The entityId I chose was the HDFS / S3 directory that the schedulers poll at a specified interval. All the files in a given directory were processed in the same k8s pod, with no explicit lock needed.

> As mentioned in https://github.com/apache/pekko-connectors/discussions/814,
> on its own cluster sharding is not sufficient to get exactly-once delivery:
> when this file processing is interrupted for some reason, you need some way
> to make sure you can decide whether you need to re-start/resume this
> processing. You don't need any additional locking for this, but indeed making
> the upload idempotent would help to solve this aspect.

I am planning to keep one object-store marker file per directory (that is, per entityId) and to record the status of every successfully processed file under that directory as metadata on the marker file. On every poll, the directory listing would be compared against the marker, and any file that has already been processed successfully would be skipped. (Redis / DynamoDB would be ideal, but we have constraints that rule them out; the same applies to Pekko Persistence backed by relational DBs.)

> Actors are typically cheap, so in that sense the number of actors will not be
> a performance bottleneck. However, if there are many files, I could imagine
> you'd overload your system by starting too many uploads in parallel. You
> could probably restrict the number of parallel uploads from your HDFS
> scanning code.

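To make the marker-file plan from earlier in this comment a bit more concrete, here is a minimal sketch of the per-poll delta check. It is only an illustration under assumptions: the class name `MarkerDelta`, the method `pending`, and the file names are hypothetical, and reading the marker from (or writing it back to) the object store is left out entirely.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: not part of Pekko or of any object-store client.
final class MarkerDelta {

    // Returns the files from the current directory listing that the marker
    // does not yet record as successfully processed, in listing order.
    public static Set<String> pending(List<String> listed, Set<String> processed) {
        Set<String> delta = new LinkedHashSet<>(listed);
        delta.removeAll(processed);
        return delta;
    }

    public static void main(String[] args) {
        // Assumed marker content: files already processed in this directory.
        Set<String> processed = Set.of("a.csv", "b.csv");
        // Assumed result of the current poll of the same directory.
        List<String> listed = List.of("a.csv", "b.csv", "c.csv");
        System.out.println(pending(listed, processed)); // prints [c.csv]
    }
}
```

Because the entity actor for a directory is the only writer of that directory's marker, the check itself needs no lock. The marker would still have to be updated only after an upload has actually succeeded, so that an interrupted file shows up again in the next delta.
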
As of now I am going with one actor per directory, with the entityId being the directory path, but I am curious to know how to restrict the number of parallel uploads from the HDFS scanning code. Any suggestions, @raboof?

GitHub link: https://github.com/apache/pekko/discussions/1508#discussioncomment-10799719

----
This is an automatically sent email for notifications@pekko.apache.org.
To unsubscribe, please send an email to: notifications-unsubscr...@pekko.apache.org
For additional commands, e-mail: notifications-h...@pekko.apache.org