GitHub user Susmit07 edited a comment on the discussion: Pekko Cluster Sharding 
- Race condition

What a superb response.

I have just completed my POC; let me share my findings.

> Yes: in a healthy cluster, you can be confident there will be a maximum of 
> one actor running for each entity id. As such cluster sharding is indeed 
> sufficient to make sure "a file being processed once at a given time across 
> all the pods in the cluster"

You are correct, and my POC confirmed it. The entityId I chose was the HDFS / 
S3 directory that the schedulers poll at a specified interval. All the files 
in a given directory were processed in the same k8s pod, with no explicit lock 
needed.

> As mentioned in https://github.com/apache/pekko-connectors/discussions/814, 
> on its own cluster sharding is not sufficient to get exactly-once delivery: 
> when this file processing is interrupted for some reason, you need some way 
> to make sure you can decide whether you need to re-start/resume this 
> processing. You don't need any additional locking for this, but indeed making 
> the upload idempotent would help to solve this aspect.

I am planning to keep one object-store marker file per directory (i.e. per 
entityId) and record the status of every successfully processed file under 
that directory as metadata in the marker file. On every poll it should compute 
the delta against the marker, and any file that was already processed 
successfully will be ignored [checkpoint]. (Redis / DynamoDB would be ideal, 
but we have constraints; the same goes for Pekko Persistence backed by 
relational DBs.)
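
A minimal sketch of the delta comparison described above, in plain Java with 
no Pekko or object-store dependency. The class name `MarkerDelta`, the method 
`pendingFiles`, and the idea that the marker can be read back as a set of file 
names are all assumptions made for illustration; how the marker is actually 
stored and updated is left open.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: names and shapes are illustrative, not part of any library.
final class MarkerDelta {
    private MarkerDelta() {}

    /**
     * Returns the files seen in the current poll that the marker does not
     * yet list as successfully processed, sorted by name.
     */
    static List<String> pendingFiles(Set<String> listedFiles, Set<String> processedInMarker) {
        TreeSet<String> pending = new TreeSet<>(listedFiles);
        pending.removeAll(processedInMarker); // drop everything already checkpointed
        return List.copyOf(pending);          // immutable, keeps the sorted order
    }

    public static void main(String[] args) {
        Set<String> listed = Set.of("a.csv", "b.csv", "c.csv");
        Set<String> processed = Set.of("a.csv");
        System.out.println(pendingFiles(listed, processed)); // prints [b.csv, c.csv]
    }
}
```

Because the result depends only on the directory listing and the marker 
contents, re-running a poll after an interruption yields the same pending set, 
which is what makes the retry safe as long as the upload itself is idempotent.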

> Actors are typically cheap, so in that sense the number of actors will not be 
> a performance bottleneck. However, if there are many files, I could imagine 
> you'd overload your system by starting too many uploads in parallel. You 
> could probably restrict the number of parallel uploads from your HDFS 
> scanning code.

As of now I am going with one actor per directory, with the entityId being the 
directory path, but I am curious how to restrict the number of parallel 
uploads from the HDFS scanning code (any suggestions, @raboof?)
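
To make the question concrete, here is one possible way to cap parallel 
uploads, sketched with a plain `java.util.concurrent.Semaphore`. This is only 
an assumption about what such a limit could look like, not the approach 
suggested in the discussion; `BoundedUploads`, `uploadAll`, and the `upload` 
callback are hypothetical names, and the returned peak concurrency exists only 
so the limit can be checked.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical helper: names are illustrative, not part of Pekko.
final class BoundedUploads {
    private BoundedUploads() {}

    /**
     * Runs upload for every file with at most maxParallel uploads in flight,
     * and returns the highest concurrency that was actually observed.
     */
    static int uploadAll(List<String> files, int maxParallel, Consumer<String> upload) {
        Semaphore permits = new Semaphore(maxParallel);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            for (String file : files) {
                permits.acquire(); // the scanning loop blocks here until a slot frees up
                pool.execute(() -> {
                    try {
                        peak.accumulateAndGet(running.incrementAndGet(), Math::max);
                        upload.accept(file);
                    } finally {
                        running.decrementAndGet();
                        permits.release(); // free the slot even if the upload failed
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES); // wait for in-flight uploads
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while scheduling uploads", e);
        } finally {
            pool.shutdown(); // no-op if already shut down
        }
        return peak.get();
    }
}
```

Acquiring the permit in the scanning loop, rather than inside the task, means 
the scan itself slows down when all slots are busy instead of queueing an 
unbounded number of pending uploads.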

GitHub link: 
https://github.com/apache/pekko/discussions/1508#discussioncomment-10799719

----
This is an automatically sent email for notifications@pekko.apache.org.
To unsubscribe, please send an email to: 
notifications-unsubscr...@pekko.apache.org

