On Tuesday, April 2, 2013, Carsten Ziegeler wrote:
> Hi,
>
> I'm currently prototyping enhancements in the Sling job handling to allow
> for better scaling in clustered / distributed environments. The current
> implementation makes a lot of assumptions and relies on JCR locks. These
> assumptions, in combination with problems with JCR locks, usually lead to
> a setup where just a single instance in a cluster processes jobs. The
> goal is to be able to run jobs distributed in a cluster, but also to be
> able to process jobs only on specific instances (e.g. to offload some
> heavy jobs onto dedicated machines).
>
> Though this is still in an early phase, I would like to run some of the
> potential user-facing changes through this list.
>
> a) Jobs containing queue configurations
> The configuration of job handling is usually done through queue
> configurations. These queues are assigned to one or more job topics and
> have different characteristics, such as whether jobs can be processed in
> parallel, how often a job should be retried, the delay between retries,
> etc. The queues are configured globally through OSGi ConfigAdmin and are
> therefore the same on all cluster nodes.
> When we started with the job handling, we didn't have this configuration,
> so each and every job contained this whole information as properties of
> the job itself - which clearly is a maintenance nightmare, but can also
> lead to funny situations where two jobs with the same topic contain
> different configurations (e.g. one allowing parallel processing while the
> other does not).
> With the introduction of the queue configurations, we already reduced the
> per-job configuration possibilities, and in some cases these are already
> ignored.
>
> For the new version I plan to discontinue the per-job configuration of
> queues, as it is simply not worth the effort to support it. And having a
> single source of truth for queue configurations makes maintenance and
> troubleshooting way easier.
>
> b) Job API
> Until now, we've been leveraging the EventAdmin to add jobs but also to
> execute jobs. While this seemed elegant when we started with job
> handling, it adds another layer to the picture and some uncertainty:
> e.g. a job can be added by sending an event to the event admin, but the
> sender does not know whether this job really arrived at the job manager
> and/or got persisted at all. On the other hand, implementing a job
> processor based on event admin looks more complicated than it should be.
>
> Therefore I think it's time to add a method to the JobManager for adding
> a job - if this method returns, the job is persisted and gets executed.
> For processing, we make the job processor interface an OSGi service
> interface. Implementations register this service together with the topics
> they are able to process. This makes the implementation easier, but also
> makes it possible to find out which topics can be processed on a cluster
> node.
>
> c) Deprecate event admin based API
> As with b) we don't need the event admin based API anymore and should
> deprecate it - but of course for compatibility still support it.
>
> WDYT?
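To make (a) and (b) concrete, here is roughly how I picture them - all PIDs, property names and signatures below are guesses at what the final API could look like, not anything that exists today.

For (a), a queue configuration would be a global factory configuration deployed through ConfigAdmin, e.g.:

    import java.io.IOException;
    import java.util.Dictionary;
    import java.util.Hashtable;

    import org.osgi.service.cm.Configuration;
    import org.osgi.service.cm.ConfigurationAdmin;

    public class QueueConfigExample {

        // Creates one queue configuration; the factory PID and the
        // property names are illustrative only.
        public void createRenderQueue(final ConfigurationAdmin configAdmin)
                throws IOException {
            final Configuration config = configAdmin.createFactoryConfiguration(
                    "org.apache.sling.event.jobs.QueueConfiguration", null);

            final Dictionary<String, Object> props = new Hashtable<String, Object>();
            props.put("queue.name", "render-queue");
            props.put("queue.topics", new String[] { "com/example/jobs/render" });
            props.put("queue.type", "ORDERED");   // no parallel processing
            props.put("queue.retries", 3);        // retry a failed job three times
            props.put("queue.retrydelay", 2000L); // two seconds between retries
            config.update(props);
        }
    }

And for (b), a job processor would be registered as a plain OSGi service, bound to its topics via a service property:

    import org.apache.sling.event.jobs.Job;
    import org.apache.sling.event.jobs.consumer.JobConsumer;
    import org.osgi.service.component.annotations.Component;

    // The "job.topics" service property would tell the job manager which
    // topics this cluster node is able to process.
    @Component(service = JobConsumer.class,
               property = { "job.topics=com/example/jobs/render" })
    public class RenderJobConsumer implements JobConsumer {

        @Override
        public JobResult process(final Job job) {
            // do the actual work; OK marks the job as done, FAILED would
            // trigger a retry according to the queue configuration
            return JobResult.OK;
        }
    }

Adding a job would then be a plain method call instead of an event:

    // when addJob() returns, the job is known to be persisted
    final Job job = jobManager.addJob("com/example/jobs/render",
            java.util.Collections.<String, Object> singletonMap(
                    "asset", "/content/video.mp4"));
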
I think the event based processing in the current implementation nicely decouples the processing of jobs, but the implementation lacks a reliable distributed queue and so is bound to a single node, severely limiting scalability. The concept, though not the implementation, reminds me of some extremely scalable BPM implementations. Rather than attempting to internalise that within a JobManager implementation, have you considered addressing this with a dedicated distributed queue, or reusing an off-the-shelf component that has been proven? (A rough sketch of what that could look like follows below.)

Ian

>
> Regards
> Carsten
> --
> Carsten Ziegeler
> [email protected]
>
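To make the "off the shelf" idea a bit more concrete: the JobManager could keep its Sling-facing API but hand the actual queuing to a proven broker such as ActiveMQ over plain JMS. A rough sketch - broker URL, queue name and message layout are illustrative assumptions, not an existing integration:

    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.MapMessage;
    import javax.jms.MessageProducer;
    import javax.jms.Queue;
    import javax.jms.Session;

    import org.apache.activemq.ActiveMQConnectionFactory;

    public class JmsJobDispatcher {

        public static void main(final String[] args) throws Exception {
            // a broker shared by all cluster nodes provides the reliable,
            // distributed queue that the JCR-lock approach lacks
            final ConnectionFactory factory =
                    new ActiveMQConnectionFactory("tcp://broker:61616");
            final Connection connection = factory.createConnection();
            connection.start();
            try {
                final Session session =
                        connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
                final Queue queue = session.createQueue("sling/jobs/render");
                final MessageProducer producer = session.createProducer(queue);

                // one message per job; the broker takes care of persistence,
                // redelivery and distribution across consumers
                final MapMessage message = session.createMapMessage();
                message.setString("job.topic", "com/example/jobs/render");
                message.setString("asset", "/content/video.mp4");
                producer.send(message);
            } finally {
                connection.close();
            }
        }
    }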
