You could always attach the file's last modification time to the beanstalk message at the time of queue insertion. When the job is retrieved, the consumer checks that the file's mtime is not later than the time recorded in the job. If it is later, the file was modified again while this job was waiting to be taken, so this job can be skipped: the job enqueued for that later modification will handle it. This means you will still have duplicate jobs in the queue, but only the last one added will actually take significant resources to perform, *and* you know that the computation runs against the latest version of the file.

One caveat: if other processes have write access into your file hierarchy, a stray modification that wasn't paired with an enqueued job could result in the file not getting indexed at all, since every automated indexing job would defer to the one associated with the most recent change -- which was never added to the queue.
The problem could also be solved in other ways (including with filename semaphores, or other pieces of state stored in the job body). The key thing is that you probably don't care whether you have multiple jobs in the queue for the same file; you just want to make sure that only one of them actually performs the calculation (and presumably, that the calculation runs against the latest version).

Another approach, which doesn't suffer from the problem of errant writes to the filesystem, is to store an indication of the file's version (this could be as simple as the file's mtime) in the index database itself. When a job is picked up from the queue, the indexer compares the mtime stored in the index with the file's current mtime. If they match, the indexer knows that another indexer has already taken care of this version of the file, and the job can immediately be discarded so the worker can pick the next message off the queue.

-- William

On Mon, Oct 11, 2010 at 3:06 PM, Ron Mayer <[email protected]> wrote:
> I'm using beanstalkd to queue up jobs to re-index documents whenever they
> get updated; so my jobs in this case are all simple paths/urls to documents.
>
> For some documents that change faster than the queue is drained, I end up
> getting the same job in the queue dozens of times; and then doing extra work
> re-processing them unnecessarily.
>
> Is there a good way to say "put this in the queue if it's not already in
> there"?
>
> If not, does anyone have a good way of handling this outside of beanstalkd?
>
> I'm considering adding something to memcached saying
> "document file://whatever is in the queue = true"
> whenever I enqueue one; check for that flag before adding it again;
> and remove the flag when I process it;
> but was wondering if there's an easier/better/more conventional way.
>
> --
> You received this message because you are subscribed to the Google Groups
> "beanstalk-talk" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/beanstalk-talk?hl=en.
