You could attach the file's last modification time (mtime), read at
the time of queue insertion, to the beanstalk message. When a job is
retrieved, the queue consumer checks whether the file's current mtime
is later than the time recorded in the job. If it is, the file was
modified again while this job was waiting in the queue, so the job can
be skipped (the job queued for that later modification will handle
it). You will still have duplicate jobs in the queue, but only the
last one added takes significant resources to perform, *and* you know
that the computation runs against the latest version of the file.

The catch: if other processes have write access to your file
hierarchy, a stray modification that didn't enqueue a job could leave
the file unindexed altogether, because every queued job defers to the
one for the most recent change, and that job was never added to the
queue.
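
As a rough sketch of the mtime-in-the-job-body idea (Python, untested
against a real beanstalkd): the JSON job format and the names
make_job_body, is_stale, and handle_job are all made up for
illustration, and index_file stands in for whatever your indexer does.
The beanstalk put/reserve/delete calls are left out; the body string
is what you would put on the queue.

```python
import json
import os


def make_job_body(path):
    """Producer side: record the path and the file's mtime at enqueue time."""
    return json.dumps({"path": path, "mtime_ns": os.stat(path).st_mtime_ns})


def is_stale(job_body):
    """True if the file was modified after this job was queued."""
    job = json.loads(job_body)
    return os.stat(job["path"]).st_mtime_ns > job["mtime_ns"]


def handle_job(job_body, index_file):
    """Consumer side: skip stale jobs, otherwise index the file.

    index_file is a hypothetical callable taking a path.
    Returns True if the file was indexed, False if the job was skipped.
    """
    if is_stale(job_body):
        # A later modification exists; its own job is expected to do the work.
        return False
    index_file(json.loads(job_body)["path"])
    return True
```

Either way the consumer deletes the job afterwards; the only
difference is whether the expensive indexing step ran.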

The problem could also be solved in other ways (including filename
semaphores or other pieces of state stored in the job body). The key
point is that you probably don't care whether the queue holds multiple
jobs for the same file; you just want only one of them to actually
perform the calculation (and presumably you want that calculation to
run against the latest version). Another approach, which doesn't
suffer from the problem of errant writes to the filesystem, is to
store an indication of the file's version (this could be as simple as
the file's mtime) in the index database. When a job is picked up from
the queue, the indexer checks whether the mtime recorded in the index
matches the file's current mtime. If it does, another indexer has
already taken care of this version of the file, and the indexer can
immediately go and pick a new message off the queue.
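
A minimal sketch of that second approach, again in Python and only
illustrative: index_db is a plain dict standing in for your real index
database, and index_if_needed and build_index are hypothetical names.

```python
import os


def index_if_needed(path, index_db, build_index):
    """Index path unless index_db already covers the file's current mtime.

    index_db is a dict standing in for the index database; build_index is
    a hypothetical callable that does the expensive work for one path.
    Returns True if the file was indexed, False if the job was a no-op.
    """
    # Read the mtime before indexing: if the file changes while we work,
    # the stored value will not match and a later job will re-index it.
    mtime_ns = os.stat(path).st_mtime_ns
    entry = index_db.get(path)
    if entry is not None and entry["mtime_ns"] == mtime_ns:
        return False  # this version of the file is already indexed
    index_db[path] = {"mtime_ns": mtime_ns, "data": build_index(path)}
    return True
```

Because the check compares against the file itself rather than against
other jobs, a write that never enqueued anything can't cause the file
to be skipped; it just gets picked up by the next job for that path.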

  -- William

On Mon, Oct 11, 2010 at 3:06 PM, Ron Mayer <[email protected]> wrote:
> I'm using beanstalkd to queue up jobs to re-index documents whenever they
> get updated; so my jobs in this case are all simple paths/urls to documents.
>
> For some documents that change faster than the queue is drained, I end up
> getting the same job in the queue dozens of times; and then doing extra work
> re-processing them unnecessarily.
>
> Is there a good way to say "put this in the queue if it's not already in 
> there"?
>
>
> If not, does anyone have a good way of handling this outside of beanstalkd?
>
> I'm considering adding something to memcached saying
> "document file://whatever is in the queue = true"
> whenever I enqueue one; check for that flag before adding it again;
> and remove the flag when I process it;
> but was wondering if there's an easier/better/more conventional way.
>
> --
> You received this message because you are subscribed to the Google Groups 
> "beanstalk-talk" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to 
> [email protected].
> For more options, visit this group at 
> http://groups.google.com/group/beanstalk-talk?hl=en.
>
>
