On Tuesday 04 January 2011 16:24:46 Aaron Bentley wrote:
> > In a previous life, the context data that I've used for this is a
> > timestamp, and it worked very well in pretty much all cases I came
> > across. The client application simply provides the same timestamp to a
> > query/api call from the last item it processed, and the data continues
> > to flow from where it left off. This ticked all the boxes for data
> > integrity and polling or streaming usage.
>
> Timestamps are an approximation of sequence, because sometimes there are
> multiple rows with the same timestamp. This is not unlikely, because
> the multiple rows may be created as part of a single transaction.
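To make the ambiguity concrete, here is a minimal sketch (not Launchpad code; `fetch_rows` and the row data are hypothetical) of a timestamp-cursor resume that silently drops a row when two rows share the boundary timestamp:

```python
# Hypothetical sketch of a timestamp-based restart point.
# Two rows created in one transaction share a timestamp, so a
# strictly-greater resume query cannot see the unprocessed one.
from datetime import datetime, timedelta

T = datetime(2011, 1, 4, 16, 24, 46)
ROWS = [
    (T, "row-a"),
    (T, "row-b"),                      # same timestamp as row-a
    (T + timedelta(seconds=1), "row-c"),
]

def fetch_rows(since):
    # Strictly-greater comparison: rows sharing the cursor
    # timestamp become invisible after a restart.
    return [r for r in ROWS if r[0] > since]

cursor = T                     # we stopped right after processing row-a
remaining = fetch_rows(cursor)
# row-b is skipped even though it was never processed:
assert [name for _, name in remaining] == ["row-c"]
```

Switching the comparison to `>=` trades this for the opposite problem: row-a would be reprocessed on every restart.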
Right, I'd forgotten that the implementation I previously worked on
automatically appended a guaranteed-unique counter to the timestamp's
micro- (or maybe nano-) seconds part.

> Because there's room for ambiguity, a process such as you describe could
> 1. refuse to stop while the next row to be processed has the same
> timestamp as the current row.
> 2. stop in the middle and when it starts again, skip the remainder of
> items with the same timestamp.
> 3. stop in the middle and when it starts again, start from the first
> item with that timestamp.
> 4. ?
>
> 1. could work, if it's not essential that we stop immediately.
> 2. is usually undesirable, but can sometimes be fixed up by a second
> cron job that detects that the work still needs to be done.
> 3. could work, if running the operation twice for a given item doesn't
> do any harm. However, we could get stopped again, and again, and
> again, and never finish running the operation on all rows.

Yeah, none of these are really acceptable, but if there's only a single
writer, writing a single record in each transaction, then it will work
as I proposed.

> As I said, timestamps are an approximation of sequence, but we have
> genuine sequences for pretty much every table: integer ID columns. If
> you order the operations by database ID rather than by timestamp, then
> you can record the last ID completed, and there is no room for
> ambiguity. So I think it's simpler to use database IDs rather than
> timestamps.

There are a couple of reasons I shied away from IDs:

1. Timestamps are much easier to eyeball than opaque IDs.
2. Simple integer IDs can overflow on a busy system.

> Your idea reminds me of two things:
> 1. DBLoopTuner
> 2. "micro-jobs" from
> https://dev.launchpad.net/Foundations/NewTaskSystem/Requirements
>
> Perhaps those could also provide inspiration?

The idea that I want to encapsulate is the concept of atomically storing
a restart-point, which I can't find expressed in either of these.
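For comparison, Aaron's ID-cursor scheme could look something like this sketch (illustrative names only, and in real code the cursor would be stored atomically with the work, e.g. in the same transaction):

```python
# Sketch of resuming by integer ID: record the last ID completed,
# and restart from strictly after it. Because IDs form a genuine
# sequence, there is no boundary ambiguity to resolve.
ROWS = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]

def process_from(last_id, batch=2):
    """Process up to `batch` rows after last_id; return new cursor."""
    done = last_id
    for row_id, payload in ROWS:
        if row_id <= last_id:
            continue             # already completed before the restart
        # ... do the real work on payload here ...
        done = row_id            # would be persisted atomically
        if done - last_id >= batch:
            break                # simulate being stopped mid-run
    return done

cursor = 0
cursor = process_from(cursor)    # processes rows 1 and 2
cursor = process_from(cursor)    # resumes cleanly at row 3
assert cursor == 4
```

The same strictly-greater comparison that loses rows with duplicate timestamps is perfectly safe here, because no two rows share an ID.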
Another thing we could do is to manually add some microseconds to the
timestamps, or to add a "serial" column, if they get encapsulated in a
different way. I've not thought in any depth about how I'd do that,
though.

Do you think that if we narrow the constraints down to a single writer
and a single record per transaction, it would diminish the usefulness of
this too much? I'm fairly sure it would be OK in the cases I know of
already.

J
_______________________________________________
Mailing list: https://launchpad.net/~launchpad-dev
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~launchpad-dev
More help   : https://help.launchpad.net/ListHelp
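A rough sketch of the "serial" tie-breaker idea (column name and data are hypothetical): the cursor becomes the composite (timestamp, serial) pair, which restores a strict total order even when many rows share a timestamp:

```python
# Sketch, assuming a per-row "serial" tie-breaker alongside the
# timestamp. The restart point is the composite (timestamp, serial)
# pair rather than the timestamp alone.
from datetime import datetime

T = datetime(2011, 1, 4, 16, 24, 46)
ROWS = [
    ((T, 1), "row-a"),
    ((T, 2), "row-b"),   # same timestamp, distinct serial
    ((T, 3), "row-c"),
]

def fetch_rows(cursor):
    # Lexicographic tuple comparison: timestamp first, then serial.
    # Strictly-greater is now safe, since no two keys are equal.
    return [r for r in ROWS if r[0] > cursor]

cursor = (T, 1)              # stopped right after processing row-a
remaining = fetch_rows(cursor)
assert [name for _, name in remaining] == ["row-b", "row-c"]
```

In SQL this would correspond to ordering by (timestamp, serial) and resuming with a row-wise comparison against the stored pair, which still needs only a single writer per serial allocation rather than per transaction.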

