On 11-01-04 12:29 PM, Julian Edwards wrote:
> Yeah, none of these are acceptable really, but if there's only a single
> writer, writing single records in each transaction, then it will work as I
> proposed.
Earlier, you said:
> The timestamp would also need to live on each context record as well, of
> course.

Most of our data already has this, so I assumed that you intended to use
our existing data as the context records.  Am I mistaken?

If we do use our existing data as context records, we will be adding new
constraints on how we create our existing data (single writer, multiple
transactions), and new failure modes.  For example, if we break the code up
into multiple transactions, that could break an assumption that the entire
operation either succeeds or fails.  For another example, someone could
come along later and, not understanding why there are multiple
transactions, "optimize" the code back into a single transaction.

>> As I said, timestamps are an approximation of sequence, but we have
>> genuine sequences for pretty much every table: integer ID columns.  If
>> you order the operations by database ID rather than by timestamp, then
>> you can record the last ID completed, and there is no room for
>> ambiguity.  So I think it's simpler to use database IDs rather than
>> timestamps.
>
> There's a couple of reasons I shied away from IDs:
> 1. Timestamps are really useful to eyeball as opposed to IDs.

Agreed.

> 2. Simple integer IDs can overflow on a busy system

I'm not sure what you mean.  We both agree that an ID has a maximum value,
determined by its integer type.  However, AFAIK, none of our existing data
exceeds the range of the default 4-byte integer type; only BranchRevision
comes close.

But maybe you're not talking about our existing data?  If we have new data
that our standard 4-byte integer can't handle, shouldn't we use BIGINT for
its database ID?  BIGINT can represent just as many discrete values as a
timestamp, because both are eight-byte numbers.  But because a BIGINT
primary key is assigned sequentially, far fewer discrete values are
wasted.  Timestamps can represent 4713 BCE to 294276 CE at 1-microsecond
resolution, but it's doubtful that Launchpad will ever need to represent a
range of more than 100 years, so the rest of those values are wasted.

Or to look at it another way, let's assume we exhaust a BIGINT in a year.
That's 584,942 records per microsecond:

    BIGINT_SIZE = pow(256, 8)
    YEAR_MICROSECONDS = 365 * 24 * 60 * 60 * 1000000
    print BIGINT_SIZE / YEAR_MICROSECONDS
    584942L

So if we're so busy that BIGINT is inadequate, we will have hundreds of
thousands of records sharing each timestamp.

Were you also suggesting that when we reach the maximum value of an ID
column, the sequence will wrap around back to 1?  I'm no SQL expert, but I
cannot find any documentation saying that sequences wrap by default.  I've
had a look at
http://www.postgresql.org/docs/8.4/static/functions-sequence.html
and it doesn't suggest they do.  Even if a BIGINT sequence did wrap, the
reused IDs would most likely violate a unique constraint, whereas duplicate
timestamps would be a silent failure unless you also added a unique
constraint to the timestamps.  In any case, this is a risk we run with
every table in the database.

>> Your idea reminds me of two things:
>> 1. DBLoopTuner
>> 2. "micro-jobs" from
>> https://dev.launchpad.net/Foundations/NewTaskSystem/Requirements
>>
>> Perhaps those could also provide inspiration?
>
> The idea that I want to encapsulate is the concept of atomically storing a
> restart-point, which I can't find expressed in either of these.

Sure, but that's not the only way to solve your use cases.  These both
provide design patterns that could be used for interruptible processing of
continuous data.
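To make that concrete, here is roughly the ID-based restart-point pattern I
have in mind.  This is only a sketch, not DBLoopTuner's actual code: the
table names (source_table, restart_point), the do_work() helper, and the
psycopg2-style cursor are all invented for illustration.

    def process_batch(cursor, batch_size=100):
        """Process one batch, recording the restart point atomically.

        Assumes a restart_point table with one row per script, holding
        the ID of the last record that was successfully processed.
        """
        # Read the restart point left by the previous run.
        cursor.execute(
            "SELECT last_id FROM restart_point WHERE script = %s",
            ("my-script",))
        last_id = cursor.fetchone()[0]

        # Fetch the next batch in ID order; integer IDs give an
        # unambiguous sequence, so no timestamp comparison is needed.
        cursor.execute(
            "SELECT id, payload FROM source_table"
            " WHERE id > %s ORDER BY id LIMIT %s",
            (last_id, batch_size))
        rows = cursor.fetchall()
        for row_id, payload in rows:
            do_work(payload)  # whatever the script actually does
            last_id = row_id

        # Record the new restart point in the same transaction as the
        # work, so both commit or roll back together.
        cursor.execute(
            "UPDATE restart_point SET last_id = %s WHERE script = %s",
            (last_id, "my-script"))

        # True if there may be more records left to process.
        return len(rows) == batch_size

The point of the sketch is that the work and the new restart point commit
in the same transaction, so resuming after an interruption just means
starting again from the last committed ID.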
DBLoopTuner relies on the TuneableLoop's __call__() method to store the
restart-point.  So, for example,
lp.translations.scripts.verify_pofile_stats.Verifier uses self.start_id as
the restart-point.  Your idea is similar to a TuneableLoop, except that you
want to store the restart point persistently, and you want it to be
explicitly a timestamp instead of an implementation detail.

To apply "micro-jobs" to this problem, you would represent each operation
as a "micro-job" and directly record which jobs had been run and which had
not.  The specifics depend on how we end up implementing the new task
system, but one obvious way would be to give each micro-job a status like
BuildStatus.

> Another thing we could do is to manually add some microseconds on to the
> timestamps

I guess you mean changing the microseconds on the timestamps to ensure they
are unique?  That does not guarantee uniqueness unless you have a single
writer that never writes more than once per microsecond, and if we were
really busy, that restriction would make us too slow.

> or to add a "serial" column,

If you add an integer column, it would make sense for its values to refer
to the database IDs.  In that case, declaring it "serial" to Postgres
wouldn't matter, because you'd be assigning arbitrary values to it.  Of
course, once you're using database IDs, you don't need the timestamps any
more.  If the column used its own sequence to refer to the rows, then we
would need to add a column to every supported table and assign a value to
every row of those tables that could be referenced, which I think would get
messy.

> if they get encapsulated in a
> different way.  I've not thought in any depth how I'd do that though.
>
> Do you think that if we narrow down the constraints to single-writer, single
> record per transaction, it would diminish the usefulness of this too much?
> I'm fairly sure it would be OK in the cases I know of already.

I think it's in conflict with your argument that "Simple integer IDs can
overflow on a busy system", because overflowing even a BIGINT implies
hundreds of thousands of writes per microsecond, which a single writer
committing one record per transaction could never produce.

I think that if we need an ordered list of unique identifiers, then it's
much simpler to use integer IDs than timestamps.

Aaron

