RE: [jBoss-User] container generated primary key for CMP EntityEJB

Tom Cook Tue, 03 Oct 2000 23:16:53 -0700
Before we return to the fray, let me point out that my signature is
misleading.  My boss gave me the title of 'software engineer', so I
use it.  In reality my only qualifications are electrical, and
currently I am in software operations.  This, clearly, does not
qualify me as an expert in this field, so take what I say as my
opinion, often me thinking out loud.  I also am ready to be convinced,
but I am not ready to accept a position blindly.  Perhaps this makes
this the wrong forum for this discussion; if so, someone'll probably
let us know soon enough with a (metaphorical) well aimed brick.

[EMAIL PROTECTED] writes:
 > On Oct 4, Tom Cook quoth:
[snip]
 > I found this to be so
 > full of assumptions I felt it necessary to point out just how many of
 > these assumptions would have to be true in order for this to be the case.

Fair.

 > > It just seems to me (and I am no expert in the field) that it makes
 > > sense to have your database tightly linked to the key generator for
 > > that database. 
 > 
 > Your assertion here (and correct me if this isn't what the above sentence
 > says), databases & their key generators should be tightly linked.

Yes.

 > I'm
 > interpreting this to mean "makes sense to have your [EJBs] tightly linked
 > to the key generator for [the] database [they are stored in]".

Not quite.  While I am ready to support databases and key generators
being linked, I am not ready to break encapsulation in quite this
way.  EJBs should not be dependent on the _implementation_ of a key
generator, only its interface.  I am merely advocating the use of a
database-generated key as a good implementation.  This does, of
course, mean that no entity beans can be created if the database goes
down, but the database going down will cause other problems which will
manifest themselves before this one.  And if your database is in a
clustered configuration and fails over, the sequence (or whatever)
used for key generation goes with it.

[snip agreement on databases being behind EJBs]
 > > Of course
 > > having one database generate keys for another is not sensible since,
 > > as you point out, if your key generator goes down then your other
 > > databases are stuffed. 
 > 
 > Agreed, however your Container is the transaction engine and therefore
 > ultimately responsible for the behavior of both.  Recall that your
 > proposal was for the database to be the key generator for the container
 > which the beans would call upon for new keys.

Beans which are stored in that database, yes.  It could even be
considered a good move to have separate generators for each table,
thereby maximizing your usage of the key space, but I haven't really
thought hard about that idea, so don't take it as something I hold to.

 > > You may well argue that we want to be able to move this to any
 > > database backend;
 > 
 > Corollary: flexibility to deal with unforeseen future requirements is a
 > good thing.  Agreed 100% and I would argue that except in the most trivial
 > of applications this is essential.

Yes, but the OO paradigm's way of coping with this is through
abstraction.  So the interface to our key generator is independent of
the implementation.  
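
A minimal sketch of that abstraction, in Java.  Every name in it
(KeyGenerator, InMemoryKeyGenerator, OrderFactory) is hypothetical and
is not taken from jBoss or the EJB spec; the in-memory counter is only
a stand-in for whichever implementation is eventually chosen.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Sketch only: all names here are hypothetical. */
final class KeyGeneratorSketch {

    /** The only thing a caller sees: ask for the next key. */
    interface KeyGenerator {
        long nextKey();
    }

    /** Stand-in implementation backed by an in-memory counter. */
    static final class InMemoryKeyGenerator implements KeyGenerator {
        private final AtomicLong counter = new AtomicLong();

        @Override
        public long nextKey() {
            return counter.getAndIncrement();
        }
    }

    /** A caller written against the interface; it never names an implementation. */
    static final class OrderFactory {
        private final KeyGenerator keys;

        OrderFactory(KeyGenerator keys) {
            this.keys = keys;
        }

        String newOrderId() {
            return "order-" + keys.nextKey();
        }
    }

    public static void main(String[] args) {
        OrderFactory factory = new OrderFactory(new InMemoryKeyGenerator());
        System.out.println(factory.newOrderId());
        System.out.println(factory.newOrderId());
    }
}
```

Swapping in a database-backed generator would then mean changing only
the object handed to the constructor, not the caller.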

[snip]

I've just snipped the entire bit about enterprise applications not
moving much.  This is not because I can't answer it and am ignoring
it, but because it was a rash line of argument which will not hold up
and I retract it.

[snip request for a supporting argument]

Position one - database generated keys.
---------------------------------------

The high word of the key is obtained from a database and the low word
is maintained in an internal sequence.  Note that here a word is not
some fixed or system-dependent size (i.e. not necessarily 16 or 32
bits).
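
The scheme just described might look roughly like the following Java
sketch.  It is not the code we run; the class name, the LongSupplier
standing in for the database sequence, and the choice of a 2-bit low
word in the demonstration are all assumptions made to keep the example
small.

```java
import java.util.function.LongSupplier;

/** Sketch only: names and the word width are assumptions made for illustration. */
final class HiLoSketch {

    static final class HiLoKeyGenerator {
        private final LongSupplier highWordSource; // stands in for a database sequence
        private final int lowBits;                 // width n of the low word
        private final long lowLimit;               // 2^n
        private final Object lock = new Object();
        private long high;
        private long low;

        HiLoKeyGenerator(LongSupplier highWordSource, int lowBits) {
            if (lowBits < 1 || lowBits > 31) {
                throw new IllegalArgumentException("lowBits must be in 1..31");
            }
            this.highWordSource = highWordSource;
            this.lowBits = lowBits;
            this.lowLimit = 1L << lowBits;
            this.low = lowLimit; // forces a fetch of the high word on first use
        }

        long nextKey() {
            synchronized (lock) {
                if (low == lowLimit) {
                    // Low word exhausted: this is the only point that touches the database.
                    high = highWordSource.getAsLong();
                    low = 0;
                }
                return (high << lowBits) | low++;
            }
        }
    }

    public static void main(String[] args) {
        long[] sequence = {0}; // pretend database sequence
        int[] fetches = {0};   // counts round trips to the "database"
        HiLoKeyGenerator gen = new HiLoKeyGenerator(() -> {
            fetches[0]++;
            return sequence[0]++;
        }, 2);
        for (int i = 0; i < 6; i++) {
            System.out.println(gen.nextKey());
        }
        System.out.println("high-word fetches: " + fetches[0]);
    }
}
```

With a 2-bit low word the six calls above need two fetches of the high
word; a real generator would use a much wider low word so that fetches
are correspondingly rarer.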

pros:
  (1) Keys are guaranteed to be unique (until key-space rollover).
      The database makes getting a new value from a sequence and
      incrementing the sequence an atomic operation, guaranteeing that
      you can not get a duplicate high word.  The key generator can
      then make getting a new low word and incrementing the internal
      sequence atomic (using a lock object) and we have guaranteed
      unique keys.
  (2) Makes data consistency/uniqueness the responsibility of the data
      store.
      Although this may be considered an unnecessary system
      dependency, it may also be considered a good piece of
      encapsulation.  Now if we can insert into the database then we
      can create keys, and if we can't insert into the database then
      we can't create keys.  There seems little point in making your
      key generator independent of the database, since the _only_ use
      for the key generator is in generating keys for the database.
  (3) High performance.
      For n-bit words, you need to access the database every 2^n calls
      to the key generation function.  Other than this, all you have
      to do is an addition and comparison on each call - maybe 20
      instructions if your compiler is hopeless.
  (4) Simple and clean implementation.
      It's a fifty-liner if you're sloppy with it.
  (5) Makes maximum use of the key space.
      This may sound like I like being pedantic about efficiency, but
      this is a real concern.  Take your system where you append a
      micro-second resolution timestamp to your key.  Say your system
      scores, on average, one million inserts in a day.  There are
      86,400 million (86,400,000,000) micro-seconds in a day.  You
      have just wasted 86,399,000,000 keys in your keyspace.  At this
      rate, you will waste 86,399/86,400 of your keys, or 99.9988% of
      your keys.  This means that, for every row in your database,
      there are about sixteen and a half wasted bits just in the
      timestamp.  Over one day that's about 16 megabits (2 megabytes)
      of wasted space in your database.  Over one year, that's about
      6 gigabits, or three quarters of a gigabyte.  It adds up, and
      this is a site under fairly heavy usage (in one year it will
      collect 365 million records).  On a lightly loaded site the
      wastage per row will be much worse.
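
For anyone who wants to check the key-space arithmetic, here is a
small Java sketch of it.  It takes a day as exactly 86,400 seconds and
the load as one million inserts a day, as in the example; the class and
method names are made up for illustration.  The wasted bits per key are
the base-2 logarithm of (timestamp slots per day / inserts per day),
which comes out at a little under sixteen and a half.

```java
/** Sketch only: the class and method names are hypothetical. */
final class KeySpaceWaste {

    /** Bits of key space left unused per key when only some timestamp slots are taken. */
    static double wastedBitsPerKey(long slotsPerDay, long insertsPerDay) {
        return Math.log((double) slotsPerDay / insertsPerDay) / Math.log(2.0);
    }

    public static void main(String[] args) {
        long microsPerDay = 24L * 60 * 60 * 1_000_000; // 86,400,000,000
        long insertsPerDay = 1_000_000;                 // load assumed in the example

        double bitsPerKey = wastedBitsPerKey(microsPerDay, insertsPerDay);
        double megabitsPerDay = bitsPerKey * insertsPerDay / 1e6;
        double gigabitsPerYear = megabitsPerDay * 365 / 1e3;

        System.out.printf("wasted bits per key:      %.1f%n", bitsPerKey);
        System.out.printf("wasted megabits per day:  %.1f%n", megabitsPerDay);
        System.out.printf("wasted gigabits per year: %.1f%n", gigabitsPerYear);
    }
}
```

Lowering insertsPerDay shows the lightly loaded case: the fewer
inserts there are, the more bits each row throws away.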

cons:
  (1) Database failure will make keys unavailable.
      The significance of this is very debatable, since absence of the
      database tends to make primary key generation an unnecessary
      operation, unless you have some sort of caching mechanism which
      hopes that the database will come back up before it runs out of
      memory.
  (2) Sequences are implemented differently on different databases,
      making your key generator non-portable to another database.
      Re-implementing a key generator is 50 lines of code if you're
      being sloppy about it.  The new guy, the one you don't know what
      to do with yet, he should be able to do this in a few minutes if
      your documentation's worth the bandwidth used to download it.
  (3) Sequences are not implemented on every database.
      This is an admitted deficiency.  However, while not wanting to
      start a flame war I'm sure we've all read many times before, most
      databases worth their salt have one.  (Note that my opinions on
      which databases are worth their salt are rather restrictive, but
      why not use the best?)

Position two - system information keys.
---------------------------------------

The key is constructed from the concatenation of:
                - the IP address of the host
                - process id
                - a timestamp
                - a serial number
                - some random number (optional)
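
As a rough illustration of that concatenation, here is a Java sketch.
The field widths (4-byte IPv4 address, 4-byte process id, 8-byte
timestamp, 4-byte serial number), the byte order and all the names are
assumptions of mine, the optional random number is left out, and the
process id is passed in as a parameter rather than looked up.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Sketch only: field widths, ordering and names are assumptions. */
final class SystemInfoKeySketch {

    /** Concatenates the components in the order listed above. */
    static byte[] compose(byte[] ipv4, int processId, long timestamp, int serial) {
        if (ipv4.length != 4) {
            throw new IllegalArgumentException("expected a 4-byte IPv4 address");
        }
        return ByteBuffer.allocate(4 + 4 + 8 + 4)
                .put(ipv4)
                .putInt(processId)
                .putLong(timestamp)
                .putInt(serial)
                .array();
    }

    /** Renders the key as lowercase hex for display. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] ip = {10, 0, 0, 1};
        byte[] first = compose(ip, 4242, 1_000L, 7);
        byte[] second = compose(ip, 4242, 1_000L, 7); // same inputs again
        System.out.println(toHex(first));
        System.out.println("identical: " + Arrays.equals(first, second));
    }
}
```

The key is purely a function of its inputs, so its uniqueness rests
entirely on at least one component differing between any two calls.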

pros:
  (1) Independent of other systems.
      This method does not depend on any other system whose
      availability can not be absolutely assumed; if there's an O/S
      there, it'll run.
  (2) It may be considered architecturally more elegant to keep the
      key generator conceptually separated from the data store.

cons:
  (1) Not guaranteed unique.
      A non-unique key may be generated in the case of clock
      reset/overflow or serial number overflow within the resolution
      of the timestamp.  This may sound unlikely, but the point is
      that it is not a guaranteed unique key that is generated.  The
      chance of duplicate keys being generated is greatly increased
      if you have multiple key generators running in a single process
      (but different threads).  The tacking on of a random number to
      decrease the likelihood of duplicate keys looks like a tacky
      way of patching up an algorithm that someone looked at and
      decided they weren't quite sure about.
  (2) Performance hit.
      Each key generation requires the acquisition of a timestamp and
      the generation of a random number.  Random number generation, in
      particular, is not always a light-weight activity.
  (3) Poor utilization of key space.
      See pro #5 for position one.
  (4) Java implementation difficulties.
      Since people seem to do this I guess it's possible, but I have
      yet to come across a platform independent way of getting the
      current process's ID without implementing a JNI method which
      calls getpid() from the standard C library.  Indeed, the notion
      of a process id is about as portable as the notion of a database
      sequence.

I think this is a pretty fair sort of comparison; feel free to come up
with your own.  The 'system information keys' method is a bit light-on
in the pros department, but this might just be because I haven't seen
the light yet.

In response to your slightly personal attack regarding my advocating
this position in a public forum, please note that I am not the only
one to have suggested it, indeed I was not the originator of the idea.
I merely pointed out that we use it and have been defending it since.

Aaron Mulder made the suggestion originally, in post 03531, Rickard
Oberg suggested it as an EJB in post 03566, and there is a similar,
though simpler, method outlined on www.theserverside.com in their
patterns page (note that, to make this one guaranteed unique it
suffers a performance hit).

 > --------------------------------------------------------------------------
 >   Some people mistake the positions I take with the beliefs that I hold.
 > --------------------------------------------------------------------------

Don't you hate that?

Regards
Tom
-- 
Tom Cook - Software Engineer

"We rarely find that people have good sense unless they agree
 with us."
                - Francois, Duc de la Rochefoucauld

LISAsoft Pty Ltd - www.lisa.com.au

--------------------------------------------------
38 Greenhill Rd.          Level 3, 228 Pitt Street
Wayville, SA, 5034        Sydney, NSW, 2000

Phone:   +61 8 8272 1555  Phone:   +61 2 9283 0877
Fax:     +61 8 8271 1199  Fax:     +61 2 9283 0866
--------------------------------------------------

