On Wed, 25 Oct 2006, Bruce Momjian wrote:

...snip...
>
> > Data partitioning is often done within a single database on a single
> > server and therefore, as a concept, has nothing whatsoever to do with
> > different servers. Similarly, the second paragraph of this section is
>
> Uh, why would someone split things up like that on a single server?
>
> > problematic. Please define your term first, then talk about some
> > implementations - this is muddying the water. Further, there are both
> > vertical and horizontal partitioning - you mention neither - and each
> > has its own distinct uses. If partitioning is mentioned, it should be
> > more complete.
>
> Uh, what exactly needs to be defined.
OK, "Data partitioning": data partitioning begins in the RDB world with
the very notion of tables, and we partition our data during schema
development with the goal of "normalizing" the design - "third normal
form" being the one most professors talk about as a target. "Data
partitioning", then, is the intentional denormalization of the design to
accomplish some goal(s) - not all of which are listed in this document's
title.

In this context, data partitioning takes two forms, based upon which axis
of a two-dimensional table is to be divided: a vertical partition divides
the attributes (as in a master/detail relationship with a one-to-one
mapping), while a horizontal partition divides on the domain, or values,
of one or more attributes (as in your example of London records being
kept in a database in London, while Paris records are kept in Paris).
The point I was making was that that section of the document was in error
because it presumed there was only one form of data partitioning, and
that it was horizontal. (The document is now missing, so I can't look at
the current content - it was here:
ftp://momjian.us/pub/postgresql/mypatches/replication.)

In answer to your query about why someone would use such partitioning,
the nearly universal answer is performance, and the distant second answer
is security. In one example that comes immediately to mind, there is a
table which is a central core of an application, and, as such, there's a
lot to say about the items in this table. The table runs to tens or
hundreds of millions of rows and needs to be joined with something else
in a huge fraction of queries. For performance reasons, the table was
therefore kept as narrow as possible, and detail table(s) hold the
remaining attributes that logically belong in it - it's a vertical
partition. It's an exceptionally common technique - so common, it
probably didn't occur to you that you were even talking about it when you
spoke of "data partitioning."
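To make the vertical-partition example above concrete, here is a minimal
sketch - using Python's sqlite3 as a stand-in for PostgreSQL, with
hypothetical table and column names (`item`, `item_detail`) - of a narrow
"hot" core table paired with a one-to-one detail table:

```python
# Vertical partitioning sketch: sqlite3 stands in for PostgreSQL;
# the schema and names are illustrative, not from the document.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# "Hot" core table: kept as narrow as possible so the frequent,
# many-million-row joins against it stay cheap.
cur.execute("CREATE TABLE item (item_id INTEGER PRIMARY KEY, name TEXT)")

# Detail table, one-to-one with the core table: the rarely-queried
# attributes that logically belong to item live here instead.
cur.execute("""CREATE TABLE item_detail (
    item_id INTEGER PRIMARY KEY REFERENCES item(item_id),
    description TEXT,
    notes TEXT)""")

cur.execute("INSERT INTO item VALUES (1, 'widget')")
cur.execute("INSERT INTO item_detail VALUES (1, 'a widget', 'rarely read')")

# Most queries touch only the narrow core table...
row = cur.execute("SELECT name FROM item WHERE item_id = 1").fetchone()

# ...and the occasional query re-joins the vertical partitions.
full = cur.execute("""SELECT i.name, d.description
                      FROM item i JOIN item_detail d USING (item_id)
                      WHERE i.item_id = 1""").fetchone()
```

A horizontal partition would instead split the *rows* of `item` by some
attribute's value (e.g. a region column), keeping the schema identical in
each partition.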
> > Next, Query Broadcast Load Balancing... also needs a lot of work.
> > First, it's foremost in my memory that sending read queries everywhere
> > and returning the first result set back is a key way to improve
> > application performance at the cost of additional load on other
> > systems - I guess that's not at all what the document is after here,
> > but it's a worthy part of a dialogue on broadcasting queries. In other
> > words, this has more parts to it than just what the document now
> > entertains. Secondly, the document
>
> Uh, do we want to go into that here?  I guess I could.
>
> > doesn't address _at_all_ whether this is a two-phase-commit
> > environment or not. If not, how are updates managed? If each server
> > operates independently and one of them fails, what do you do then?
> > How do you know _any_ server got an insert/update? ... Each server
> > _can't_ operate independently unless the application does its own
> > insert/update commits to every one of them - and that can't be fast,
> > nor does it load balance, though it may contribute to superior uptime
> > performance by the application.
>
> I think having the application middle layer do the commits is how it
> works now.  Can someone explain how pgpool works, or should we mention
> how two-phase commit has to be done here?  pgpool2 has additional
> features.

Well, you hadn't mentioned two-phase commit at all, and it surely belongs
somewhere in this document - it's a core PG feature and enables a lot of
the alternative solutions the document discusses. What it needs to say
but doesn't (didn't?) is that the load from read queries can be
distributed for load-balancing purposes, but that no such benefit is
possible for writes, and that replication overhead could possibly
overwhelm the benefits in high-update scenarios. The point that each
server operates independently is only true if you ignore the necessary
replication - which, to my mind, links the systems, so they are not
independent.
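The "application middle layer does the commits" approach described above
can be sketched as a toy in a few lines - this is not pgpool, just an
illustration with sqlite3 in-memory databases standing in for independent
PostgreSQL servers, and all names (`nodes`, `write_everywhere`,
`read_any`) are hypothetical:

```python
# Toy middle-layer sketch: writes are applied to every node, reads can
# be served by any single node. sqlite3 stands in for real servers.
import sqlite3

nodes = [sqlite3.connect(":memory:") for _ in range(3)]
for n in nodes:
    n.execute("CREATE TABLE t (k INTEGER PRIMARY KEY, v TEXT)")

def write_everywhere(sql, params=()):
    # Naive all-nodes commit: without two-phase commit, a crash between
    # the commit loop's iterations can leave some nodes committed and
    # others not - exactly the failure mode questioned in the thread.
    for n in nodes:
        n.execute(sql, params)
    for n in nodes:
        n.commit()

def read_any(sql, params=(), node=0):
    # Reads load-balance: any one replica can answer.
    return nodes[node].execute(sql, params).fetchall()

write_everywhere("INSERT INTO t VALUES (?, ?)", (1, "hello"))
assert all(read_any("SELECT v FROM t WHERE k = 1", node=i) == [("hello",)]
           for i in range(len(nodes)))
```

For reference, PostgreSQL's actual two-phase commit support exposes
`PREPARE TRANSACTION` and `COMMIT PREPARED` / `ROLLBACK PREPARED`, which
let a coordinator confirm every node can commit before any node does.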
...I suppose that in a completely read-only environment - or one updated
nightly by dumping tarwads or something like that - they could be
considered independent, but it's hardly worth the sentence.

Regards,
Richard

--
Richard Troy, Chief Scientist
Science Tools Corporation
510-924-1363 or 202-747-1263
[EMAIL PROTECTED], http://ScienceTools.com/