On Wed, 25 Oct 2006, Bruce Momjian wrote:

...snip...
>
> > Data partitioning is often done within a single database on a single
> > server and therefore, as a concept, has nothing whatsoever to do with
> > different servers. Similarly, the second paragraph of this section is
>
> Uh, why would someone split things up like that on a single server?
>
> > problematic. Please define your term first, then talk about some
> > implementations - this is muddying the water. Further, there are both
> > vertical and horizontal partitioning - you mention neither - and each
> > has its own distinct uses. If partitioning is mentioned, it should be
> > more complete.
>
> Uh, what exactly needs to be defined.
OK, "Data partitioning": data partitioning begins in the RDB world with
the very notion of tables, and we partition our data during schema
development with the goal of "normalizing" the design - "third normal
form" being the one most professors talk about as a target. "Data
partitioning", then, is the intentional denormalization of the design to
accomplish some goal(s) - not all of which are listed in this document's
title.

In this context, data partitioning takes two forms, based upon which axis
of a two-dimensional table is to be divided: a vertical partition divides
the attributes (as in a master/detail relationship with a one-to-one
mapping), while a horizontal partition divides on the domain, or values,
of one or more attributes (as in your example of London records being
kept in a database in London, while Paris records are kept in Paris).
The point I was making was that that section of the document was in error
because it presumed there was only one form of data partitioning, and
that it was horizontal. (The document is now missing, so I can't look at
the current content - it was here:
ftp://momjian.us/pub/postgresql/mypatches/replication.)

In answer to your query about why someone would use such partitioning,
the nearly universal answer is performance, and the distant second answer
is security. In one example that comes immediately to mind, there is a
table which is a central core of an application, and, as such, there's a
lot to say about the items in this table. The table runs to tens or
hundreds of millions of rows and needs to be joined with something else
in a huge fraction of queries. For performance reasons, the table was
therefore kept as narrow as possible, and detail table(s) hold the
remaining attributes that logically belong in it - it's a vertical
partition. It's an exceptionally common technique - so common, it
probably didn't occur to you that you were even talking about it when you
spoke of "data partitioning."
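To make the vertical-partition example above concrete, here is a minimal
sketch - using Python's sqlite3 as a stand-in for PostgreSQL, with
hypothetical table and column names (`item`, `item_detail`) - of a narrow
"hot" core table paired with a one-to-one detail table:

```python
# Vertical partitioning sketch: sqlite3 stands in for PostgreSQL;
# the schema and names are illustrative, not from the document.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# "Hot" core table: kept as narrow as possible so the frequent,
# many-million-row joins against it stay cheap.
cur.execute("CREATE TABLE item (item_id INTEGER PRIMARY KEY, name TEXT)")

# Detail table, one-to-one with the core table: the rarely-queried
# attributes that logically belong to item live here instead.
cur.execute("""CREATE TABLE item_detail (
    item_id INTEGER PRIMARY KEY REFERENCES item(item_id),
    description TEXT,
    notes TEXT)""")

cur.execute("INSERT INTO item VALUES (1, 'widget')")
cur.execute("INSERT INTO item_detail VALUES (1, 'a widget', 'rarely read')")

# Most queries touch only the narrow core table...
row = cur.execute("SELECT name FROM item WHERE item_id = 1").fetchone()

# ...and the occasional query re-joins the vertical partitions.
full = cur.execute("""SELECT i.name, d.description
                      FROM item i JOIN item_detail d USING (item_id)
                      WHERE i.item_id = 1""").fetchone()
```

A horizontal partition would instead split the *rows* of `item` by some
attribute's value (e.g. a region column), keeping the schema identical in
each partition.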
> > Next, Query Broadcast Load Balancing... also needs a lot of work.
> > First, it's foremost in my memory that sending read queries everywhere
> > and returning the first result set back is a key way to improve
> > application performance at the cost of additional load on other
> > systems - I guess that's not at all what the document is after here,
> > but it's a worthy part of a dialogue on broadcasting queries. In other
> > words, this has more parts to it than just what the document now
> > entertains. Secondly, the document
>
> Uh, do we want to go into that here?  I guess I could.
>
> > doesn't address _at_all_ whether this is a two-phase-commit
> > environment or not. If not, how are updates managed? If each server
> > operates independently and one of them fails, what do you do then?
> > How do you know _any_ server got an insert/update? ... Each server
> > _can't_ operate independently unless the application does its own
> > insert/update commits to every one of them - and that can't be fast,
> > nor does it load balance, though it may contribute to superior uptime
> > performance by the application.
>
> I think having the application middle layer do the commits is how it
> works now.  Can someone explain how pgpool works, or should we mention
> how two-phase commit has to be done here?  pgpool2 has additional
> features.

Well, you hadn't mentioned two-phase commit at all, and it surely belongs
somewhere in this document - it's a core PG feature and enables a lot of
the alternative solutions the document discusses. What it needs to say
but doesn't (didn't?) is that the load from read queries can be
distributed for load-balancing purposes, but that no such benefit is
possible for writes, and that replication overhead could possibly
overwhelm the benefits in high-update scenarios. The point that each
server operates independently is only true if you ignore the necessary
replication - which, to my mind, links the systems, so they are not
independent.
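The "application middle layer does the commits" approach described above
can be sketched as a toy in a few lines - this is not pgpool, just an
illustration with sqlite3 in-memory databases standing in for independent
PostgreSQL servers, and all names (`nodes`, `write_everywhere`,
`read_any`) are hypothetical:

```python
# Toy middle-layer sketch: writes are applied to every node, reads can
# be served by any single node. sqlite3 stands in for real servers.
import sqlite3

nodes = [sqlite3.connect(":memory:") for _ in range(3)]
for n in nodes:
    n.execute("CREATE TABLE t (k INTEGER PRIMARY KEY, v TEXT)")

def write_everywhere(sql, params=()):
    # Naive all-nodes commit: without two-phase commit, a crash between
    # the commit loop's iterations can leave some nodes committed and
    # others not - exactly the failure mode questioned in the thread.
    for n in nodes:
        n.execute(sql, params)
    for n in nodes:
        n.commit()

def read_any(sql, params=(), node=0):
    # Reads load-balance: any one replica can answer.
    return nodes[node].execute(sql, params).fetchall()

write_everywhere("INSERT INTO t VALUES (?, ?)", (1, "hello"))
assert all(read_any("SELECT v FROM t WHERE k = 1", node=i) == [("hello",)]
           for i in range(len(nodes)))
```

For reference, PostgreSQL's actual two-phase commit support exposes
`PREPARE TRANSACTION` and `COMMIT PREPARED` / `ROLLBACK PREPARED`, which
let a coordinator confirm every node can commit before any node does.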
...I suppose that in a completely read-only environment - or one updated
nightly by dumping tarwads or something like that - they could be
considered independent, but it's hardly worth the sentence.

Regards,
Richard

--
Richard Troy, Chief Scientist
Science Tools Corporation
510-924-1363 or 202-747-1263
[EMAIL PROTECTED], http://ScienceTools.com/