Re: Quesion about querying distributed databases

Kevin Stephenson Sat, 29 Mar 2025 08:14:21 -0700

Bumping this old thread to clarify a few points.

As an initial note on terminology, a "server" can be called a server, an 
instance, a server instance, or in PostgreSQL's (PG hereafter) case, a "DB 
Cluster." They all are used interchangeably in the wild to mean the same thing, 
with perhaps "server" being the most ambiguous. And that thing is a running 
process that is listening to (typically) a single port and handing connections 
off to individual threads or processes (or some other construct). The comments 
below use the word "instance" for consistency and brevity.


MariaDB
As mentioned by others, MariaDB (and MySQL proper, both hereafter referred to 
as MySQL) instances only support a single database. The CREATE DATABASE 
statement is a de facto synonym for CREATE SCHEMA (it is disappointing that 
MySQL and MariaDB docs still do not mention this significant fact). To disabuse 
those who believe MySQL support multiple databases, query any of the 
INFORMATION_SCHEMA tables and for all those that have the *_CATALOG column 
(CATALOG is a synonym for DATABASE), notice every row in every table that shows 
a catalog column, they all say "def", as in default. Further, the identifier 
used in CREATE DATABASE will show up in INFORMATION_SCHEMA tables in *_SCHEMA 
columns.

MySQL supports one and two-part object naming for DML and it supports foreign 
data access using the Federated (deprecated in MariaDB) and FederatedX 
(MariaDB) storage engines. One-part names use the currently selected schema 
(via USE statement) along with <object_name>. Two-part names are 
<schema_name>.<object_name>. Foreign data (e.g. tables) are also accessed using 
two-part names and there are limitations on what queries can be used with them 
(I believe they can be organized in a way the one-part names would work as 
well). I'm unaware of how advanced the pushdown optimizations have become, but 
the concept of pushdown is critical to understand both for the MySQL storage 
engines as well as for the Federated* "foreign data wrappers" (FDW).

To summarize, all MySQL instances have a single database called "def" and there 
is no way to create another database on a single instance at this time (MySQL 
9.2, MariaDB 11.8). To have more than one database currently requires running 
additional instances of MySQL on a different port. The idea that there are 
"high performance cross-database queries" in MySQL is simply incorrect at this 
time. There are high-performance cross-schema queries in MySQL, as there are in 
Microsoft SQL Server (MSSQL hereafter) and PG.

Note: MariaDB is planning on supporting real databases in 12.0 and they will be 
called CATALOGs, as some other RDMSs do. See:
https://mariadb.com/kb/en/create-catalog/

It's a big TBD on how well this will perform with InnoDB and other storage 
engines for cross-catalog queries (if that will even possible).

MSSQL (On-prem, VM, Azure SQL Managed Instance)
MSSQL supports one, two, three and four-part object naming.

One part names are the <object_name> and use the user's currently selected 
database (via USE <database_name> command or as set in the connection string), 
along with the user's default schema as a prefixes, and that schema prefix is 
typically dbo (similar to public in PG) to resolve object names (which behind 
the scenes are always three-part names: 
<database_name>.<schema_name>.<object_name>). The default schema can be 
changed, similar to the PG search path being changed..

Two-part names are <schema_name>.<object_name> and again use the user's 
currently selected database to get the fully qualified three-part name.

Where MSSQL differs from both MySQL and PG is in the support for three-part 
names. These are <database_name>.<schema_name>.<object_name>, and you can in 
fact run real "cross-database" MSSQL queries without FDW trickery. 
Historically, I believe there were some light limitations on their usage, such 
as in Foreign Keys and Triggers, but that may no longer be the case.

Linked Servers are the MSSQL near equivalent of FDW and are used in four-part 
names: <linked server name>.<database_name>.<schema_name>.<object_name>.

Two other MSSQL concepts are CONTAINED DATABASES and Azure SQL Database (with 
"Elastic Queries" as FDW, shockingly in "preview" for about a decade now). 
Those are beyond the scope of this email.

PG
On the PG side (which I know the least about), it does have real databases with 
schemas, and does support FDW. But since it uses a shared WAL (at least MSSQL 
separates logs per database, unclear on InnoDB, et. al.), and it uses a shared 
security system, I don't understand why it doesn't support cross-database 
queries. But that question would be best asked in a separate thread (any maybe 
it's already in the archives anyways). Putting all relevant objects in a single 
database with different schemas has already been mentioned as a workaround.

As far as some of the other back and forth on database (and system) 
architecture, that's really an enormous (and specialty) topic. Having 
previously worked on one of the largest MSSQL installations in the world, I 
have not seen anything in PG that would prevent a proper scaling strategy as 
long as "DB as API" was not being attempted, but rather microservices or some 
other pattern was used, and frequent massive JOINs cross-instance or 
cross-server (actual servers) weren't being attempted. All three platforms 
discussed here will have performance problems with that.

Perhaps the original poster would benefit from reviewing one of the gold 
standards on this topic, "Designing Data-Intensive Applications: The Big Ideas 
Behind Reliable, Scalable, and Maintainable Systems"
https://www.amazon.com/dp/1449373321

Thanks,
Kevin

________________________________
From: Ron Johnson <[email protected]>
Sent: Thursday, March 6, 2025 10:16 AM
To: pgsql-generallists.postgresql.org <[email protected]>
Subject: Re: Quesion about querying distributed databases

On Thu, Mar 6, 2025 at 10:47 AM Igor Korot 
<[email protected]<mailto:[email protected]>> wrote:
Hi,

On Thu, Mar 6, 2025, 7:32 AM Greg Sabino Mullane 
<[email protected]<mailto:[email protected]>> wrote:
On Wed, Mar 5, 2025 at 9:44 PM me nefcanto 
<[email protected]<mailto:[email protected]>> wrote:
Anyway, that's why I asked you guys. However, encouraging me to go back to 
monolith without giving solutions on how to scale, is not helping.

We did. In addition to the ongoing FDW discussion, I mentioned read-only 
replicas and Citus. As far as *how* to scale vertically, we can offer general 
advice (more hardware resources, ramdisks for temp stuff, OS-level tuning, 
separate disk mounts). But a lot of it is tuning Postgres for your specific 
situation and your specific bottlenecks. Which we are happy to help with. Once 
we convince you to not throw the baby out with the bathwater. :)

8 Terabytes of data. A single backup took us more than 21 days

Something was fundamentally wrong there.

It could happen on an old and drained hardware... 😀

8TB databases existed 20+ years ago.  Like always, the hardware must fit the 
application.

21 days to backup a database absolutely means many things were improperly sized 
and configured.

--
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!

Re: Quesion about querying distributed databases

Reply via email to