On Sun, Sep 25, 2016 at 7:50 PM, Greg Spiegelberg <gspiegelb...@gmail.com> wrote:
> Hey all,
>
> Obviously everyone who's been in PostgreSQL or almost any RDBMS for a
> time has said not to have millions of tables. I too have long believed
> it until recently.
>
> AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1)
> for PGDATA. Over the weekend, I created 8M tables with 16M indexes on
> those tables. Table creation initially took 0.018031 secs, average
> 0.027467, and after tossing out outliers (qty 5) the maximum creation
> time found was 0.66139 seconds. Total time: 30 hours, 31 minutes and
> 8.435049 seconds. Tables were created by a single process. Do note that
> table creation is done via a plpgsql function, as there are other
> housekeeping tasks necessary, though minimal.
>
> No system tuning, but here is a list of PostgreSQL knobs and switches:
>   shared_buffers = 2GB
>   work_mem = 48MB
>   max_stack_depth = 4MB
>   synchronous_commit = off
>   effective_cache_size = 200GB
>   pg_xlog is on its own file system
>
> There are still some obvious problems. General DBA functions such as
> VACUUM and ANALYZE should not be done. Each will run forever and cause
> much grief. Backups are problematic in the traditional pg_dump and PITR
> space. Large JOINs by VIEW, SELECT or via table inheritance (I am
> abusing it in my test case) are no-nos. A system or database crash
> could potentially take hours to days to recover. There are likely other
> issues ahead.
>
> You may wonder, "why is Greg attempting such a thing?" I looked at
> DynamoDB, BigTable, and Cassandra. I like Greenplum but, let's face it,
> it's antiquated, and don't get me started on "Hadoop". I looked at many
> others, and ultimately each vendor's recommended approach was to have
> one table for all data. That overcomes the millions of tables problem,
> right?
>
> The problem with the "one big table" solution is that I anticipate
> 1,200 trillion records. Random access is expected, and the customer
> expects <30ms reads for a single record fetch.

You don't give enough details to fully explain the problem you're trying
to solve.

- Will records ever be updated or deleted? If so, what percentage and at
  what frequency?
- What specifically are you storing (e.g. lists of integers, strings,
  people's sex habits, ...)? Or more importantly, are these fixed- or
  variable-sized records?
- Once the 1,200 trillion records are loaded, is that it? Or do more
  data arrive, and if so, at what rate?
- Do your queries change, or is there a fixed set of queries?
- How complex are the joins?

The reason I ask these specific questions is that, as others have
pointed out, this might be a perfect case for a custom (non-relational)
database. Relational databases are general-purpose tools, sort of like a
Swiss Army knife. A Swiss Army knife does most things passably, but if
you want to carve wood, or butcher meat, or slice vegetables, you get a
knife meant for that specific task.

I've written several custom database-storage systems for very specific
high-performance applications. It's generally a couple of weeks of work,
and you end up with tailored performance and storage that's hard for a
general-purpose relational system to match. The difficulty of building
such a system depends a lot on the answers to the questions above.

Craig

> No data is loaded... yet. Table and index creation only. I am
> interested in the opinions of all, including tests I may perform. If
> you had this setup, what would you capture / analyze? I have a job
> running preparing data.
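Greg's creation function isn't shown in the thread, but a minimal sketch
of the kind of plpgsql routine he describes (one inherited child table
per call, plus its indexes) might look like the following. The parent
table data_parent, the columns record_id and created_at, and the
two-indexes-per-table layout implied by the 8M-tables / 16M-indexes
figures are all assumptions for illustration:

  -- Hypothetical sketch only; the actual function and its housekeeping
  -- steps are not shown in the thread.
  CREATE OR REPLACE FUNCTION create_data_table(p_id bigint)
  RETURNS void AS $$
  BEGIN
      -- One child table per call, inheriting its columns from a common parent.
      EXECUTE format('CREATE TABLE data_%s () INHERITS (data_parent)', p_id);
      -- Roughly two indexes per table, matching 16M indexes on 8M tables.
      EXECUTE format('CREATE INDEX data_%s_id_idx ON data_%s (record_id)',
                     p_id, p_id);
      EXECUTE format('CREATE INDEX data_%s_ts_idx ON data_%s (created_at)',
                     p_id, p_id);
  END;
  $$ LANGUAGE plpgsql;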
> I did this on a much smaller scale (50k tables), and data load via a
> function allowed close to 6,000 records/second. The schema has been
> simplified since, and the last test reached just over 20,000
> records/second with 300k tables.
>
> I'm not looking for alternatives yet, but for input on my test. Takers?
>
> I can't promise immediate feedback but will do my best to respond with
> results.
>
> TIA,
> -Greg

--
---------------------------------
Craig A. James
Chief Technology Officer
eMolecules, Inc.
---------------------------------
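The load-via-function path Greg benchmarks above isn't spelled out
either. Assuming each record carries a key that selects its child table,
a routing function along these lines would be one way to get the
per-record INSERT behaviour he describes; load_record, the
(record_id, payload) layout and the modulo routing rule are hypothetical:

  -- Hypothetical sketch; the real load function is not shown in the thread.
  CREATE OR REPLACE FUNCTION load_record(p_id bigint, p_payload text)
  RETURNS void AS $$
  BEGIN
      -- Route the record to one of the child tables (modulo rule assumed),
      -- then issue a parameterised dynamic INSERT.
      EXECUTE format('INSERT INTO data_%s (record_id, payload) VALUES ($1, $2)',
                     p_id % 8000000)
      USING p_id, p_payload;
  END;
  $$ LANGUAGE plpgsql;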