On Sun, Sep 25, 2016 at 7:50 PM, Greg Spiegelberg <gspiegelb...@gmail.com> wrote:
> Hey all,
>
> Obviously everyone who's been in PostgreSQL or almost any RDBMS for a
> time has said not to have millions of tables. I too have long believed
> it until recently.
>
> AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1)
> for PGDATA. Over the weekend, I created 8M tables with 16M indexes on
> those tables. Table creation initially took 0.018031 secs, average
> 0.027467, and after tossing out outliers (qty 5) the maximum creation
> time found was 0.66139 seconds. Total time: 30 hours, 31 minutes and
> 8.435049 seconds. Tables were created by a single process. Do note that
> table creation is done via a plpgsql function, as there are other
> housekeeping tasks necessary, though minimal.
>
> No system tuning, but here is a list of PostgreSQL knobs and switches:
>   shared_buffers = 2GB
>   work_mem = 48MB
>   max_stack_depth = 4MB
>   synchronous_commit = off
>   effective_cache_size = 200GB
>   pg_xlog is on its own file system
>
> There are still some obvious problems. General DBA functions such as
> VACUUM and ANALYZE should not be done. Each will run forever and cause
> much grief. Backups are problematic in the traditional pg_dump and PITR
> space. Large JOINs by VIEW, SELECT or via table inheritance (I am
> abusing it in my test case) are no-nos. A system or database crash
> could potentially take hours to days to recover. There are likely other
> issues ahead.
>
> You may wonder, "why is Greg attempting such a thing?" I looked at
> DynamoDB, BigTable, and Cassandra. I like Greenplum but, let's face it,
> it's antiquated, and don't get me started on "Hadoop". I looked at many
> others, and ultimately each vendor's recommended approach was to have
> one table for all data. That overcomes the millions of tables problem,
> right?
>
> The problem with the "one big table" solution is that I anticipate
> 1,200 trillion records. Random access is expected, and the customer
> expects <30ms reads for a single record fetch.

You don't give enough details to fully explain the problem you're trying
to solve.

- Will records ever be updated or deleted? If so, what percentage and at
  what frequency?
- What specifically are you storing (e.g. lists of integers, strings,
  people's sex habits, ...)? Or more importantly, are these fixed- or
  variable-sized records?
- Once the 1,200 trillion records are loaded, is that it? Or do more
  data arrive, and if so, at what rate?
- Do your queries change, or is there a fixed set of queries?
- How complex are the joins?

The reason I ask these specific questions is that, as others have
pointed out, this might be a perfect case for a custom (non-relational)
database. Relational databases are general-purpose tools, sort of like a
Swiss Army knife. A Swiss Army knife does most things passably, but if
you want to carve wood, or butcher meat, or slice vegetables, you get a
knife meant for that specific task.

I've written several custom database-storage systems for very specific
high-performance applications. It's generally a couple of weeks of work,
and you end up with tailored performance and storage that's hard for a
general-purpose relational system to match. The difficulty of building
such a system depends a lot on the answers to the questions above.

Craig

> No data is loaded... yet. Table and index creation only. I am
> interested in the opinions of all, including tests I may perform. If
> you had this setup, what would you capture / analyze? I have a job
> running preparing data.
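Greg's creation function isn't shown in the thread, but a minimal sketch
of the kind of plpgsql routine he describes (one inherited child table
per call, plus its indexes) might look like the following. The parent
table data_parent, the columns record_id and created_at, and the
two-indexes-per-table layout implied by the 8M-tables / 16M-indexes
figures are all assumptions for illustration:

  -- Hypothetical sketch only; the actual function and its housekeeping
  -- steps are not shown in the thread.
  CREATE OR REPLACE FUNCTION create_data_table(p_id bigint)
  RETURNS void AS $$
  BEGIN
      -- One child table per call, inheriting its columns from a common parent.
      EXECUTE format('CREATE TABLE data_%s () INHERITS (data_parent)', p_id);
      -- Roughly two indexes per table, matching 16M indexes on 8M tables.
      EXECUTE format('CREATE INDEX data_%s_id_idx ON data_%s (record_id)',
                     p_id, p_id);
      EXECUTE format('CREATE INDEX data_%s_ts_idx ON data_%s (created_at)',
                     p_id, p_id);
  END;
  $$ LANGUAGE plpgsql;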
> I did this on a much smaller scale (50k tables), and data load via a
> function allowed close to 6,000 records/second. The schema has been
> simplified since, and the last test reached just over 20,000
> records/second with 300k tables.
>
> I'm not looking for alternatives yet, but for input on my test. Takers?
>
> I can't promise immediate feedback but will do my best to respond with
> results.
>
> TIA,
> -Greg

--
---------------------------------
Craig A. James
Chief Technology Officer
eMolecules, Inc.
---------------------------------
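The load-via-function path Greg benchmarks above isn't spelled out
either. Assuming each record carries a key that selects its child table,
a routing function along these lines would be one way to get the
per-record INSERT behaviour he describes; load_record, the
(record_id, payload) layout and the modulo routing rule are hypothetical:

  -- Hypothetical sketch; the real load function is not shown in the thread.
  CREATE OR REPLACE FUNCTION load_record(p_id bigint, p_payload text)
  RETURNS void AS $$
  BEGIN
      -- Route the record to one of the child tables (modulo rule assumed),
      -- then issue a parameterised dynamic INSERT.
      EXECUTE format('INSERT INTO data_%s (record_id, payload) VALUES ($1, $2)',
                     p_id % 8000000)
      USING p_id, p_payload;
  END;
  $$ LANGUAGE plpgsql;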