Howdy all,

We maintain a hosted multi-tenant system for a large number of users. Each
user has what we call an "organization" or "org", which is their sandbox.
Within that organization they can manage contacts and define custom fields
on those contacts, each of which then has a value per contact.

We have thousands of organizations; some have hundreds of custom fields,
and some have millions of contacts. We have tens of millions of values
across all organizations, so the data is relatively sparse, with a large
number of tiny / unused organizations. All these numbers will continue to
grow, though a single organization exceeding 10 million contacts is
unlikely.

What is the recommended way of modeling this in order to allow performant
queries on the custom fields?

Right now we basically use an EAV model (a single contact_fields table)
with compound indexes that combine the field key and field value so as to
force locality in the index. That has worked OK, but it occasionally falls
on its face when the query planner makes a wrong guess due to the
(obviously skewed) statistics it has to work with. Multi-field queries can
also get painful, especially in the largest organizations with millions of
contacts.
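
For concreteness, the current layout is roughly this (simplified, with
illustrative names and a made-up org id):

    CREATE TABLE contact_fields (
        org_id     int    NOT NULL,
        contact_id bigint NOT NULL,
        field_key  text   NOT NULL,
        value      text,
        PRIMARY KEY (contact_id, field_key)
    );

    -- compound index: all rows for one field in one org sort together,
    -- which gives locality for "field = X AND value = Y" lookups
    CREATE INDEX contact_fields_locality_idx
        ON contact_fields (org_id, field_key, value);

    -- a two-field query turns into a self-join, which is where the
    -- planner's estimates tend to go wrong on the big orgs:
    SELECT a.contact_id
      FROM contact_fields a
      JOIN contact_fields b USING (contact_id)
     WHERE a.org_id = 42 AND a.field_key = 'state' AND a.value = 'CA'
       AND b.org_id = 42 AND b.field_key = 'plan'  AND b.value = 'pro';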

What other approaches should we be looking at? We've brainstormed different
approaches but would love some wisdom to help us narrow down which ones are
reasonable to test. Most of our designs home in on creating a table per
organization to hold field values and dynamically creating indexes on it.
The idea is that we won't hurt insert performance as much, since there is
only one index per field to update on insert, and table statistics should
still be OK. (How are statistics managed for JSONB fields?) The main
question we have is: what is going to happen if we have thousands (or tens
of thousands) of tables in a single database? The good news is that the
vast majority of our organizations are idle at any point in time.

Approaches we've thought about and questions / pros / cons:

1) Add a JSONB column to our (shared across organizations) contact table
and store field values there. Create a JSONB expression index per unique
field (a rough sketch follows this list).
   pros: the locality of data on the contact row is nice, multi-field
queries are likely way better, query performance should be good
   cons: we have to create thousands of indexes? we have to use UUIDs as
keys to keep our indexes org-specific? insert performance suffers from
having thousands of partial indexes (how badly)?

2) Create a table per organization, `contact_fields_[org_id]`, containing a
column per field. Create columns and indexes per unique field (also
sketched below).
   pros: locality is nice again, multi-field queries are better, query and
insert performance should be good.
   cons: thousands of tables with up to 100 indexes each; is that going to
blow up?

3) Create a table per organization, `contacts_fields_[org_id]`, which
contains a `fields` JSONB column. Create JSONB expression indexes per
unique field (also sketched below).
  pros: locality is good, multi-field queries are good, query performance
should be good. Adding and removing fields is a bit simpler than in case 2)
above, and naming can be a bit clearer than in 1), since we don't have to
worry about cross-org key name collisions
  cons: same as 2) but with a JSONB flavor; hundreds of JSONB indexes
across thousands of tables, and thousands of tables

4) Create a database per organization and use approach 1) within it.
  pros: all data is localized, we might see performance improvements
elsewhere, query and insert performance should be good
  cons: mother of a refactor :)  What happens with thousands of databases
on a single box? Is this actually better from the perspective of being able
to offload currently inactive orgs?
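
To make 1) concrete, here is roughly what we're imagining (names are
illustrative; in practice the JSON keys would be per-org field UUIDs, and
org 42 is made up):

    ALTER TABLE contacts ADD COLUMN fields jsonb;

    -- one partial expression index per (org, field); the WHERE clause
    -- keeps every other org's contacts out of this index
    CREATE INDEX contacts_org42_state_idx
        ON contacts ((fields ->> 'state'))
        WHERE org_id = 42;

    -- multi-field queries then touch a single row per contact:
    SELECT id
      FROM contacts
     WHERE org_id = 42
       AND fields ->> 'state' = 'CA'
       AND fields ->> 'plan'  = 'pro';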
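
And for 2) and 3), the per-org DDL we'd be generating would look something
like this (again illustrative; the real columns and keys would come from
each org's field definitions):

    -- 2) one real column per custom field
    CREATE TABLE contact_fields_42 (
        contact_id bigint PRIMARY KEY,
        state      text,
        plan       text
    );
    CREATE INDEX ON contact_fields_42 (state);
    CREATE INDEX ON contact_fields_42 (plan);

    -- 3) same shape, JSONB flavor, with expression indexes
    CREATE TABLE contacts_fields_42 (
        contact_id bigint PRIMARY KEY,
        fields     jsonb
    );
    CREATE INDEX ON contacts_fields_42 ((fields ->> 'state'));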

What other approaches should we be considering? I know EAV is a pain,
especially in multi-tenant situations, but we'd love to hear success (and
failure) stories from the community on how they've dealt with setups like
this.

Cheers,

-Nic
