On 16/04/14 13:10, Robert DiFalco wrote:
1. >500K rows per day into the calls table.
2. Very rarely. The only common query is gathering users that have not
been called "today" (along with some other qualifying criteria). More
analytical queries/reports are done for internal use, and it is not
essential that they be lickety-split.
a. Usually just one connection at a time executes read queries.
b. The "users not called today" query will be run once a day.
c. Daily
d. All users for the last year (if you are asking about retention). We
will also, rarely, have to run it for all time.
e. Not that I'm aware of (or have seen) today.
f. For the simple queries we cannot afford latency between a call being
recorded and querying who has already been called (a rough sketch of
that query follows this list).
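For concreteness, the "not called today" query is essentially an anti-join
along these lines (a rough sketch; the users/calls table and column names
are stand-ins for our actual schema):

    SELECT u.id
    FROM users u
    WHERE NOT EXISTS (
        SELECT 1
        FROM calls c
        WHERE c.user_id = u.id
          AND c.created_at >= date_trunc('day', now())
    );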
While I don't seem to be getting much support for it here :D, my write
performance (which is the most essential thing) has been much better since I
further normalized the tables so that NULL is never used and data is never
updated (i.e. a row is immutable once it is written).
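Roughly, the layout looks like this (a sketch only; the table names come from
the original post, and the exact columns here are simplified guesses):

    CREATE TABLE calls (
        id         bigserial   PRIMARY KEY,
        user_id    bigint      NOT NULL REFERENCES users (id),
        created_at timestamptz NOT NULL DEFAULT now()
    );

    -- One narrow, insert-only table per call state instead of nullable columns:
    CREATE TABLE calls_connected (
        call_id      bigint      PRIMARY KEY REFERENCES calls (id),
        connected_at timestamptz NOT NULL DEFAULT now()
    );

    -- calls_completed, call_errors, etc. follow the same pattern.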
As for wanting to avoid NULLs, I don't really know what to say.
Obviously NULLs are sometimes required. For this design I don't
really need them, and they make the data harder to reason about
(because they are somewhat open to interpretation). They can also give
you different results than you expect (for example, when looking for a
non-matching key you start having to inject OR IS NULL clauses and the
like). The absence of NULLs can also make a lot of queries more
efficient. That said, I understand where you all are coming from
with denormalization. It's definitely the path of least
resistance; our instinct is to want to see all related data in a
single table when possible.
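For example, with a hypothetical call_log table that has a nullable status
column (not how our tables actually look):

    -- This silently drops every row where status IS NULL:
    SELECT * FROM call_log WHERE status <> 'error';

    -- ... so you end up injecting the extra predicate:
    SELECT * FROM call_log WHERE status <> 'error' OR status IS NULL;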
The summary table was really a separate point from whether people liked
my schema -- whether or not I denormalize as people are suggesting,
there would still be the question of keeping a summary table for the
MAX and COUNT queries or not having one at all. I probably made the
original question too open-ended.
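For reference, the trigger-maintained variant I was weighing looks roughly
like this (a sketch only; it ignores the insert race under heavy concurrency,
and the names are placeholders):

    CREATE TABLE call_summary (
        user_id     bigint      PRIMARY KEY REFERENCES users (id),
        call_count  bigint      NOT NULL DEFAULT 0,
        last_called timestamptz
    );

    CREATE OR REPLACE FUNCTION bump_call_summary() RETURNS trigger AS $$
    BEGIN
        -- Try to update an existing summary row first...
        UPDATE call_summary
           SET call_count  = call_count + 1,
               last_called = NEW.created_at
         WHERE user_id = NEW.user_id;
        -- ...and create it on the user's first call.
        IF NOT FOUND THEN
            INSERT INTO call_summary (user_id, call_count, last_called)
            VALUES (NEW.user_id, 1, NEW.created_at);
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER calls_summary_trg
        AFTER INSERT ON calls
        FOR EACH ROW EXECUTE PROCEDURE bump_call_summary();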
On Tue, Apr 15, 2014 at 3:26 PM, Roxanne Reid-Bennett <r...@tara-lu.com> wrote:
On 4/14/2014 12:27 PM, Robert DiFalco wrote:
And so on for calls_connected, calls_completed, call_errors, etc.
Occasionally I will want to know things like "When was the
last time a user answered a call" or "How many times has a
user been called".
...
Sometimes I might want to get this data for a whole bunch of
users.
...
So the other option is to create a call_summary table that is
updated with triggers.
...
My only issue with a summary table is that I don't want a
bunch of null fields.
...
But over time all fields would eventually become non-NULL.
So that leads me to a summary table for EACH call state. This
approach has the downside that it creates a lot of tables and
triggers. It has the upside of being pretty efficient without
having to deal with NULL values. It's also pretty easy to
reason about.
...
So for my question -- is the choice between these a personal
preference sort of thing or is there a right or wrong
approach? Am I missing another approach that would be better?
I'm okay with SQL but I'm no expert, so I'm not sure if there
is an accepted DESIGN PATTERN for this that I am missing.
There is no right or wrong - there is better, worse, best, and
worst for any specific scenario. In my experience, most people
only have the time/money during design to get to a design that is
80% "better" than all the other answers, and then it gets refined
over time. And yes, personal experience does play a part in how
people interpret better/worse [aka religion] ;)
I didn't see anybody ask these questions - and to identify
"better" - they have to be asked.
1. How much data are you feeding into your system, and how fast?
this directly affects your choices on distribution and
parallel processing... writes vs updates vs triggers for copying
vs all reads
[and if on bare metal - potentially where you place your
logs, indexes, core lookup tables, etc.]
2. How much data are you reading out of your system, and how fast?
you have given "simple" use cases (how many calls completed
within a time frame, or to a number)
you have given very slightly more complex use cases (when
was the last time John answered a call)
you have given a slightly more bulky processing question
(how many times have these users been called)
So...
a) How many users executing read queries do you have?
b) What is the expected load for simple queries (per
week/day/hour/minute - depending upon your resolution on speed)?
c) What is the expected load for your mid-line complex queries?
d) What is the "maximum" volume you expect a bulk query to
go after (like all users in the last 18 years, or this city's
users in the last day), and how frequently will that kind of
query be executed? How much tolerance for delay do your users have?
e) do you have any known really complex queries that might
bog the system down?
f) How much lag time can you afford between capture and
reporting?
Answers to the above define your performance requirements - which
in turn define the style of schema you need. Queries can be written
to pull data from any schema design - but how fast they perform
and how easily they can be created will vary.
Chris and Vincent both targeted a balance between writes and reads
- which adequately answers 80-85% of the usages out there. But
you didn't give us any of the above - so their recommendation
(while very likely valid) may not actually fit your case at all.
As to design patterns -
"Generally" a database schema is more normalized for an
operational system because normalization results in fewer
writes/updates and lowers the risk of corruption if a failure
takes place. It also isolates updates for any specific value to
one location minimizing internally caused data corruption.
Reporting systems are generally less normalized because writes are
more one-time and reads are where the load occurs.
Sometimes you have to use data replication to have a system that
appropriately supports both.
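As one illustration of that middle ground (purely a sketch, assuming the
calls / calls_connected tables mentioned in this thread and a connected_at
timestamp), a periodically refreshed materialized view can give you a small
denormalized reporting side without triggers:

    CREATE MATERIALIZED VIEW call_report AS
    SELECT c.user_id,
           count(*)             AS calls_made,
           max(cc.connected_at) AS last_answered
    FROM calls c
    LEFT JOIN calls_connected cc ON cc.call_id = c.id
    GROUP BY c.user_id;

    -- Refresh on whatever schedule your reporting lag tolerates:
    REFRESH MATERIALIZED VIEW call_report;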
You have shown you are already aware of normalization.
If you weren't aware of approaches to Data Warehousing... you can
review information about how it is accomplished
- see the blogs on kimballgroup DOT com; they cover a lot of
high(er)-level concepts with enough specificity to be of some
direct use.
[That website also covers some ideas for "Big Data" which aren't
necessarily limited to RDBMSs.]
Specify your performance requirements, then figure out your schema
design.
FWIW I don't understand your (or anyone else's) hesitancy about
"lots of" NULL values. They provide meaning in a number of
different ways... not the least of which is that you don't know
(yet) - which is knowledge in and of itself.
Roxanne
Have you considered partial indexes, using the WHERE predicate
option of CREATE INDEX?
This can be useful if you often look for things that are only a
small subset of keys. For example, a partial index on sex would be
useful for nurses, only indexing those that are male, as they are in a
very small minority.
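In SQL, that example would look something like this (the nurses table is just
the illustration, not your schema):

    -- Only the minority of rows get index entries, so the index stays small:
    CREATE INDEX nurses_male_idx ON nurses (id) WHERE sex = 'male';

The same idea could apply to whatever small subset your common lookups
actually touch.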
Cheers,
Gavin