On 16/04/14 13:10, Robert DiFalco wrote:
1. >500K rows per day into the calls table.
2. Very rarely. The only common query is gathering users that have not
been called "today" (along with some other qualifying criteria). More
analytical queries/reports are done for internal use, and it is not
essential that they be lickety-split.
a. Usually just one connection at a time executes read queries.
b. The "users not called today" query will be run once a day.
c. Daily
d. All users for the last year (if you are asking about retention). We
will also, rarely, have to run it for all time.
e. Not that I'm aware of (or have seen) today.
f. For the simple queries we cannot afford latency between a call being
recorded and querying who has already been called (a rough sketch of
that query follows this list).
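For concreteness, the "not called today" query is essentially an anti-join
along these lines (a rough sketch; the users/calls table and column names
are stand-ins for our actual schema):

    SELECT u.id
    FROM users u
    WHERE NOT EXISTS (
        SELECT 1
        FROM calls c
        WHERE c.user_id = u.id
          AND c.created_at >= date_trunc('day', now())
    );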
While I don't seem to be getting much support for it here :D, my write
performance (which is the most essential thing) has been much better since I
further normalized the tables so that NULL is never used and data is never
updated (i.e. a row is immutable once it is written).
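Roughly, the layout looks like this (a sketch only; the table names come from
the original post, and the exact columns here are simplified guesses):

    CREATE TABLE calls (
        id         bigserial   PRIMARY KEY,
        user_id    bigint      NOT NULL REFERENCES users (id),
        created_at timestamptz NOT NULL DEFAULT now()
    );

    -- One narrow, insert-only table per call state instead of nullable columns:
    CREATE TABLE calls_connected (
        call_id      bigint      PRIMARY KEY REFERENCES calls (id),
        connected_at timestamptz NOT NULL DEFAULT now()
    );

    -- calls_completed, call_errors, etc. follow the same pattern.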
As for wanting to avoid NULLs, I don't really know what to say.
Obviously NULLs are sometimes required. For this design I don't
really need them, and they make the data harder to reason about
(because they are somewhat open to interpretation). They can also give
you different results than you expect (for example, when looking for a
non-matching key you start having to inject OR IS NULL clauses and the
like). The absence of NULLs can also make a lot of queries more
efficient. That said, I understand where you all are coming from
with denormalization. It's definitely the path of least
resistance; our instinct is to want to see all related data in a
single table when possible.
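For example, with a hypothetical call_log table that has a nullable status
column (not how our tables actually look):

    -- This silently drops every row where status IS NULL:
    SELECT * FROM call_log WHERE status <> 'error';

    -- ... so you end up injecting the extra predicate:
    SELECT * FROM call_log WHERE status <> 'error' OR status IS NULL;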
The summary table was really a separate point from whether people liked
my schema -- whether or not I denormalize as people are suggesting,
there would still be the question of keeping a summary table for the
MAX and COUNT queries or not having one at all. I probably made the
original question too open-ended.
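For reference, the trigger-maintained variant I was weighing looks roughly
like this (a sketch only; it ignores the insert race under heavy concurrency,
and the names are placeholders):

    CREATE TABLE call_summary (
        user_id     bigint      PRIMARY KEY REFERENCES users (id),
        call_count  bigint      NOT NULL DEFAULT 0,
        last_called timestamptz
    );

    CREATE OR REPLACE FUNCTION bump_call_summary() RETURNS trigger AS $$
    BEGIN
        -- Try to update an existing summary row first...
        UPDATE call_summary
           SET call_count  = call_count + 1,
               last_called = NEW.created_at
         WHERE user_id = NEW.user_id;
        -- ...and create it on the user's first call.
        IF NOT FOUND THEN
            INSERT INTO call_summary (user_id, call_count, last_called)
            VALUES (NEW.user_id, 1, NEW.created_at);
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER calls_summary_trg
        AFTER INSERT ON calls
        FOR EACH ROW EXECUTE PROCEDURE bump_call_summary();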
On Tue, Apr 15, 2014 at 3:26 PM, Roxanne Reid-Bennett <r...@tara-lu.com> wrote:
On 4/14/2014 12:27 PM, Robert DiFalco wrote:
And so on for calls_connected, calls_completed, call_errors, etc.
Occasionally I will want to know things like "When was the
last time a user answered a call" or "How many times has a
user been called".
...
Sometimes I might want to get this data for a whole bunch of
users.
...
So the other option is to create a call_summary table that is
updated with triggers.
...
My only issue with a summary table is that I don't want a
bunch of null fields.
...
But over time all fields would eventually become non-NULL.
So that leads me to a summary table for EACH call state. This
approach has the downside that it creates a lot of tables and
triggers. It has the upside of being pretty efficient without
having to deal with NULL values. It's also pretty easy to
reason about.
...
So for my question -- is the choice between these a personal
preference sort of thing or is there a right or wrong
approach? Am I missing another approach that would be better?
I'm okay with SQL but I'm no expert, so I'm not sure if there
is an accepted DESIGN PATTERN for this that I am missing.
There is no right or wrong - there is better, worse, best, and
worst for any specific scenario. In my experience, most people
only have the time/money during design to get to a design that is
80% "better" than all the other answers, and then it gets refined
over time. And yes, personal experience does play a part in how
people interpret better/worse [aka religion] ;)
I didn't see anybody ask these questions - and to identify
"better" - they have to be asked.
1. How much data are you feeding into your system, and how fast?
this directly affects your choices on distribution and
parallel processing... writes vs updates vs triggers for copying
vs all reads
[and if on bare metal - potentially where you place your
logs, indexes, core lookup tables, etc.]
2. How much data are you reading out of your system, and how fast?
you have given "simple" use cases (how many calls completed
within a time frame, or to a number)
you have given very slightly more complex use cases (when
was the last time John answered a call)
you have given a slightly more bulky processing question
(how many times have these users been called)
So...
a) How many users executing read queries do you have?
b) What is the expected load for simple queries (per
week/day/hour/minute - depending upon your resolution on speed)?
c) What is the expected load for your mid-line complex queries?
d) What is the "maximum" volume you expect a bulk query to
go after (like all users in the last 18 years, or this city's
users in the last day), and how frequently will that kind of
query be executed? How much tolerance for delay do your users have?
e) do you have any known really complex queries that might
bog the system down?
f) How much lag time can you afford between capture and
reporting?
Answers to the above define your performance requirements - which
in turn define the style of schema you need. Queries can be written
to pull data from any schema design - but how fast they perform
and how easily they can be created will vary.
Chris and Vincent both targeted a balance between writes and reads
- which adequately answers 80-85% of the usages out there. But
you didn't give us any of the above - so their recommendation
(while very likely valid) may not actually fit your case at all.
As to design patterns -
"Generally" a database schema is more normalized for an
operational system because normalization results in fewer
writes/updates and lowers the risk of corruption if a failure
takes place. It also isolates updates for any specific value to
one location minimizing internally caused data corruption.
Reporting systems are generally less normalized because writes are
more one-time and reads are where the load occurs.
Sometimes you have to use data replication to have a system that
appropriately supports both.
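As one illustration of that middle ground (purely a sketch, assuming the
calls / calls_connected tables mentioned in this thread and a connected_at
timestamp), a periodically refreshed materialized view can give you a small
denormalized reporting side without triggers:

    CREATE MATERIALIZED VIEW call_report AS
    SELECT c.user_id,
           count(*)             AS calls_made,
           max(cc.connected_at) AS last_answered
    FROM calls c
    LEFT JOIN calls_connected cc ON cc.call_id = c.id
    GROUP BY c.user_id;

    -- Refresh on whatever schedule your reporting lag tolerates:
    REFRESH MATERIALIZED VIEW call_report;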
You have shown you are already aware of normalization.
If you weren't aware of approaches to Data Warehousing... you can
review information about how it is accomplished
- see the blogs on kimballgroup DOT com; they cover a lot of
high(er)-level concepts with enough specificity to be of some
direct use.
[That website also covers some ideas for "Big Data" which aren't
necessarily limited to RDBMSs.]
Specify your performance requirements, then figure out your schema
design.
FWIW I don't understand your (or anyone else's) hesitancy about
"lots of" NULL values. They provide meaning in a number of
different ways... not the least of which is that you don't know
(yet) - which is knowledge in and of itself.
Roxanne
Have you considered partial indexes, using the WHERE predicate
option of CREATE INDEX?
This can be useful if you often look for things that are only a
small subset of keys. For example, a partial index on sex would be
useful for nurses, only indexing those that are male, as they are in a
very small minority.
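In SQL, that example would look something like this (the nurses table is just
the illustration, not your schema):

    -- Only the minority of rows get index entries, so the index stays small:
    CREATE INDEX nurses_male_idx ON nurses (id) WHERE sex = 'male';

The same idea could apply to whatever small subset your common lookups
actually touch.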
Cheers,
Gavin