On 5/9/06, Aaron Sherman <[EMAIL PROTECTED]> wrote:
> Steve Revilak wrote:
[...]
> > If you haven't normalized the pants out of the table, that's probably
> > a good place to start.  Anything you can do to trim a few bytes off
> > the row size will make a big difference in the size of the overall
> > table.  (Is this table part of a data warehouse?).  Smaller data will
> > always be faster than bigger data.
> >
>
> I don't mean any offense, but I could not disagree more with ... well,
> every sentence. First off, normalization is the antithesis of
> performance, as it almost always results in more joins. It's the right
> thing to do in many cases, but think twice before "normalizing the pants
> out of the table" for performance. It won't be a win.

Don't be so fast to say that it won't be a win.  In a read-only
environment, "normalizing the pants" out of a table can sometimes be a
big win, at the cost of a lot of pain.

Long ramble alert.

Many moons ago (1997) I was dealing with a fairly large data dump.
(Well, large for the hardware and database.  The database was Access
2, the target hardware was low-end desktop PCs running Windows 3.1,
and the full data set was hundreds of thousands of rows.  Gawds, that
sounds puny.)  I went to the extreme of normalization: every text
field (e.g. company, address, phone #, etc.) was replaced by an ID
chosen to fit in as few bytes as possible and ordered identically with
the original data set, and the actual values were put into a separate
table.  I then had to write every query as a pair of queries.  The
first query ran against the small table of IDs, plus whatever joins
the where condition needed, and did its group bys on the IDs.  The
second query just joined the result of the first to all of the display
fields, and ordered by the ID fields if I needed a sort.  (Given the
nature of the application, all queries needed to do a group by.)

Of course I wrote a query generator to produce this pair of queries;
there is NO way that I wanted to do that by hand!  (Actually it was
two pairs of queries, since there was a master-detail view.)
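The two-query pattern above can be sketched in miniature.  This is a
hypothetical toy example (the table and column names are made up, and
I'm using Python's sqlite3 rather than Access 2, obviously): a narrow
fact table of small integer IDs, the text values factored out into
lookup tables, query 1 doing the group by over the narrow table, and
query 2 joining the small result set back to the display values.

```python
import sqlite3

# Hypothetical schema: text fields replaced by small integer IDs,
# with the actual values stored once each in lookup tables.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE company (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE city    (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact    (company_id INTEGER, city_id INTEGER, amount INTEGER);
""")
con.executemany("INSERT INTO company VALUES (?, ?)",
                [(1, "Acme"), (2, "Globex")])
con.executemany("INSERT INTO city VALUES (?, ?)",
                [(1, "Boston"), (2, "Natick")])
con.executemany("INSERT INTO fact VALUES (?, ?, ?)",
                [(1, 1, 10), (1, 2, 20), (2, 1, 5), (1, 1, 7)])

# Query 1: group by on the narrow fact table only, so the scan
# touches nothing but small integer rows.
con.execute("""
    CREATE TEMP TABLE step1 AS
    SELECT company_id, city_id, SUM(amount) AS total
    FROM fact
    GROUP BY company_id, city_id
""")

# Query 2: join the (small) aggregated result back to the lookup
# tables for display, ordering by the ID columns -- which works as a
# sort because the IDs mirror the original data's order.
rows = con.execute("""
    SELECT company.name, city.name, step1.total
    FROM step1
    JOIN company ON company.id = step1.company_id
    JOIN city    ON city.id    = step1.city_id
    ORDER BY step1.company_id, step1.city_id
""").fetchall()

print(rows)
# [('Acme', 'Boston', 17), ('Acme', 'Natick', 20), ('Globex', 'Boston', 5)]
```

The point is that the expensive aggregation only ever reads the
narrow rows; the wide text values are touched once per *result* row,
not once per *scanned* row.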

With this schema, even the most demanding queries ran faster than the
original table could be scanned once!  And the resulting database,
uncompressed, was smaller than the compressed original data file.

Now this comparison is a bit unfair, since I was comparing a
completely non-normalized table with an insanely normalized one.  A
moderate amount of normalization would have achieved most of the win.
(However, it would have made the automatic query generator harder to
write, and since the point of the application was to make a specific
data set available to people who understood nothing about databases,
the automatic query generator was key.)

End of long ramble.

> The row size is ignorable, unless you've constructed your data and your
> indexes poorly. If you need to fetch every record, then of course, you
> want to have as little data as possible, but if you just need to fetch
> key records or compute summary information, then row size is not your
> most important concern by far.

The magic figure varies depending on lots of things, including your
hardware, but if you need more than, say, 5-7% of the records in a
table, then it is usually faster to do a full table scan.  And if you
have to do a full table scan, then row size matters.  Whether or not
you often need to access that much of the table is frequently more a
matter of what questions your users are trying to answer than of how
well you've designed the schema.  Until you know what your users need
to know, you can't know that.

> Smaller data vs. bigger data is orders of magnitude less important than
> less data vs. more data, and the way you reduce the AMOUNT of data is to
> use indexes wisely. This reduces the amount of data that any given query
> must interact with, rather than trying to reduce the data in the table.
[...]

Actual query that I wrote today: "How many people viewed how many
different things last month?"  That is, how many people viewed 1 thing
last month (possibly several times)?  How many people viewed 2 things?
3 things?  Etc.  No index in the world will avoid retrieving every
record of a viewing for last month.  (However, to be fair, retrieving
that dataset is orders of magnitude less work than manipulating it
into that form.)
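For the curious, that kind of query is a nested group by.  Here's a
minimal sketch (again in Python/sqlite3, with an invented `viewing`
table and columns - not the real schema): the inner query counts
distinct things per person for the month, and the outer query counts
people per bucket.  Note that both still have to read every matching
viewing row.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE viewing (person TEXT, thing TEXT, viewed_on TEXT)")
con.executemany("INSERT INTO viewing VALUES (?, ?, ?)", [
    ("alice", "a", "2006-04-01"),
    ("alice", "a", "2006-04-15"),   # repeat view of the same thing
    ("alice", "b", "2006-04-20"),
    ("bob",   "a", "2006-04-05"),
    ("carol", "c", "2006-03-30"),   # outside "last month"
])

# Inner query: distinct things viewed per person last month.
# Outer query: how many people fall into each count bucket.
rows = con.execute("""
    SELECT n_things, COUNT(*) AS n_people
    FROM (SELECT person, COUNT(DISTINCT thing) AS n_things
          FROM viewing
          WHERE viewed_on BETWEEN '2006-04-01' AND '2006-04-30'
          GROUP BY person)
    GROUP BY n_things
    ORDER BY n_things
""").fetchall()

print(rows)
# [(1, 1), (2, 1)] -- one person viewed 1 thing, one person viewed 2
```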

Cheers,
Ben
 
_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm
