On 5/9/06, Aaron Sherman <[EMAIL PROTECTED]> wrote:
> Steve Revilak wrote:
> [...]
> > If you haven't normalized the pants out of the table, that's probably
> > a good place to start.  Anything you can do to trim a few bytes off
> > the row size will make a big difference in the size of the overall
> > table.  (Is this table part of a data warehouse?)  Smaller data will
> > always be faster than bigger data.
>
> I don't mean any offense, but I could not disagree more with ... well,
> every sentence.  First off, normalization is the antithesis of
> performance, as it almost always results in more joins.  It's the right
> thing to do in many cases, but think twice before "normalizing the pants
> out of the table" for performance.  It won't be a win.
Don't be so fast to say that it won't be a win.  In a read-only
environment, "normalizing the pants" out of the table can sometimes be
a big win, at the cost of a lot of pain.

Long ramble alert.

Many moons ago (1997) I was dealing with a fairly large data dump.
(Well, large for the hardware and database.  The database was Access 2,
the target hardware was low-end desktop PCs running Windows 3.1, and
the full data set was hundreds of thousands of rows.  Gawds, that
sounds puny.)

I went to the extreme of normalization: every text field (e.g. company,
address, phone #, etc.) was replaced by an ID that was chosen to fit in
as few bytes as possible and was ordered identically with the original
data set.  The actual values were put into separate lookup tables.

I then had to write every query as a pair of queries.  The first query
ran against the small table of IDs, plus whatever joins I needed for
the WHERE condition, and did its GROUP BYs on the IDs.  The second
query joined the result of the first to all of the display fields, and
did an ORDER BY on the ID fields if I needed a sort.  (Given the nature
of the application, all queries needed to do a GROUP BY.)

Of course I wrote a query generator to produce this pair of queries;
there is NO way that I wanted to do that by hand!  (Actually it was two
pairs of queries, since there was a master-detail view.)

With this schema, even the most demanding queries ran faster than the
original table could be scanned once!  And the resulting database,
uncompressed, was smaller than the original data file was compressed.

Now, this comparison is a bit unfair, since I was comparing a
completely non-normalized table with an insanely normalized one.  A
moderate amount of normalization would have achieved most of the win.
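For the curious, here's a minimal sketch of that pair-of-queries trick in
modern terms, using Python's sqlite3 rather than Access 2.  All table and
column names (company_lookup, facts, etc.) are invented for illustration;
the real schema had many more lookup tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Lookup table: each distinct text value is replaced by a small integer ID.
cur.execute("CREATE TABLE company_lookup (id INTEGER PRIMARY KEY, name TEXT)")
# The fact table stores only the compact IDs, keeping rows as narrow as possible.
cur.execute("CREATE TABLE facts (company_id INTEGER, amount INTEGER)")

cur.executemany("INSERT INTO company_lookup VALUES (?, ?)",
                [(1, "Acme"), (2, "Globex"), (3, "Initech")])
cur.executemany("INSERT INTO facts VALUES (?, ?)",
                [(1, 10), (2, 20), (1, 5), (3, 7), (2, 3)])

# Query 1: GROUP BY on the IDs alone, touching only the narrow fact table.
cur.execute("""CREATE TEMP TABLE q1 AS
               SELECT company_id, SUM(amount) AS total
               FROM facts
               GROUP BY company_id""")

# Query 2: join the small aggregated result back to the lookup table to
# attach the display values, sorting on them only at this final step.
cur.execute("""SELECT l.name, q1.total
               FROM q1 JOIN company_lookup AS l ON l.id = q1.company_id
               ORDER BY l.name""")
print(cur.fetchall())  # [('Acme', 15), ('Globex', 23), ('Initech', 7)]
```

The point of the split is that the expensive aggregation only ever scans
narrow integer rows; the wide text values are joined in after the data has
already been boiled down to a handful of result rows.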
(A moderately normalized schema would, however, have made the automatic
query generator harder to write, and since the point of the application
was to make a specific data set available to people who understood
nothing about databases, the automatic query generator was key.)

End of long ramble.

> The row size is ignorable, unless you've constructed your data and your
> indexes poorly.  If you need to fetch every record, then of course, you
> want to have as little data as possible, but if you just need to fetch
> key records or compute summary information, then row size is not your
> most important concern by far.

The magic figure varies depending on lots of things, including your
hardware, but if you need more than, say, 5-7% of the records in a
table, then it is usually faster to do a full table scan.  And if you
have to do a full table scan, then row size matters.

Whether you often need to access that much of the table is frequently
more a matter of what questions your users are trying to answer from
your data than of how well you've designed it.  Until you know what
your users need to know, you don't know that.

> Smaller data vs. bigger data is orders of magnitude less important than
> less data vs. more data, and the way you reduce the AMOUNT of data is to
> use indexes wisely.  This reduces the amount of data that any given query
> must interact with, rather than trying to reduce the data in the table.

[...]

Here's an actual query that I wrote today: "How many people viewed how
many different things last month?"  That is, how many people viewed 1
thing last month (possibly several times)?  How many people viewed 2
things?  3 things?  Etc.  There is no index in the world that will
avoid having to retrieve every record of a viewing for last month.
(Though, to be fair, retrieving that data set is orders of magnitude
less work than manipulating it into that form.)
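The shape of that query is a GROUP BY of a GROUP BY: count distinct things
per person, then count people per distinct-thing count.  A hedged sketch in
sqlite3 follows; the viewings table, its columns, and the sample data are
all invented stand-ins for whatever the real schema looked like.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE viewings
               (person_id INTEGER, thing_id INTEGER, viewed_on TEXT)""")
cur.executemany("INSERT INTO viewings VALUES (?, ?, ?)", [
    (1, 100, '2006-04-02'),
    (1, 100, '2006-04-05'),   # repeat view: still only 1 distinct thing
    (2, 100, '2006-04-03'),
    (2, 101, '2006-04-10'),
    (3, 100, '2006-04-01'),
    (3, 101, '2006-04-20'),
    (3, 102, '2006-04-21'),
])

# Inner query: distinct things viewed per person last month.
# Outer query: how many people fall into each distinct-thing count.
# Note that the inner query must read every viewing row for the month;
# no index can avoid that.
cur.execute("""
    SELECT n_things, COUNT(*) AS n_people
    FROM (SELECT person_id, COUNT(DISTINCT thing_id) AS n_things
          FROM viewings
          WHERE viewed_on BETWEEN '2006-04-01' AND '2006-04-30'
          GROUP BY person_id)
    GROUP BY n_things
    ORDER BY n_things
""")
print(cur.fetchall())  # [(1, 1), (2, 1), (3, 1)]
```

With this sample data, one person viewed 1 thing, one viewed 2, and one
viewed 3, hence one row per bucket in the output.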
Cheers,
Ben

_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm

