Re: [PERFORM] Compare rows

Greg Spiegelberg Wed, 08 Oct 2003 12:13:30 -0700

Josh Berkus wrote:

Greg,
The data represents metrics at a point in time on a system for
network, disk, memory, bus, controller, and so-on.  Rx, Tx, errors,
speed, and whatever else can be gathered.
We arrived at this one 642 column table after testing the whole
process from data gathering, methods of temporarily storing then
loading to the database.  Initially, 37+ tables were in use but
the one big-un has saved us over 3.4 minutes.
Hmmm ... if few of those columns are NULL, then you are probably right ... this is probably the most normalized design. If, however, many of columns are NULL the majority of the time, then the design you should be using is a vertial child table, of the form ( value_type | value ).

Such a vertical child table would also make your comparison between instances *much* easier, as it could be executed via a simple 4-table-outer-join and 3 where clauses. So even if you don't have a lot of NULLs, you probably want to consider this.

You lost me on that one. What's a "vertical child table"?

Statistically, about 6% of the rows use more than 200 of the columns,
27% of the rows use 80-199 or more columns, 45% of the rows use 40-79
columns and the remaining 22% of the rows use 39 or less of the columns.
That is a lot of NULLS.  Never gave that much thought.

To ensure query efficiency, hide the NULLs and simulate the multiple
tables I have a boatload of indexes, ensure that every query makees use
of an index, and have created 37 views.  It's worked pretty well so
far

The reason for my initial question was this.  We save changes only.
In other words, if system S has row T1 for day D1 and if on day D2
we have another row T1 (excluding our time column) we don't want
to save it.
If re-designing the table per the above is not a possibility, then I'd suggest that you locate 3-5 columns that: 1) are not NULL for any row; 2) combined, serve to identify a tiny subset of rows, i.e. 3% or less of the table.

There are always, always, always 7 columns that contain data.

Then put a multi-column index on those columns, and do your comparison. Hopefully the planner should pick up on the availablity of the index and scan only the rows retrieved by the index. However, there is the distinct possibility that the presence of 637 WHERE criteria will confuse the planner, causing it to resort to a full table seq scan; in that case, you will want to use a subselect to force the issue.

That's what I'm trying to avoid is a big WHERE (c1,c2,...,c637) <> (d1,d2,...,d637) clause. Ugly.

Or, as Joe Conway suggested, you could figure out some kind of value hash that uniquely identifies your rows.


I've given that some though and though appealing I don't think I'd care
to spend the CPU cycles to do it.  Best way I can figure to accomplish
it would be to generate an MD5 on each row without the timestamp and
store it in another column, create an index on the MD5 column, generate
MD5 on each line I want to insert.  Makes for a simple WHERE...

Okay. I'll give it a whirl. What's one more column, right?

Greg

--
Greg Spiegelberg
 Sr. Product Development Engineer
 Cranel, Incorporated.
 Phone: 614.318.4314
 Fax:   614.431.8388
 Email: [EMAIL PROTECTED]
Cranel. Technology. Integrity. Focus.

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
   (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])

Re: [PERFORM] Compare rows

Reply via email to