Thanks all for the previous feedback. If no-one comes up with any errors in this draft I'll call it finished.
- Richard Huxton A Brief Guide to NULLs ====================== revision: 0.9 date: 2002-01-17 author: Richard Huxton <[EMAIL PROTECTED]> Overview ======== This is a short guide to the use of nulls in SQL databases. It is written with Postgresql in mind but should be applicable to any SQL-based DBMS. Thanks to the members of the psql-sql mailing list for their assistance in preparing this guide. You can get further information in: Any good relational database book (try something written by Date or Pascal) Bruce's book (link LHS at http://techdocs.postgresql.org) My Postgresql Notes (link at http://techdocs.postgresql.org) PART I - INTRODUCTION What is a null? =============== A null is *not* an empty string. A null is *not* a value like others. A null is the absence of a value[1]. What do nulls mean? =================== Well, they *should* mean one of two things: 1. There is no applicable value 2. There is a value but it is unknown Example 1: Imagine you have a customer table with name and sex fields. If you get a new customer "ACME Widgets Ltd", the sex field is meaningless since your customer is a company (case 1). If you get a new customer "Jackie Smith" they might be male or female, but you might not know (case 2). Actually, since you are trying to store a company in the ACME example that might indicate that you need to rethink your design. Example 2: You have an address table with (street,city,county,postalcode) fields. You might insert an address ("10 Downing Street","London",Null,"WC1 1AA") since you don't have a valid county. You might also insert an address ("1 Any Street","Maidstone","Kent",Null) where there *must be* a valid postalcode, but you don't know what it is. It might be useful to be able to distinguish between these two cases - not applicable and unknown, but there is only one option "Null" available to us, so we can't. How do nulls work? ================== There is one very important rule when dealing with nulls. A null is unknown and thus not equal to, less than or greater than any value it is compared to. Example: with the customer table above you could run the following queries: SELECT * FROM customer WHERE sex='M'; SELECT * FROM customer WHERE sex<>'M'; Now you might think this returns all customers, but it will miss those where sex is null. You've asked for all rows where the value of sex is 'M' and all those with values not equal to 'M' but not rows with *no value at all* It might help to think of a database as a set of statements you *know* to be true. A null indicates that you *cannot say anything at all* about that field. You can't say what it is, you can't say what it isn't, you can only say there is some information missing. So, to see all the customers with unknown or inapplicable sex you would need: SELECT * FROM customer WHERE sex IS NULL; Note that the following will not work, you need to use "IS NULL" SELECT * FROM customer WHERE sex=NULL; There are actually three possible results for a test in SQL - True (the test passed), False (the test failed) and Null (unknown or can't say). The table below indicates the result of using AND/OR operations on a,b for values of True,False and Null. a | b | a AND b | a OR b ------+-------+---------+-------- TRUE | TRUE | TRUE | TRUE TRUE | FALSE | FALSE | TRUE TRUE | NULL | NULL | TRUE FALSE | FALSE | FALSE | FALSE FALSE | NULL | FALSE | NULL NULL | NULL | NULL | NULL In the example of a=True,b=Null, (a AND b) is Null (which gets treated as false for the purposes of WHERE clauses). However (a OR b) is True since if a is True, we don't care what b is. If you try to perform an operation on nulls, again the result is always null. So the results of all of the following are null: SELECT 'abc' || null; SELECT 1 + null; SELECT sqrt(null::numeric); The first case can be especially confusing. Concatenating a null string to a string value will return null, not the original value. This can catch you out if you are joining first_name to last_name and one of them contains nulls. How are nulls implemented? ========================== You can think of each null-able field/column having a separate "is_null" flag attached to it. So, if you have a column "a" of type integer, in addition to space required to store the number, there is another bit which says whether the item is null and the value should be ignored. Of course, there are optimisations that get made, but that is the general idea. PART II - IMPLICATIONS Uniqueness and nulls ==================== If you define a unique index on a column it prevents you inserting two values that are the same. It does not prevent you inserting as many nulls as you like. How could it? You don't have a value so it can't be the same as any other. Example: We create a table "ta" with a unique constraint on column "b" CREATE TABLE ta ( a int4, b varchar(3), PRIMARY KEY (a) ); CREATE UNIQUE INDEX ta_b_idx ON ta (b); INSERT INTO ta VALUES (1,'aaa'); -- succeeds INSERT INTO ta VALUES (2,'bbb'); -- succeeds INSERT INTO ta VALUES (3,null); -- succeeds INSERT INTO ta VALUES (4,'bbb'); -- fails INSERT INTO ta VALUES (5,null); -- succeeds! Given the definition of what a null is, you only have two choices: allow multiple nulls or allow no nulls. If you want no nulls, define the column as NOT NULL when creating the table. Keys and nulls ============== No column that is part of a primary key can be null. When you define a PRIMARY KEY, none of the columns mentioned can take a null value. Postgresql makes sure of this by defining the columns as NOT NULL for you. Example: With table "ta" we just created, \d ta will show column a as being not null. Otherwise, we could insert several rows with "a" set to null and have no way to tell them apart. If the primary key was defined as being over (a,b) then neither could be null. Aggregates and nulls ==================== You need to be careful using count() if a column can contain nulls. The count() function is defined as counting *values* and so skips nulls. The same applies to other aggregates like sum() or max() but these behave more intuitively. Example: we have a table ta (a int4, b int4) with the following data. a | b ---+---- 1 | 10 2 | 20 3 | <null> SELECT count(*) as num_rows, count(a) as num_a, count(b) as num_b FROM ta; num_rows | num_a | num_b ----------+-------+------- 3 | 3 | 2 SELECT sum(b) FROM ta; sum ----- 30 If you were trying to calculate the average of column b then sum(b)/count(b) gives a different result from sum(b)/count(*). Subqueries and nulls ==================== You need to think carefully about how the above rules impact sub-queries, especially something like NOT IN. Example: Assume we have a companies table and a diary table. Diary entries are usually related to a particular company but not always. SELECT co_id,co_name FROM companies; co_id | co_name -------+-------------------------- 1 | Worldwide Partnership UK 2 | British Associates PLC 3 | Global Enterprises INC SELECT dy_id,dy_company FROM diary; dy_id | dy_company -------+------------ 101 | 1 102 | 2 103 | <null> SELECT co_name FROM companies WHERE co_id IN (SELECT dy_company FROM diary); co_name -------------------------- Worldwide Partnership UK British Associates PLC SELECT co_name FROM companies WHERE co_id NOT IN (SELECT dy_company FROM diary); co_name --------- (0 rows) What happened in this last case? Where did co_id=3 go to - it's certainly not in the diary. Actually, that's not true. Since there is a null in dy_company, we can't say for sure whether 3 is there or not - that null throws everything into a state of uncertainty. Looking at what's happening in the case of co_id=3: WHERE co_id NOT IN (SELECT dy_company FROM diary) WHERE 3 NOT IN (1,2,Null) WHERE NOT (3=1 OR 3=2 OR 3=Null) WHERE NOT (False OR False OR Null) WHERE NOT (Null) WHERE Null You can see that the OR-ing a Null with false values gives Null and Not(Null)=Null. We can't prove that co_id=3 isn't in the diary so we can't return that row. PART III - Advice A thought ========= Remember that advice is worth exactly what you paid for it. Where to use nulls ================== Think carefully about whether a particular field/column should be able to contain nulls. If not define it as NOT NULL when creating the table. In cases where a column holds an enumerated type (like the sex column previously) you can define your own values for not applicable and unknown. So rather than values of (M,F) you could allow values of (M,F,N,U). This makes it clear whether the value is NOT APPLICABLE or UNKNOWN. If you have a numeric or free-text field then you can't define your own replacements for null (except in the case when you can restrict permissable values). It's also not necessarily a good idea. Using negative or very large numbers to represent unknown is one of the reasons null was invented. If you have a text field, you need to decide what you store when the user leaves it blank while using your application. Do you treat that as null or an empty string? The answer, I'm afraid will depend on context but if a field can be null you'll need a way to set/display it differently from an empty string. Nulls and normalisation ======================= Using nulls to mean not applicable can indicate you haven't normalised correctly. Example 1: In the company/diary example earlier, we got tripped up with a null in the dy_company column. We could avoid this by splitting into three tables: companies, diary and co_diary with the last being the relevant id's from the other two tables. This makes it simple to identify what companies have/don't have diary entries. Example 2: Assume we have a bird_description table to record sightings by bird-spotters. You could structure it as (sighting_id, weight, wingspan, belly_colour, wing_colour, wingtip_colour, beak_colour, feading_behaviour, mating_behaviour, ...) but for any individual sighting most of these values will be null. An alternative would be to define the table as (sighting_id,aspect,value) so you could store ("S0137","wingspan","15cm") - this means you are only storing the information supplied and also makes it easier to add new aspects of a sighting. Footnotes ========= [1] The SQL standard defines it as a "special value" but it's behaviour is so different to all the other values of that type that I prefer to view it as an absence of a value. If you don't see the difference, just ignore it. ---------------------------(end of broadcast)--------------------------- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly