[SQL] RCF: 2nd draft: A brief guide to Nulls

dev Fri, 17 Jan 2003 07:07:45 -0800

Thanks all for the previous feedback. If no-one comes up with any errors
in this draft I'll call it finished.


- Richard Huxton

A Brief Guide to NULLs
======================
revision: 0.9
date:     2002-01-17
author:   Richard Huxton <[EMAIL PROTECTED]>

Overview
========
This is a short guide to the use of nulls in SQL databases. It is written
with Postgresql in mind but should be applicable to any SQL-based DBMS.

Thanks to the members of the psql-sql mailing list for their assistance in
preparing this guide.

You can get further information in:
Any good relational database book (try something written by Date or Pascal)
Bruce's book (link LHS at http://techdocs.postgresql.org)
My Postgresql Notes (link at http://techdocs.postgresql.org)


PART I - INTRODUCTION

What is a null?
===============
A null is *not* an empty string.
A null is *not* a value like others.
A null is the absence of a value[1].


What do nulls mean?
===================
Well, they *should* mean one of two things:
1. There is no applicable value
2. There is a value but it is unknown

Example 1: Imagine you have a customer table with name and sex fields.
If you get a new customer "ACME Widgets Ltd", the sex field is meaningless
since your customer is a company (case 1).
If you get a new customer "Jackie Smith" they might be male or female, but
you might not know (case 2).
Actually, since you are trying to store a company in the ACME example that
might indicate that you need to rethink your design.

Example 2: You have an address table with (street,city,county,postalcode)
fields.
You might insert an address ("10 Downing Street","London",Null,"WC1 1AA")
since you don't have a valid county.
You might also insert an address ("1 Any Street","Maidstone","Kent",Null)
where there *must be* a valid postalcode, but you don't know what it is.

It might be useful to be able to distinguish between these two cases - not
applicable and unknown, but there is only one option "Null" available to
us, so we can't.


How do nulls work?
==================
There is one very important rule when dealing with nulls. A null is
unknown and thus not equal to, less than or greater than any value it is
compared to.

Example: with the customer table above you could run the following queries:
  SELECT * FROM customer WHERE sex='M';
  SELECT * FROM customer WHERE sex<>'M';
Now you might think this returns all customers, but it will miss those
where sex is null. You've asked for all rows where the value of sex is 'M'
and all those with values not equal to 'M' but not rows with *no value at
all*

It might help to think of a database as a set of statements you *know* to
be true. A null indicates that you *cannot say anything at all* about that
field. You can't say what it is, you can't say what it isn't, you can only
say there is some information missing.

So, to see all the customers with unknown or inapplicable sex you would need:
  SELECT * FROM customer WHERE sex IS NULL;
Note that the following will not work, you need to use "IS NULL"
  SELECT * FROM customer WHERE sex=NULL;

There are actually three possible results for a test in SQL - True (the
test passed), False (the test failed) and Null (unknown or can't say). The
table below indicates the result of using AND/OR operations on a,b for
values of True,False and Null.

  a     | b     | a AND b | a OR b
  ------+-------+---------+--------
  TRUE  | TRUE  | TRUE    | TRUE
  TRUE  | FALSE | FALSE   | TRUE
  TRUE  | NULL  | NULL    | TRUE
  FALSE | FALSE | FALSE   | FALSE
  FALSE | NULL  | FALSE   | NULL
  NULL  | NULL  | NULL    | NULL

In the example of a=True,b=Null, (a AND b) is Null (which gets treated as
false for the purposes of WHERE clauses). However (a OR b) is True since
if a is True, we don't care what b is.

If you try to perform an operation on nulls, again the result is always
null. So the results of all of the following are null:
  SELECT 'abc' || null;
  SELECT 1 + null;
  SELECT sqrt(null::numeric);
The first case can be especially confusing. Concatenating a null string to
a string value will return null, not the original value. This can catch
you out if you are joining first_name to last_name and one of them
contains nulls.


How are nulls implemented?
==========================
You can think of each null-able field/column having a separate "is_null"
flag attached to it. So, if you have a column "a" of type integer, in
addition to space required to store the number, there is another bit which
says whether the item is null and the value should be ignored. Of course,
there are optimisations that get made, but that is the general idea.


PART II - IMPLICATIONS

Uniqueness and nulls
====================
If you define a unique index on a column it prevents you inserting two
values that are the same. It does not prevent you inserting as many nulls
as you like. How could it? You don't have a value so it can't be the same
as any other.

Example: We create a table "ta" with a unique constraint on column "b"
  CREATE TABLE ta (
    a int4,
        b varchar(3),
        PRIMARY KEY (a)
  );
  CREATE UNIQUE INDEX ta_b_idx ON ta (b);
  INSERT INTO ta VALUES (1,'aaa');  -- succeeds
  INSERT INTO ta VALUES (2,'bbb');  -- succeeds
  INSERT INTO ta VALUES (3,null);   -- succeeds
  INSERT INTO ta VALUES (4,'bbb');  -- fails
  INSERT INTO ta VALUES (5,null);   -- succeeds!

Given the definition of what a null is, you only have two choices: allow
multiple nulls or allow no nulls. If you want no nulls, define the column
as NOT NULL when creating the table.


Keys and nulls
==============
No column that is part of a primary key can be null. When you define a
PRIMARY KEY, none of the columns mentioned can take a null value.
Postgresql makes sure of this by defining the columns as NOT NULL for you.

Example: With table "ta" we just created, \d ta will show column a as
being not null. Otherwise, we could insert several rows with "a" set to
null and have no way to tell them apart. If the primary key was defined as
being over (a,b) then neither could be null.


Aggregates and nulls
====================
You need to be careful using count() if a column can contain nulls. The
count() function is defined as counting *values* and so skips nulls. The
same applies to other aggregates like sum() or max() but these behave more
intuitively.

Example: we have a table ta (a int4, b int4) with the following data.
   a | b
  ---+----
   1 | 10
   2 | 20
   3 | <null>

  SELECT count(*) as num_rows, count(a) as num_a, count(b) as num_b FROM ta;
   num_rows | num_a | num_b
  ----------+-------+-------
          3 |     3 |     2

  SELECT sum(b) FROM ta;
   sum
  -----
    30

If you were trying to calculate the average of column b then
sum(b)/count(b) gives a different result from sum(b)/count(*).


Subqueries and nulls
====================
You need to think carefully about how the above rules impact sub-queries,
especially something like NOT IN.

Example: Assume we have a companies table and a diary table. Diary entries
are usually related to a particular company but not always.

  SELECT co_id,co_name FROM companies;
  co_id |         co_name
 -------+--------------------------
      1 | Worldwide Partnership UK
      2 | British Associates PLC
      3 | Global Enterprises INC

  SELECT dy_id,dy_company FROM diary;
   dy_id | dy_company
  -------+------------
     101 |          1
     102 |          2
     103 |     <null>

  SELECT co_name FROM companies WHERE co_id IN (SELECT dy_company FROM
diary);
           co_name
  --------------------------
   Worldwide Partnership UK
   British Associates PLC

  SELECT co_name FROM companies WHERE co_id NOT IN (SELECT dy_company FROM
diary);
   co_name
  ---------
  (0 rows)

What happened in this last case? Where did co_id=3 go to - it's certainly
not in the diary. Actually, that's not true. Since there is a null in
dy_company, we can't say for sure whether 3 is there or not - that null
throws everything into a state of uncertainty.

Looking at what's happening in the case of co_id=3:
  WHERE co_id NOT IN (SELECT dy_company FROM diary)
  WHERE 3 NOT IN (1,2,Null)
  WHERE NOT (3=1   OR 3=2   OR 3=Null)
  WHERE NOT (False OR False OR Null)
  WHERE NOT (Null)
  WHERE Null

You can see that the OR-ing a Null with false values gives Null and
Not(Null)=Null. We can't prove that co_id=3 isn't in the diary so we can't
return that row.


PART III - Advice

A thought
=========
Remember that advice is worth exactly what you paid for it.


Where to use nulls
==================
Think carefully about whether a particular field/column should be able to
contain nulls. If not define it as NOT NULL when creating the table.

In cases where a column holds an enumerated type (like the sex column
previously) you can define your own values for not applicable and unknown.
So rather than values of (M,F) you could allow values of (M,F,N,U). This
makes it clear whether the value is NOT APPLICABLE or UNKNOWN.

If you have a numeric or free-text field then you can't define your own
replacements for null (except in the case when you can restrict
permissable values). It's also not necessarily a good idea. Using negative
or very large numbers to represent unknown is one of the reasons null was
invented.

If you have a text field, you need to decide what you store when the user
leaves it blank while using your application. Do you treat that as null or
an empty string? The answer, I'm afraid will depend on context but if a
field can be null you'll need a way to set/display it differently from an
empty string.


Nulls and normalisation
=======================
Using nulls to mean not applicable can indicate you haven't normalised
correctly.

Example 1: In the company/diary example earlier, we got tripped up with a
null in the dy_company column. We could avoid this by splitting into three
tables: companies, diary and co_diary with the last being the relevant
id's from the other two tables. This makes it simple to identify what
companies have/don't have diary entries.

Example 2: Assume we have a bird_description table to record sightings by
bird-spotters. You could structure it as (sighting_id, weight, wingspan,
belly_colour, wing_colour, wingtip_colour, beak_colour, feading_behaviour,
mating_behaviour, ...) but for any individual sighting most of these
values will be null.
An alternative would be to define the table as (sighting_id,aspect,value)
so you could store ("S0137","wingspan","15cm") - this means you are only
storing the information supplied and also makes it easier to add new
aspects of a sighting.


Footnotes
=========
[1] The SQL standard defines it as a "special value" but it's behaviour is
so different to all the other values of that type that I prefer to view it
as an absence of a value. If you don't see the difference, just ignore it.


---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly

[SQL] RCF: 2nd draft: A brief guide to Nulls

Reply via email to