On Fri, 12 Sep 2003, Ala Qumsieh wrote:

> On Fri, 12 Sep 2003, Mark Kvale wrote:
> 
> > a) AI::NeuralNet::Mesh - trains up multi-layer perceptrons, a type of
> > feedforward neural net. It has good documentation. For your problem, I
> > would recommend a 3-layer net, with one input, one hidden and one
> > output layer, with tanh activation functions.
> 
> Perhaps a stupid question, but since we're on the subject of ANNs:
> 
> What is a good criterion for choosing the number of nodes per layer? I
> haven't been up-to-date with ANN literature lately, but I recall reading
> that a 3-layer network should suffice for most applications. Is that true?
> 
> As for the nodes per layer, I would assume the input layer would have as
> many nodes as input variables, and the output layer as many nodes as
> output variables. What about the hidden layer(s)?

None of these are stupid questions.

Regarding multi-layer perceptrons, the original and most common form
of feedforward network, there are a few heuristic rules that people
use to pare down the space of possible neural architectures.

For the input layer, a continuous scalar variable is assigned to one
input. For a categorical variable with C possible values, people
typically use a 1-of-(C-1) encoding scheme. If I wanted to encode
weather as good, bad, ugly, and duck!, I would use three inputs:
good  0  0  0
bad   1  0  0
ugly  0  1  0
duck! 0  0  1
If you have enough categories and they all seem to lie along the same
axis of measurement, you might try converting it to a numeric variable:
good  0
bad   1
ugly  5
duck! 10

The output layer is similar: one output for each numeric variable,
but a 1-of-C encoding scheme for a categorical variable:
good  1 0 0 0
bad   0 1 0 0
ugly  0 0 1 0
duck! 0 0 0 1
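
To make the two encodings concrete, here is a quick Python sketch.
The category list is the toy example above, and the function names are
mine, not part of any module:

CATEGORIES = ["good", "bad", "ugly", "duck!"]

def encode_input(value, categories=CATEGORIES):
    # 1-of-(C-1) scheme: the first category maps to all zeros, and each
    # remaining category turns on exactly one of C-1 inputs.
    vec = [0.0] * (len(categories) - 1)
    idx = categories.index(value)
    if idx > 0:
        vec[idx - 1] = 1.0
    return vec

def encode_output(value, categories=CATEGORIES):
    # 1-of-C scheme: one output per category.
    vec = [0.0] * len(categories)
    vec[categories.index(value)] = 1.0
    return vec

print(encode_input("ugly"))   # [0.0, 1.0, 0.0]
print(encode_output("ugly"))  # [0.0, 0.0, 1.0, 0.0]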

For the hidden layer, typically people start with a single hidden
layer. As you say, it is sufficient for many purposes. The universal
approximation theorem says you can approximate any reasonable function
with a sufficient number of hidden units in a single layer, but that
may be a lot of hidden units! If it works, a single hidden layer is
nice because one can look at the pattern of weights and deduce which
of the factors you threw at the problem might be most important.
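
For reference, here is what the forward pass of such a net looks like
in Python with numpy. The sizes and random weights are arbitrary
stand-ins; a real net would have trained weights:

import numpy as np

rng = np.random.default_rng(0)

n_in, n_hidden, n_out = 3, 5, 1  # e.g. 3 encoded weather inputs, 1 output
W1 = rng.normal(scale=0.1, size=(n_hidden, n_in))   # input -> hidden
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_out, n_hidden))  # hidden -> output
b2 = np.zeros(n_out)

def forward(x):
    h = np.tanh(W1 @ x + b1)  # tanh activation in the hidden layer
    return W2 @ h + b2        # linear output layer

print(forward(np.array([0.0, 1.0, 0.0])))  # one 'ugly' day, encoded as above

Eyeballing the magnitudes in W1 gives a rough idea of which inputs the
hidden units pay attention to.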

Sometimes, if the function to be fit is sufficiently complex, people
might try two hidden layers, as this may reduce the total number of
hidden units used. Reducing the number of hidden units is good because
it reduces the number of parameters that must be learned, and thus the
amount of data needed to do a good job. As Einstein said, 'A theory
should be as simple as possible, but no simpler.' Works for computer
programs and neural nets, too.
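
A little arithmetic shows the effect. Counting weights and biases for
fully connected layers (the layer sizes below are invented just for
the comparison):

def n_params(layers):
    # layers = [inputs, hidden..., outputs]; each layer is fully
    # connected to the next, with one bias per non-input unit.
    return sum(layers[i] * layers[i + 1] + layers[i + 1]
               for i in range(len(layers) - 1))

print(n_params([10, 40, 1]))      # one wide hidden layer:    481 parameters
print(n_params([10, 12, 12, 1]))  # two narrow hidden layers: 301 parameters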

Ok, so one tries a single hidden layer first, and then maybe two
layers. But how many units per layer should be used? Some people have
created heuristics like 'one hidden unit per M input lines', etc., but
these are all crap. No such prescription is universally good over all
possible problems. There is too much variety.

The only reliable method for optimizing your architecture is to try it
out! That is, use the method of cross validation I mentioned in my
first email (there is a sketch of the procedure after the list below).
By testing the NN on data that it has not been trained on, you'll get
a good idea of how it will perform on real-world data. There are three
regimes of behavior you will encounter:
1) Too few nodes - there isn't enough computing capacity in the NN to
model the complexity of the data, resulting in a high error rate on
the test set.
2) Too many nodes - the NN captures all the complexity of the
underlying process, but also has enough capacity to fit all the noise
and random artifacts of your particular training set. Fitting noise
will produce answers that are off base on your test set, because the
NN is in effect taking into account spurious causes of the output.
3) Just the right number of nodes - not too complex, not too simple.
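
Here is the promised sketch of the whole procedure in Python/numpy: a
toy problem, a bare-bones trainer for a single-hidden-layer tanh net,
and a sweep over hidden-layer sizes. All of it (the data, the training
loop, the sizes swept) is a stand-in for your actual problem and tool:

import numpy as np

rng = np.random.default_rng(1)

# Toy regression problem: y = sin(3x) plus noise.
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X) + 0.1 * rng.normal(size=(200, 1))
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]  # held out, never used for training

def train_mlp(X, y, n_hidden, epochs=2000, lr=0.05):
    # Train a single-hidden-layer tanh net by plain gradient descent.
    n_in, n_out = X.shape[1], y.shape[1]
    W1 = rng.normal(scale=0.5, size=(n_in, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.5, size=(n_hidden, n_out))
    b2 = np.zeros(n_out)
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)        # forward pass, hidden layer
        err = H @ W2 + b2 - y           # residuals at the linear output
        gW2 = H.T @ err / len(X)        # gradient, hidden -> output weights
        gb2 = err.mean(axis=0)
        dH = (err @ W2.T) * (1 - H**2)  # backprop through tanh
        gW1 = X.T @ dH / len(X)         # gradient, input -> hidden weights
        gb1 = dH.mean(axis=0)
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2
    return W1, b1, W2, b2

def mse(params, X, y):
    W1, b1, W2, b2 = params
    return float(np.mean((np.tanh(X @ W1 + b1) @ W2 + b2 - y) ** 2))

# Sweep the hidden-layer size; plot (or eyeball) test error vs. size.
for n_hidden in (1, 2, 4, 8, 16, 32):
    params = train_mlp(X_train, y_train, n_hidden)
    print(n_hidden, mse(params, X_test, y_test))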

As one progresses from too few nodes, to just right, to too many,
the error function will typically start out high, decrease quickly to
a minimum, and then rise slowly. Because the error estimate itself may
be noisy (too few test samples, it's just a noisy system, etc.), I
find it best to plot the error as a function of the number of nodes
and eyeball the minimum.

--
Mark Kvale, neurobiophysicist
http://www.keck.ucsf.edu/~kvale/

