On Fri, 12 Sep 2003, Ala Qumsieh wrote:

> On Fri, 12 Sep 2003, Mark Kvale wrote:
>
> > a) AI::NeuralNet::Mesh - trains up multi-layer perceptrons, a type of
> > feedforward neural net. It has good documentation. For your problem, I
> > would recommend a 3-layer net, with one input, one hidden and one
> > output layer, with tanh activation functions.
>
> Perhaps a stupid question, but since we're on the subject of ANNs:
>
> What is a good criterion for choosing the number of nodes per layer? I
> haven't been up to date with ANN literature lately, but I recall reading
> that a 3-layer network should suffice for most applications. Is that true?
>
> As for the nodes per layer, I would assume the input layer would have as
> many nodes as input variables, and the output layer would have as many
> nodes as output variables. What about the hidden layer(s)?
None of these are stupid questions.

Regarding multi-layer perceptrons, the original and most common form of
feedforward network, there are a few heuristic rules that people use to
pare down the space of possible neural architectures.

For the input layer, a continuous scalar variable is assigned to one
input. For a categorical variable with C possible values, people
typically use a 1-of-(C-1) encoding scheme. If I wanted to encode
weather as good, bad, ugly, and duck!, I would use three inputs:

  good   0 0 0
  bad    1 0 0
  ugly   0 1 0
  duck!  0 0 1

If you have enough categories and they all seem to lie along the same
axis of measurement, you might try converting it to a numeric variable:

  good    0
  bad     1
  ugly    5
  duck!  10

The output layer is similar: one output for each numeric variable, but
a 1-of-C encoding scheme for a categorical variable:

  good   1 0 0 0
  bad    0 1 0 0
  ugly   0 0 1 0
  duck!  0 0 0 1

For the hidden layers, people typically start with a single hidden
layer. As you say, it is sufficient for many purposes. There is a
theorem (universal approximation) that says you can approximate any
function with a sufficient number of hidden units in a single layer,
but that may be a lot of hidden units! If it works, a single layer is
nice because one can look at the pattern of weights and deduce which
of the factors you threw at the problem might be most important.

Sometimes, if the function to be fit is sufficiently complex, people
try two hidden layers, as this may reduce the total number of hidden
units used. Reducing the number of hidden units is good because it
reduces the number of parameters that must be learned, and thus the
amount of data needed to do a good job. As Einstein said, 'A theory
should be as simple as possible, but no simpler.' Works for computer
programs and neural nets, too.

OK, so one tries a single hidden layer first, and then maybe two
layers. But how many units per layer should be used? Some people have
created heuristics like 'one hidden unit per M input lines', etc., but
these are all crap. No such prescription is universally good over all
possible problems; there is too much variety. The only reliable method
for optimizing your architecture is to try it out! That is, use the
method of cross validation I mentioned in my first email. By testing
the NN on data that it has not been trained on, you'll get a good idea
of how it works on real-world data.

There are three regimes of behavior you will encounter:

1) Too few nodes - there isn't enough computing capacity in the NN to
   model the complexity of the data, resulting in a high error rate on
   the test set.

2) Too many nodes - the NN captures all the complexity of the
   underlying process, but also has enough capacity to fit all the
   noise and random artifacts of your particular training set. Fitting
   noise will produce answers that are off base on your test set,
   because the NN is in effect taking into account spurious causes of
   the output.

3) Just the right number of nodes - not too complex, not too simple.

As one progresses from too few nodes, to just right, to too many, the
error on the test set will typically start out high, decrease quickly
to a minimum, and then rise slowly. Because the error function itself
may be noisy (too few test samples, it's just a noisy system, etc.),
I find it best to plot the error as a function of the number of nodes
and eyeball the minimum.

--
Mark Kvale, neurobiophysicist
http://www.keck.ucsf.edu/~kvale/
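
P.S. Since this is a Perl list, here is a small sketch of the two
encoding schemes above. The sub names and the weather categories are
just made up for illustration:

  use strict;
  use warnings;

  # 1-of-(C-1) dummy coding for a categorical *input*: the first
  # category is the all-zeros baseline; each later category gets
  # its own indicator bit.
  sub encode_input {
      my ($value, @categories) = @_;
      return map { $categories[$_] eq $value ? 1 : 0 } 1 .. $#categories;
  }

  # 1-of-C coding for a categorical *output*: one unit per category.
  sub encode_output {
      my ($value, @categories) = @_;
      return map { $_ eq $value ? 1 : 0 } @categories;
  }

  my @weather = qw(good bad ugly duck!);
  print join(' ', encode_input('ugly', @weather)),  "\n";   # 0 1 0
  print join(' ', encode_output('ugly', @weather)), "\n";   # 0 0 1 0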
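
And a rough sketch of the cross-validation sweep over hidden-layer
sizes. A caveat: I'm going from memory of the AI::NeuralNet::Mesh
interface (new(), learn(), run()), so check the perldoc before
trusting the constructor form; the 5 folds and the 1..20 sweep are
arbitrary choices you should adapt to your data:

  use strict;
  use warnings;
  use List::Util qw(shuffle sum);
  use AI::NeuralNet::Mesh;

  # Mean squared error of a trained net on held-out examples.
  # Each example is [ \@inputs, \@targets ].
  sub test_error {
      my ($net, $test) = @_;
      my ($sse, $n) = (0, 0);
      for my $ex (@$test) {
          my ($in, $want) = @$ex;
          my $got = $net->run($in);
          for my $i (0 .. $#$want) {
              $sse += ($got->[$i] - $want->[$i]) ** 2;
              $n++;
          }
      }
      return $sse / $n;
  }

  my @data = ();   # fill with [ \@inputs, \@targets ] pairs
  my $k    = 5;    # number of cross-validation folds

  die "fill \@data with examples first\n" unless @data;
  @data = shuffle @data;

  my $n_in  = scalar @{ $data[0][0] };
  my $n_out = scalar @{ $data[0][1] };

  for my $n_hidden (1 .. 20) {
      my @errors;
      for my $fold (0 .. $k - 1) {
          # Every $k-th example is held out for testing on this fold.
          my (@train, @test);
          for my $i (0 .. $#data) {
              if ($i % $k == $fold) { push @test,  $data[$i] }
              else                  { push @train, $data[$i] }
          }

          # From memory: I believe new() also accepts an arrayref of
          # per-layer node counts; check the perldoc if this form
          # isn't right for your version of the module.
          my $net = AI::NeuralNet::Mesh->new([$n_in, $n_hidden, $n_out]);

          # learn() iterates internally on each example; see the
          # module's docs for its tuning options.
          $net->learn($_->[0], $_->[1]) for @train;
          push @errors, test_error($net, \@test);
      }
      printf "%2d hidden units: mean test error %.4f\n",
          $n_hidden, sum(@errors) / @errors;
  }

Plot the printed errors against the number of hidden units and eyeball
the minimum, as described above.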