Hi,

I've just uploaded a Decision Tree module to CPAN.  It provides 
a pretty bare-bones implementation of the classic ID3 decision 
tree building algorithm.

I plan on speaking about this module as part of my presentation 
at TPC in July, so I might add a few more features to it if I 
decide it needs them in order to be shown in public.  See the 
"LIMITATIONS" section of the docs.

Since it's a new module, I thought I'd include the docs here.

  -Ken


NAME
     AI::DecisionTree - Automatically Learns Decision Trees

SYNOPSIS
       use AI::DecisionTree;
       my $dtree = new AI::DecisionTree;

       # A set of training data for deciding whether to play tennis
       $dtree->add_instance
         (attributes => {outlook     => 'sunny',
                         temperature => 'hot',
                         humidity    => 'high'},
          result => 'no');

       $dtree->add_instance
         (attributes => {outlook     => 'overcast',
                         temperature => 'hot',
                         humidity    => 'normal'},
          result => 'yes');

       # ... repeat for several more instances, then:
       $dtree->train;

       # Find results for unseen instances
       my $result = $dtree->get_result
         (attributes => {outlook     => 'sunny',
                         temperature => 'hot',
                         humidity    => 'normal'});

DESCRIPTION
     The "AI::DecisionTree" module automatically creates 
so-called "decision
     trees" to explain a set of training data. A decision tree is 
a kind of
     categorizer that use a flowchart-like process for categorizing new
     instances. For instance, a learned decision tree might look like the
     following, which classifies for the concept "play tennis":

                        OUTLOOK
                        /  |  \
                       /   |   \
                      /    |    \
                sunny/  overcast \rainy
                    /      |      \
               HUMIDITY    |       WIND
               /  \       *no*     /  \
              /    \              /    \
         high/      \normal      /      \
            /        \    strong/        \weak
          *no*      *yes*      /          \
                             *no*        *yes*

     (This example, and the inspiration for the "AI::DecisionTree"
     module, come directly from Tom Mitchell's excellent book "Machine
     Learning", available from McGraw Hill.)

     A decision tree like this one can be learned from training data, and
     then applied to previously unseen data to obtain results that are
     consistent with the training data.

     The usual goal of a decision tree is to somehow encapsulate the
     training data in the smallest possible tree. This is motivated by an
     "Occam's Razor" philosophy, in which the simplest possible
     explanation for a set of phenomena should be preferred over other
     explanations. Also, small trees will make decisions faster than
     large trees, and they are much easier for a human to look at and
     understand. One of the biggest reasons for using a decision tree
     instead of many other machine learning techniques is that a decision
     tree is a much more scrutable decision maker than, say, a neural
     network.

     The current implementation of this module uses an extremely simple
     method for creating the decision tree based on the training
     instances. It uses an Information Gain metric (based on expected
     reduction in entropy) to select the "most informative" attribute at
     each node in the tree. This is essentially the ID3 algorithm,
     developed by J. R. Quinlan in 1986. The idea is that the attribute
     with the highest Information Gain will (probably) be the best
     attribute to split the tree on at each point if we're interested in
     making small trees.
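
     As a rough, stand-alone illustration of that metric (not code taken
     from the module itself - the "entropy" and "gain" subroutine names
     below are invented for this example), here is how the entropy of a
     result set and the Information Gain of splitting on one attribute
     could be computed; the instance format mirrors the "add_instance()"
     calls in the SYNOPSIS:

       use strict;
       use warnings;

       # entropy(@results): -sum p*log2(p) over the distinct results
       sub entropy {
         my %count;
         $count{$_}++ for @_;
         my $total = @_;
         my $h = 0;
         for my $n (values %count) {
           my $p = $n / $total;
           $h -= $p * log($p) / log(2);
         }
         return $h;
       }

       # gain(\@instances, $attr): expected reduction in entropy from
       # splitting on $attr.  Each instance is a hash reference like
       # { attributes => { ... }, result => '...' }.
       sub gain {
         my ($instances, $attr) = @_;
         my $base = entropy(map $_->{result}, @$instances);

         my %split;
         push @{ $split{ $_->{attributes}{$attr} } }, $_->{result}
           for @$instances;

         my $after = 0;
         for my $subset (values %split) {
           $after += @$subset / @$instances * entropy(@$subset);
         }
         return $base - $after;
       }

     At each node, ID3 simply picks the attribute whose gain is largest
     over the instances that reach that node.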

METHODS
   new()

     Creates a new decision tree object.

   add_instance()

     Adds a training instance to the set of instances which will be used
     to form the tree. An "attributes" parameter specifies a hash of
     attribute-value pairs for the instance, and a "result" parameter
     specifies the result.

   train()

     Builds the decision tree from the list of training instances.

   get_result()

     Returns the most likely result (from the set of all results given to
     "add_instance()") for the set of attribute values given. An
     "attributes" parameter specifies a hash of attribute-value pairs for
     the instance. If the decision tree doesn't have enough information
     to find a result, it will return "undef".
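
     For example, a caller may want to distinguish a real prediction from
     the "not enough information" case. This snippet just reuses the
     $dtree object and attribute names from the SYNOPSIS:

       my $result = $dtree->get_result
         (attributes => {outlook     => 'sunny',
                         temperature => 'hot',
                         humidity    => 'normal'});

       if (defined $result) {
         print "Predicted result: $result\n";
       } else {
         print "The tree can't classify this instance\n";
       }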

   rule_statements()

     Returns a list of strings that describe the tree in rule-form. For
     instance, for the tree diagram above, the following list would be
     returned (though not necessarily in this order - the order is
     unpredictable):

       if outlook='rainy' and wind='strong' -> 'no'
       if outlook='rainy' and wind='weak' -> 'yes'
       if outlook='sunny' and humidity='normal' -> 'yes'
       if outlook='sunny' and humidity='high' -> 'no'
       if outlook='overcast' -> 'yes'

     This can be helpful for scrutinizing the structure of a tree.

     Note that while the order of the rules is unpredictable, the order
     of criteria within each rule reflects the order in which the
     criteria will be checked at decision-making time.
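
     For example, once "train()" has been called, the rules can be dumped
     one per line (again using the $dtree object from the SYNOPSIS):

       print "$_\n" for $dtree->rule_statements;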

LIMITATIONS
     A few limitations exist in the current version. All of them could be
     removed in future versions - especially with your help. =)

   No continuous attributes

     In the current implementation, only discrete-valued attributes are
     supported. This means that an attribute like "temperature" can have
     values like "cool", "medium", and "hot", but using actual
     temperatures like 87 or 62.3 is not going to work. This is because
     the values would split the data too finely - the tree-building
     process would probably think that it could make all its decisions
     based on the exact temperature value alone, ignoring all other
     attributes, because each temperature would have only been seen once.

     The usual way to deal with this problem is for the tree-building
     process to figure out how to place the continuous attribute values
     into a set of bins (like "cool", "medium", and "hot") and then build
     the tree based on these bin values. Future versions of
     "AI::DecisionTree" may provide support for this. For now, you have
     to do it yourself.
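
     One simple way to do the binning yourself is to map raw numbers to
     labels before handing them to "add_instance()". The helper below is
     purely illustrative - its name and its cutoff values are not part of
     the module:

       # Map a raw temperature (degrees F) to a discrete label.
       # The cutoffs are arbitrary - choose ones that suit your data.
       sub temperature_bin {
         my $degrees = shift;
         return 'cool'   if $degrees < 65;
         return 'medium' if $degrees < 80;
         return 'hot';
       }

       $dtree->add_instance
         (attributes => {outlook     => 'sunny',
                         temperature => temperature_bin(87),  # 'hot'
                         humidity    => 'high'},
          result => 'no');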

   No support for noisy or inconsistent data

     Currently, the tree-building technique cannot handle inconsistent
     data. That is, if you have two contradictory training instances, the
     "train()" method will throw an exception. Future versions may allow
     inconsistent data, finding the decision tree that most closely
     matches the data rather than precisely matching it.
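
     Until then, if you suspect your data contains contradictions, you
     can screen them out before they reach the tree. The sketch below
     keeps the first result seen for each distinct attribute combination
     and silently drops later contradictions - a crude policy, shown only
     as an illustration. Here @training_data is an assumed list of hash
     references shaped like the "add_instance()" arguments:

       my %seen;
       for my $instance (@training_data) {
         my %attr = %{ $instance->{attributes} };
         my $key  = join ';', map "$_=$attr{$_}", sort keys %attr;

         # Skip instances that contradict an earlier one
         next if exists $seen{$key}
             and $seen{$key} ne $instance->{result};

         $seen{$key} = $instance->{result};
         $dtree->add_instance(%$instance);
       }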

   No support for tree-trimming

     Most decision tree building algorithms use a two-stage building
     process - first a tree is built that completely fits the training
     data (or fits it as closely as possible if noisy data is supported),
     and then the tree is pruned so that it will generalize well to new
     instances. This might be done either by maximizing performance on a
     set of held-out training instances, or by pruning parts of the tree
     that don't seem like they'll be very valuable.

     Currently, we build a tree that completely fits the training data,
     and we don't prune it. That means that the tree may overfit the
     training data in many cases - i.e., you won't be able to see the
     forest for the trees (or, more accurately, the tree for the leaves).

     This is mainly a problem when you're using "real world" or noisy
     data. If you're using data that you know to be a result of a
     rule-based process and you just want to figure out what the rules
     are, the current implementation should work fine for you.
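
     If you'd like a rough idea of how much the unpruned tree overfits
     your data, one approach (not provided by the module) is to set aside
     some instances before training and measure accuracy on them
     afterwards. Here @held_out is an assumed, non-empty list of hash
     references shaped like the "add_instance()" arguments:

       my ($correct, $total) = (0, 0);
       for my $instance (@held_out) {
         my $predicted = $dtree->get_result
           (attributes => $instance->{attributes});
         $correct++ if defined $predicted
                   and $predicted eq $instance->{result};
         $total++;
       }
       printf "Held-out accuracy: %.1f%%\n", 100 * $correct / $total;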

AUTHOR
     Ken Williams, [EMAIL PROTECTED]

SEE ALSO
     Mitchell, Tom (1997). Machine Learning. McGraw-Hill. pp 52-80.

     Quinlan, J. R. (1986). Induction of decision trees. Machine
     Learning, 1(1), pp 81-106.

     the perl manpage.
