Hello all, I tried the bank-sentiment analysis problem with POSES/MOSES. I got the inference results as described in the txt document, but I don't understand them.
The goal is to learn a model that predicts the column Q3, a value between 1 and 11. The command to make an inference from the trained model is:

$ poses inference -m bank.model -p Q3 -c bankcase.hcs

bankcase.hcs contains words and their counts in the text:

great : 1
fee : 0
they : 0
best : 3
servic : 1

After running this command, I expected each word to be classified into a specific category (between 1 and 11). Instead, I got a single centroid value, 10.5. What does this mean? The data used for training (bank.dat) has no category 10.5 in the column Q3. Do all the words in the .hcs file belong to the same category? And since the output is 10.5, which category do the words belong to, 10 or 11? How should I interpret this?

Thanks in advance,
Vishnu

--
You received this message because you are subscribed to the Google Groups "opencog" group.
To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/eb063f9b-83fd-4a1b-b0a6-27f4a96e9275%40googlegroups.com.
Tutorial for the Banking Sentiment Dataset
------------------------------------------
June 2012
The Banking sentiment dataset is a compilation of customer satisfaction
survey data. It is a digested version of answers both to multiple-choice
questions, and free-form write-in comments. The goal is to correlate
the text of the comments with the answers to the multiple-choice
questions.
1. Installation of PM
---------------------
See README.txt in the main directory.
2. Convert the survey spreadsheet into a .dat file
--------------------------------------------------
This step has already been done; the result of conversion can be found
in the file "bank.dat" in this directory. The raw spreadsheet data and
conversion scripts can be found in the "example-bank-xls" directory.
Due to its large size, not all distributions of PM will include this
raw dataset.
3. Apply Poulin-MOSES on the banking data
-----------------------------------------
PM can perform inference and provide other kinds of insights into the
data. Before doing so, a model of the data must be learned. This first
step is the training step.
The goal is to learn a model that predicts Q3, a value between 1 and
11, representing customer satisfaction, given free-form comments left
by the customer.
3.1. Training
-------------
The following command will learn a model that predicts Q3:
$ poses moses -d bank.dat -m bank.model -p Q3 -e 10000 -n 11 -F "-ainc -C15
-j4" -M "-Z1 -j4"
The flags used are:
-d bank.dat indicates the file containing the training data
-m bank.model indicates the file where to write the model
-p Q3 is the target feature to predict
-e 10000 indicates that MOSES will run 10000 evaluations per classifier
As a general rule, the larger this number, the more accurate the
resulting model. It would not be unusual to specify counts of
half-a-million, resulting in run-times of hours or days.
-n 11 indicates the number of levels into which the target feature
will be partitioned. Here, 11 levels correspond to ten
binary classifiers, each attempting to determine
whether the target variable Q3 reaches one of ten thresholds:
Q3>=2, Q3>=3, ..., Q3>=11. For this small dataset, this many
levels is far too large, as will be seen with the accuracy
command, later.
-F "-ainc -C15 -j4" is a set of options to be passed to
feature-selection.
-ainc indicates which feature selection algorithm to use;
here, the incremental mutual-information algo is used.
-C15 specifies that fifteen of the most relevant, non-redundant
features should be selected during learning.
-j4 specifies that 4 jobs (threads) should be used.
-M "-Z1 -j4" is a set of options to be passed to MOSES.
-Z1 specifies that the cross-over search feature be enabled.
This can often (but not always) improve search performance
or results. The 'moses' command has many other options that
can improve performance, depending on the dataset
characteristics. See the 'moses' man page for details.
-j4 is the number of threads to use, here 4.
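The level-partitioning described for -n above can be illustrated with a
small sketch (plain Python; illustrative only, not PM's actual code): an
11-level target yields ten binary targets, one per non-strict threshold.

```python
# Sketch: partition an 11-level target into ten binary classification
# targets, one per non-strict threshold Q3 >= k (k = 2..11).

def binarize_target(q3_values, n_levels=11):
    """Return {threshold: [0/1 labels]} for thresholds 2..n_levels."""
    return {k: [1 if v >= k else 0 for v in q3_values]
            for k in range(2, n_levels + 1)}

labels = binarize_target([1, 5, 11, 8])
# Each of the ten classifiers gets its own 0/1 column, e.g.:
#   labels[2]  == [0, 1, 1, 1]
#   labels[11] == [0, 0, 1, 0]
```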
We can look at the model that was learned:
$ head bank.model -n30
<ensemble>
Q3->2.0
1 true
1 or(!$not_t0.131868131868 !$servic_t0.307692307692)
1 or(!$not_t0.131868131868 !$onlin_t0.0659340659341)
1 or(!$have_t0.0659340659341 !$onlin_t0.0659340659341)
1 or(!$becaus_t0.0659340659341 !$not_t0.131868131868
!$servic_t0.307692307692)
1 or(!$becaus_t0.0659340659341 !$not_t0.131868131868
!$onlin_t0.0659340659341)
1 or(!$becaus_t0.0659340659341 !$onlin_t0.0659340659341
$other_t0.0549450549451)
1 or($becaus_t0.0659340659341 !$not_t0.131868131868
!$onlin_t0.0659340659341)
1 or(!$good_t0.164835164835 !$not_t0.131868131868 !$onlin_t0.0659340659341)
1 or($good_t0.164835164835 !$not_t0.131868131868 !$onlin_t0.0659340659341)
</ensemble>
<ensemble>
Q3->3.0
1 or(!$they_t0.120879120879 !$not_t0.131868131868)
1 or(!$they_t0.120879120879 !$custom_t0.0879120879121)
1 or(!$custom_t0.0879120879121 !$friend_t0.0549450549451)
1 or(!$custom_t0.0879120879121 $servic_t0.307692307692)
1 or($becaus_t0.0659340659341 !$they_t0.120879120879
!$custom_t0.0879120879121)
bank.model contains all ten classifiers, Q3->2.0, Q3->3.0, etc.,
indicating their respective thresholds. The tFLOAT suffix appended to each
variable name indicates the threshold on that word's frequency of
occurrence in the message. Just like the threshold on the output, this
threshold is non-strict. Finally, the number in front of each model (here,
all 1) is the weight given to that model by the voting algorithm.
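The thresholded variables can be read mechanically. A hedged sketch of how
one disjunctive clause from bank.model evaluates (the function and literal
representation here are illustrative, not part of PM): a literal like
$servic_t0.307692307692 is true when the relative frequency of "servic"
reaches the threshold (non-strictly), and "!" negates it.

```python
# Illustrative evaluator for one clause from bank.model, e.g.
#   or(!$not_t0.131868131868 !$servic_t0.307692307692)
# A literal is (word, threshold, negated); it passes when the word's
# relative frequency is >= threshold (non-strict, as in the text).

def eval_clause(literals, freqs):
    """literals: list of (word, threshold, negated); freqs: word -> rel. freq."""
    def literal(word, thresh, negated):
        passed = freqs.get(word, 0.0) >= thresh
        return not passed if negated else passed
    return any(literal(*lit) for lit in literals)

clause = [("not", 0.131868131868, True), ("servic", 0.307692307692, True)]
# A message where "servic" is frequent but "not" is absent:
print(eval_clause(clause, {"servic": 0.4}))   # True  (!$not_... is satisfied)
```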
3.2. Inference
--------------
Now let's use the learned model to predict the value of Q3, given a set
of inputs in the file 'bankcase.hcs':
$ head bankcase.hcs
great : 1
fee : 0
they : 0
best : 3
servic : 1
The value after the colon indicates the number of times the word
on the left appears in the message. Words in the .hcs file must be
stemmed in the same way as the dataset (in this case, with
Porter's algorithm).
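Preparing such a file amounts to counting stemmed words. A minimal sketch
(assuming stemming has already been applied upstream by Porter's
algorithm; this only handles the counting and formatting):

```python
from collections import Counter

# Sketch: turn an already-stemmed message into .hcs-style "word : count"
# lines, one per vocabulary word, with zero counts for absent words.

def to_hcs(stemmed_words, vocabulary):
    counts = Counter(stemmed_words)
    return "\n".join(f"{w} : {counts[w]}" for w in vocabulary)

print(to_hcs(["great", "best", "best", "servic", "best"],
             ["great", "fee", "they", "best", "servic"]))
# great : 1
# fee : 0
# they : 0
# best : 3
# servic : 1
```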
$ poses inference -m bank.model -p Q3 -c bankcase.hcs
The flags used are:
-m bank.model is the file of model previously learned
-p Q3 the target feature, same as before
-c bankcase.hcs is the file of the case to test
The output printed is a single prediction for the whole case (not a
per-word classification): the centroid for the expected Q3 value.
In this case, it is 10.5, a value between levels 10 and 11.
3.3. VOI -- Variables of Interest
---------------------------------
The voivar command extracts the variables used by the model and
computes their mutual information relative to the parent variable.
$ poses voivar -d bank.dat -m bank.model -p Q3 -o bank.voi
The flags used are:
-d bank.dat is the file of the training data
-m bank.model is the file of model previously learned
-p Q3 the target feature, same as before
-o bank.voi is the output file where to dump the VOI result
The output is:
$ head bank.voi
voi(onli=40.5312163794)
voi(recommend=27.5938046739)
voi(there=24.6305266323)
voi(they=16.5564632505)
voi(not=14.4232890056)
voi(friend=13.1772030679)
voi(do=11.2473845306)
voi(becaus=9.53928821138)
voi(bank=6.12103885756)
voi(just=6.02169414068)
So the most important word is "onli" (the stemmed version of "only")
with score 40.5, followed by "recommend". Note that 'function words',
such as "they", "there", "not", etc. appear in this list; there is no
attempt to remove these. Research, such as that from Pennebaker, et al.,
shows that such words are the dominant predictors of feelings and
emotional state; they are better predictors than 'content words', and
should *not* be filtered or ignored.
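The mutual-information score behind voivar can be sketched for the binary
case (illustrative only; voivar's own scaling is evidently different,
since the scores above exceed 1 bit):

```python
import math

# Sketch: mutual information I(X;Y) between a binary word-occurrence
# feature X and a binarized target Y, from a joint count table.
# counts[x][y] = number of rows with X=x, Y=y. Result in bits.

def mutual_information(counts):
    total = sum(sum(row) for row in counts)
    px = [sum(row) / total for row in counts]
    py = [sum(counts[x][y] for x in range(2)) / total for y in range(2)]
    mi = 0.0
    for x in range(2):
        for y in range(2):
            pxy = counts[x][y] / total
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px[x] * py[y]))
    return mi

print(mutual_information([[25, 0], [0, 25]]))    # 1.0 bit (fully informative)
print(mutual_information([[10, 10], [10, 10]]))  # 0.0 bits (independent)
```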
3.4. Accuracy
-------------
This command allows us to perform cross-validation to test that
the methodology did not result in a model that over-fits the data.
So let's take the same command line used for training, but using the
command 'accuracy' instead of 'moses', and indicate the number of
cross validations to perform.
$ poses accuracy -f 5 -d bank.dat -m bank.model -p Q3 -e 10000 -n 11 -F
"-ainc -C15 -j4" -M "-Z1 -j4"
The flags used are:
-f 5 indicates 5-fold cross-validation: the training data will be
divided into 5 subsets. Each subset, in turn, is held out as a
testing set while a new model is learned on the remaining 4/5 of
the data; the learned model is then evaluated on the held-out
subset.
Note that this command will take roughly 5 times longer to run than the
original training run: essentially, 5 distinct training runs will be
performed and evaluated. Thus, assessing the accuracy of long training
runs is not to be lightly undertaken.
Train matrix:
Classifier results in columns, expected results in rows.
0 0 0 0 1 1 1 4 1 0 0
0 1 0 1 1 1 0 0 0 0 0
0 0 12 0 1 2 1 0 0 0 0
0 0 0 0 2 2 1 4 3 0 0
0 0 0 1 7 0 0 4 0 0 0
0 0 0 0 1 10 2 16 3 0 0
0 0 0 0 0 2 21 16 1 0 0
0 0 0 0 0 0 1 37 2 0 0
0 0 0 0 0 0 0 21 19 19 1
0 0 0 0 0 0 1 15 8 34 2
0 0 0 0 0 0 0 17 5 41 17
Accuracy: 0.434065934066 (158 correct out of 364 total)
----
Test matrix:
Computed results in columns, expected results in rows.
0 0 0 0 0 1 0 1 0 0 0
0 0 0 0 0 0 0 1 0 0 0
0 0 2 0 1 0 1 0 0 0 0
0 0 0 0 0 0 0 0 1 2 0
0 0 0 0 1 0 0 1 1 0 0
0 0 0 0 2 0 1 5 0 0 0
0 0 0 0 1 0 2 1 4 0 2
0 0 1 0 0 1 1 5 0 1 1
0 0 0 0 0 1 1 4 0 8 1
0 0 0 0 1 0 0 6 0 8 0
0 0 1 0 1 2 1 4 1 10 0
Accuracy: 0.197802197802 (18 correct out of 91 total)
The training matrix presents a bin-count of expected answers, in rows,
versus the results obtained by the classifier, in columns. A perfectly
accurate classifier would have entries only along the diagonal. That is,
the classifier would always produce the expected result. The accuracy
is the fraction of correct answer: that is, the sum of diagonal entries
divided by the sum of all entries.
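The accuracy computation can be checked directly from the train matrix
above (plain Python, just summing the diagonal):

```python
# Reproduce the training accuracy from the confusion matrix above:
# correct = sum of diagonal entries, accuracy = correct / total.

train = [
    [0, 0, 0, 0, 1, 1, 1, 4, 1, 0, 0],
    [0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 12, 0, 1, 2, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 2, 2, 1, 4, 3, 0, 0],
    [0, 0, 0, 1, 7, 0, 0, 4, 0, 0, 0],
    [0, 0, 0, 0, 1, 10, 2, 16, 3, 0, 0],
    [0, 0, 0, 0, 0, 2, 21, 16, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 37, 2, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 21, 19, 19, 1],
    [0, 0, 0, 0, 0, 0, 1, 15, 8, 34, 2],
    [0, 0, 0, 0, 0, 0, 0, 17, 5, 41, 17],
]

correct = sum(train[i][i] for i in range(len(train)))
total = sum(map(sum, train))
print(correct, total, correct / total)   # 158 364 0.4340659340659341
```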
The test matrix presents the same bin-count, but for the test subset of
the data. That is, the classifier was trained on the training set, but
scored on the test subset. Since the training and test subsets are
disjoint, a well-behaved classifier would have similar accuracy on both
the training and the test matrix. A markedly worse score on the test
matrix than on the training matrix indicates that the classifier is
over-fitting the data. That is, the classifier has accounted for nuances
in the training set that are simply not present in the "real data", as
captured by the test matrix.
In this case, the accuracy is particularly bad due to the short training
run (the -e flag), the large number of output levels (the -n flag), a
failure to discretize the inputs more finely (the -s flag was not used),
and the small size of the dataset (455 rows), given the large number of
features (38) that are used as inputs (that is, the input data are fairly
randomly distributed, and inherently cannot carry much predictive value).
For example, a longer run provides the following:
$ poses accuracy -f5 -dbank.dat -mbank.model -pQ3 -e40000 -n5 -s3 -F "-ainc
-C25" -M "-Z1"
Train matrix:
Classifier results in columns, expected results in rows.
14 5 8 1 0
0 14 10 0 0
0 1 103 6 2
0 0 25 91 4
0 0 7 38 35
Accuracy: 0.706043956044 (257 correct out of 364 total)
----
Test matrix:
Classifier results in columns, expected results in rows.
2 1 4 0 0
0 0 3 2 1
1 4 9 11 3
0 0 12 14 4
0 2 6 11 1
Accuracy: 0.285714285714 (26 correct out of 91 total)
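The 5-fold protocol used by the 'accuracy' command can be sketched as
follows (a generic k-fold split, not PM's internal code):

```python
# Generic k-fold cross-validation split, as described for -f 5:
# each fold in turn becomes the test set; the rest is the training set.

def kfold_splits(n_rows, k):
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f in range(k) if f != held_out for i in folds[f]]
        yield train, test

# With 455 rows and k=5, each split trains on 364 rows and tests on 91,
# matching the totals reported in the matrices above.
for train_idx, test_idx in kfold_splits(455, 5):
    assert len(train_idx) == 364 and len(test_idx) == 91
```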
4.0 The End
-----------
This is the end of the banking tutorial.
