--- On Sat, 9/20/08, Pei Wang <[EMAIL PROTECTED]> wrote:
> Think about a concrete example: if from one source the
> system gets
> P(A-->B) = 0.9, and P(P(A-->B) = 0.9) = 0.5, while
> from another source
> P(A-->B) = 0.2, and P(P(A-->B) = 0.2) = 0.7, then
> what will be the
> conclusion when the two sources are considered together?

This is a common problem in text prediction. In general, there is no right 
answer. You have to determine experimentally what works best. You compute the 
probability using some method, run it on some test data, and measure the 
accuracy of your predictions.

To give a more concrete example, suppose that A is some context (the last n 
bytes of text), and B is the event that the next bit is a 1. We get different 
predictions for different orders (different values of n) which we need to 
combine.

In PAQ1-PAQ3, I count zeros and ones in context A. Call these counts c0 and c1. 
Then I let P(A-->B) = c1/(c0+c1) and let the confidence (what you call 
P(P(A-->B))) be c0+c1. To combine predictions from several contexts, I add up 
the c0's and c1's and compute SUM c1 / (SUM c0 + SUM c1).
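
A minimal sketch of that combination rule (my illustration, not the actual 
PAQ source):

```python
def combine(counts):
    """Combine per-context bit counts as in PAQ1-PAQ3.
    counts: list of (c0, c1) pairs, one per context.
    Returns P(next bit = 1) = SUM c1 / (SUM c0 + SUM c1)."""
    c0 = sum(z for z, _ in counts)
    c1 = sum(o for _, o in counts)
    return c1 / (c0 + c1) if c0 + c1 else 0.5  # no evidence -> 0.5

# One context saw 1 zero and 9 ones, another saw 8 zeros and 2 ones:
p = combine([(1, 9), (8, 2)])  # (9+2)/(10+10) = 0.55
```

Note that a context with more observations automatically carries more weight, 
which is how the confidence c0+c1 enters the combination.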

I also discovered experimentally that the prediction is more accurate if the 
counts are weighted by n^2, where n is the context order. For example, the 
order-19 context:

  "the cat caught a mo_"

is a better predictor of the next symbol than the order 2 context:

  "mo_"

even though the latter has probably collected more statistics, and therefore 
has a higher confidence.
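The n^2 weighting can be sketched like this (again my illustration, under the 
assumption that each context contributes its counts scaled by the square of 
its order):

```python
def combine_weighted(counts):
    """counts: list of (n, c0, c1) triples, n = context order.
    Each context's counts are scaled by n^2 before summing, so a long
    context with few observations can outvote a short, well-sampled one."""
    w0 = sum(n * n * c0 for n, c0, _ in counts)
    w1 = sum(n * n * c1 for n, _, c1 in counts)
    return w1 / (w0 + w1) if w0 + w1 else 0.5

# Order-2 context with lots of statistics vs. order-19 with one observation:
p = combine_weighted([(2, 8, 2), (19, 0, 1)])
# w0 = 4*8 = 32, w1 = 4*2 + 361*1 = 369, p = 369/401 ~ 0.92
```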

In PAQ4-PAQ6 I adjust the weights dynamically using gradient descent of coding 
cost in weight space to reduce prediction error. This can be improved further 
by using multiple weight tables indexed by a low order context.

In PAQ7-PAQ8 I dynamically map each bit history (truncated c0,c1 plus the last 
bit) to a probability p_i using a table that is adjusted to reduce prediction 
error when the bit is observed. Then the predictions p_i are combined using a 
neural network:

  p = squash(SUM w_i stretch(p_i))

where squash(x) = 1/(1 + exp(-x)) bounds the output to (0, 1), and stretch(x) = 
ln(x / (1 - x)) is the inverse of squash. (This implicitly gives greater 
confidence to probabilities near 0 or 1). When actual bit b is observed, the 
weights are adjusted to reduce the prediction error b - p by gradient descent 
of coding cost in weight space as follows:

  w_i := w_i + L stretch(p_i) (b - p)

where L is the learning rate, typically 0.001 to 0.005. Again, we can improve 
this by using multiple weight tables indexed by a low order context. Or you can 
use multiple neural networks indexed by different order contexts and combine 
them by linear averaging or another neural network.
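The mixing and weight update above can be sketched as follows (a simplified 
illustration, not the PAQ7/PAQ8 source, which works in fixed point with 
per-context weight tables):

```python
import math

def squash(x):
    """Logistic function; bounds output to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def stretch(p):
    """Inverse of squash: ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))

def mix(ps, ws):
    """p = squash(SUM w_i stretch(p_i))"""
    return squash(sum(w * stretch(p) for p, w in zip(ps, ws)))

def update(ps, ws, b, L=0.002):
    """After observing bit b, adjust weights by gradient descent on
    coding cost: w_i := w_i + L stretch(p_i) (b - p)."""
    p = mix(ps, ws)
    return [w + L * stretch(pi) * (b - p) for pi, w in zip(ps, ws)]

ps, ws = [0.9, 0.9], [0.5, 0.5]
p = mix(ps, ws)          # squash(ln 9) = 0.9 exactly
ws = update(ps, ws, 1)   # bit was 1, so both weights increase slightly
```

A useful property of this update: an input that was confident and right gets 
its weight raised, and one that was confident and wrong gets it cut, since 
the correction is proportional to both stretch(p_i) and the error b - p.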

In PAQ9 I use chains of 2 input neural networks, where one input is the 
previous prediction from the next lower context order and the other input is 
fixed. The weight table is selected by the bit history in the next higher 
context. This method is still experimental. It works well for simple n-gram 
models but worse than PAQ8 when there are large numbers of approximately 
equally good predictions to combine, such as when semantic (cat ~ mouse) and 
other contexts are added.

-- Matt Mahoney, [EMAIL PROTECTED]
