--- On Sat, 9/20/08, Pei Wang <[EMAIL PROTECTED]> wrote:

> Think about a concrete example: if from one source the system gets
> P(A-->B) = 0.9, and P(P(A-->B) = 0.9) = 0.5, while from another source
> P(A-->B) = 0.2, and P(P(A-->B) = 0.2) = 0.7, then what will be the
> conclusion when the two sources are considered together?
This is a common problem in text prediction. In general, there is no right answer. You have to determine experimentally what works best: compute the probability using some method, run it on test data, and measure the accuracy of your predictions.

To give a more concrete example, suppose that A is some context (the last n bytes of text) and B is the event that the next bit is a 1. We get different predictions for different orders (different values of n), which we need to combine.

In PAQ1-PAQ3, I count zeros and ones in context A. Call these counts c0 and c1. Then I let P(A-->B) = c1/(c0+c1) and let the confidence (what you call P(P(A-->B))) be c0+c1. To combine predictions, I add up the c0's and c1's and compute SUM c1 / (SUM c0 + SUM c1). I also discovered experimentally that the prediction is more accurate if the counts are weighted by n^2. For example, the order 19 context "the cat caught a mo_" is a better predictor of the next symbol than the order 2 context "mo_", even though the latter has probably collected more statistics and therefore has higher confidence.

In PAQ4-PAQ6, I adjust the weights dynamically by gradient descent of coding cost in weight space to reduce prediction error. This can be improved further by using multiple weight tables indexed by a low order context.

In PAQ7-PAQ8, I dynamically map each bit history (truncated c0, c1 plus the last bit) to a probability p_i using a table that is adjusted to reduce prediction error when the bit is observed. Then the predictions p_i are combined using a neural network:

  p = squash(SUM_i w_i stretch(p_i))

where squash(x) = 1/(1 + exp(-x)) bounds the output to (0, 1), and stretch(x) = ln(x/(1 - x)) is the inverse of squash. (This implicitly gives greater confidence to probabilities near 0 or 1.)
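The count-based combination could be sketched roughly as follows (a minimal sketch, not the PAQ source; the function and variable names are my own, and I assume a flat list of per-order counts):

```python
# Hypothetical sketch of PAQ1-PAQ3 style count combination: each model of
# order n contributes zero/one counts (c0, c1) from its context, and the
# counts are weighted by n^2 before summing.

def combine_counts(models):
    """models: list of (n, c0, c1) tuples; returns P(next bit = 1)."""
    num = 0.0  # weighted sum of one-counts
    den = 0.0  # weighted sum of all counts
    for n, c0, c1 in models:
        w = n * n            # higher-order contexts get more weight
        num += w * c1
        den += w * (c0 + c1)
    return num / den if den > 0 else 0.5  # no evidence -> 1/2

# A long context with few counts can outweigh a short one with many counts:
p = combine_counts([(2, 50, 50),    # order-2 "mo_": many counts, p = 0.5
                    (19, 0, 3)])    # order-19: only 3 counts, all ones
# p is pulled well above 0.5 by the n^2 weighting
```

With equal weights the order-2 statistics would dominate; the n^2 weighting lets the sparse but more specific order-19 context win.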
When the actual bit b is observed, the weights are adjusted by gradient descent of coding cost in weight space to reduce the prediction error b - p, as follows:

  w_i := w_i + L stretch(p_i) (b - p)

where L is the learning rate, typically 0.001 to 0.005. Again, this can be improved by using multiple weight tables indexed by a low order context. Or you can use multiple neural networks indexed by different order contexts and combine them by linear averaging or another neural network.

In PAQ9, I use chains of 2-input neural networks, where one input is the previous prediction from the next lower context order and the other input is fixed. The weight table is selected by the bit history in the next higher context. This method is still experimental. It works well for simple n-gram models but worse than PAQ8 when there are large numbers of approximately equally good predictions to combine, such as when semantic (cat ~ mouse) and other contexts are added.

-- Matt Mahoney, [EMAIL PROTECTED]
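P.S. The logistic mixing and weight update above can be sketched like this (a minimal sketch, assuming a single weight vector with no per-context weight tables; the names and initial weight values are my own, not from PAQ):

```python
import math

def stretch(p):
    """stretch(x) = ln(x / (1 - x)), the inverse of squash."""
    return math.log(p / (1.0 - p))

def squash(x):
    """squash(x) = 1 / (1 + exp(-x)), bounds the output to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def mix(probs, weights):
    """Combine model predictions p_i: p = squash(SUM w_i stretch(p_i))."""
    return squash(sum(w * stretch(pi) for w, pi in zip(weights, probs)))

def update(probs, weights, p, bit, L=0.002):
    """After observing bit b, apply w_i := w_i + L stretch(p_i) (b - p)."""
    for i, pi in enumerate(probs):
        weights[i] += L * stretch(pi) * (bit - p)

probs = [0.9, 0.2]        # two models' predictions for the next bit
weights = [0.3, 0.3]
p = mix(probs, weights)
update(probs, weights, p, bit=1)
# observing a 1 raises the weight of the model that predicted 0.9
# and lowers the weight of the model that predicted 0.2
```

Note that because stretch(p_i) is the mixer's input, a model whose prediction was near 0 or 1 gets a proportionally larger weight adjustment, matching the greater confidence implicitly given to extreme probabilities.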
