Byte- and bit-level prediction don't affect the compression ratio in theory, because any byte prediction can be factored into 8 bit predictions by the chain rule: p(ab) = p(a)p(b|a). Where it makes a difference is speed. With a range coder you have to split the range into 2 parts to encode a bit or 256 parts to encode a byte, in proportion to the probability distribution in either case. To normalize the distribution you have to add up the predictions and divide by the sum so they total 1; for a byte, that's 256 steps. Then, to find the right piece of the range, you have to add up the probabilities of all symbols that are lexicographically less than the byte you are encoding, which is 128 steps on average. For bit encoding it's just 1 step, repeated 8 times.
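To make the chain-rule equivalence concrete, here is a minimal sketch (the helper name and the uniform random distribution are my own, not from PAQ) that factors a 256-symbol distribution into 8 conditional bit probabilities and checks that their product equals the byte probability, so the ideal code length is the same either way:

```python
import random

def byte_to_bit_predictions(p_byte, byte):
    """Factor a 256-symbol distribution into 8 conditional bit
    probabilities for `byte` (MSB first) using the chain rule."""
    lo, hi = 0, 256              # interval of still-possible symbols
    bit_probs = []
    for i in range(7, -1, -1):
        mid = (lo + hi) // 2
        total = sum(p_byte[lo:hi])
        p1 = sum(p_byte[mid:hi]) / total   # P(bit = 1 | earlier bits)
        bit = (byte >> i) & 1
        bit_probs.append(p1 if bit else 1.0 - p1)
        lo, hi = (mid, hi) if bit else (lo, mid)
    return bit_probs

# The product of the 8 conditional bit probabilities telescopes to
# the byte probability, so the coded length is identical.
random.seed(0)
p = [random.random() for _ in range(256)]
s = sum(p)
p = [x / s for x in p]
b = 137
prod = 1.0
for q in byte_to_bit_predictions(p, b):
    prod *= q
assert abs(prod - p[b]) < 1e-12
```

Note that computing each conditional bit probability this way still sums over the byte distribution, which is the speed cost the paragraph above describes; a real bit-level model like PAQ predicts each bit directly instead.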
Byte prediction isn't necessarily slower, because most of the computation is modeling, not coding. For bit-level prediction as in PAQ, you evaluate the model 8 times per byte, each time with one additional bit of context, and update the model after each prediction. Most LLMs predict tokens from a vocabulary in the tens of thousands, so the difference is even greater. But it still might be faster. The prediction code is highly parallel, and even some of the encoding steps can be parallelized, for example by using an addition tree to reduce byte encoding to the same number of steps as encoding 8 bits.

The PAQ/ZPAQ components like CM, ICM, ISSE, SSE, and MIX all operate on bit predictions. Each component in the list takes a context and possibly some bit predictions from earlier components and outputs a new prediction; the last component in the list produces the final prediction to be encoded. For byte or token prediction you would need a new set of components that output a vector of probabilities instead of a single number. That would mean 256 times more computation and memory for byte prediction, or tens of thousands of times more for tokens. But that is what neural language models do. LSTMs and transformers can speed up this output step, at the cost of some compression, by setting all but the top few elements to 0.

-- Matt Mahoney, [email protected]

On Tue, Dec 9, 2025, 1:47 PM <[email protected]> wrote:

> Do you know how much the score falls if you change an algorithm's type of
> prediction from byte to bit? For example if an algorithm compresses 100,000
> bytes to 23,000 using byte prediction, will simply changing the same
> algorithm to do bit prediction score 21,500 bytes, or maybe just 22,500
> bytes, or maybe no improvement? How much?
>
> Also how much for SSE? (or ISSE).
> And how much for updating mixer weights based on error feedback?
------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/Tf0bedfcd44454678-M3b659ceca81ead1ef1dbcdc9
