I started the Large Text Compression Benchmark in 2006
(https://mattmahoney.net/dc/text.html) with the claim that all you
need to pass the Turing test is text prediction, which you can measure
with compression. Both the benchmark and the Hutter prize use the same
1 GB text file (enwik9), with the goal of encouraging research in
language modeling. Those results are now paying off, with the top
compressors using neural networks just like the LLMs that have been
released in the last year. These are worth studying if you plan to do
any actual development work in AI.

The top 2 programs are nncp and cmix. nncp is described in two papers
linked on the benchmark page. The first paper describes two versions
using LSTM and transformer networks, with the LSTM performing slightly
better. Later versions of nncp improved the transformer model, which
now takes the top spot on LTCB with a compression ratio of 0.1072. It
uses a 6-layer transformer with 200M parameters. The input is the
current token and the output is a vector of probabilities over a 16K
vocabulary for the next token. It takes 3 days to compress or
decompress on a GeForce RTX 3090 GPU with 10,496 CUDA cores.
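
Since a prediction is scored by the code length of whatever token
actually comes next, better probability estimates translate directly
into fewer compressed bits. Here is a minimal sketch of that link (the
vocabulary and probabilities are made up for illustration; this is not
nncp code):

  import math

  # Hypothetical predicted distribution over a tiny vocabulary for the
  # next token. nncp outputs a probability for each of 16K tokens, and
  # the arithmetic coder spends about -log2(p) bits on whichever token
  # actually occurs.
  probs = {"the": 0.50, "a": 0.30, "cat": 0.15, "dog": 0.05}

  def code_length_bits(token, probs):
      """Ideal arithmetic coding cost of the observed token in bits."""
      return -math.log2(probs[token])

  for tok in probs:
      print(tok, round(code_length_bits(tok, probs), 3))
  # Better predictions (higher p for the token that occurs) mean fewer
  # bits, which is why compression ratio measures prediction quality.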

Second place is cmix, which doesn't need a GPU, but takes 8 days on a
Core i7-7700 with 32 GB of memory to achieve a ratio of 0.1098.
fast-cmix is a modified version that won the Hutter prize with 0.1137,
meeting the constraints of 10 MB memory and 1.5 days CPU time on my
i7-1165G7 Lenovo laptop. It is a tuned version of starlit, which won
an earlier Hutter prize by sorting the Wikipedia articles in the test
file enwik9 by mutual information. cmix uses a PAQ-based model that
combines the predictions of a large number of context models using
simple 2-layer neural networks. To save memory it uses a large PPM
model, which predicts at the byte level from the longest matching
context for each possible outcome. This is more memory efficient than
the bit-level predictors used in PAQ. cmix preprocesses the input by
mapping words to 1-3 byte tokens using a fixed 80K dictionary, similar
to xwrt and drt. The dictionary is organized to group similar words
together, like "mother" and "father", to allow partial bitwise
contexts.

The nncp papers hint at possible improvements. Ideally a neural
network should use about one parameter per bit of compressed training
data, which for enwik9 (roughly 10^9 bits compressed) means about 1
billion parameters. It could also use a larger vocabulary. The paper
notes that the network learns slowly in spite of making 20 training
passes every 64 tokens, which causes enwik8 compression (the first 100
MB) to do worse than expected. Context models like those in PAQ and
cmix solve this problem, but they lack the arbitrarily deep feature
hierarchies that allow neurons to represent highly abstract concepts.
Natural language is completely learnable from unlabeled training data,
starting with the simplest concepts: neurons representing letters or
phonemes, then word segmentation rules, words, semantics, and
grammatical categories and structures, in the same order that children
learn these features. Both nncp and cmix instead use fixed
dictionaries to convert the text to a stream of tokens to be modeled.

I am doing experiments on learning the rules for tokenization. Back in
2000 I experimented with finding word boundaries in text without
spaces. Boundaries occur where there is low mutual information across
them. With an order 5 model, this predicts the missing spaces with
about 75% accuracy. In my present experiments, I am using byte pair
encoding. Start with a dictionary of 255 codes each representing one
byte, plus one code for literal strings with no matching symbols. Find
the pair of symbols that could encode the most characters and make it
a new symbol, replacing the least useful code. Repeat until there is
no improvement, then code the input again and replace more symbols
until the output stops getting smaller.
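
Here is a stripped-down sketch of that greedy loop. The real
experiment works with a fixed 256-entry dictionary and replaces the
least useful code on each pass; this toy version just collects merges
until no pair helps (the round limit and sample text are arbitrary):

  from collections import Counter

  def bpe_sketch(data: bytes, rounds: int = 100):
      """Toy byte pair encoding: repeatedly merge the adjacent pair of
      symbols that would encode the most bytes."""
      seq = [bytes([b]) for b in data]   # start with one symbol per byte
      merges = []
      for _ in range(rounds):
          counts = Counter(zip(seq, seq[1:]))
          if not counts:
              break
          # score a pair by the total number of bytes it would cover
          (a, b), n = max(counts.items(),
                          key=lambda kv: kv[1] * (len(kv[0][0]) + len(kv[0][1])))
          if n < 2:
              break                      # no pair helps any more
          merged = a + b
          merges.append(merged)
          out, i = [], 0                 # replace every occurrence of the pair
          while i < len(seq):
              if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                  out.append(merged)
                  i += 2
              else:
                  out.append(seq[i])
                  i += 1
          seq = out
      return merges, seq

  merges, encoded = bpe_sketch(b"the cat sat on the mat and the hat")
  print([m.decode() for m in merges])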

Byte pair encoding by itself compresses, but to make the output more
compressible it is important that the symbol strings represent
independent semantic concepts, such as words, digits, or punctuation
characters. To achieve that, I require that all of the characters
within a symbol come from the same set (see the sketch after the
list). These sets are:

Lower case letters a-z and & # ; (to allow Wikipedia markup like &lt; for <).
@ (to indicate that the next lower case letter should be capitalized).
Digits 0-9.
White space and < / > (to encode XML markup).
All other characters are each in their own set.
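
A minimal sketch of that rule, assuming the class labels "word",
"digit", and "markup" as my own names for the sets above:

  import string

  def char_class(c: str) -> str:
      """Label each character with its merge class."""
      if c in string.ascii_lowercase or c in "&#;":
          return "word"        # letters plus & # ; for entities like &lt;
      if c in string.digits:
          return "digit"
      if c.isspace() or c in "</>":
          return "markup"      # white space and XML tag punctuation
      return "other:" + c      # every remaining character is its own set

  def can_merge(a: str, b: str) -> bool:
      """Allow a merge only if every character shares one class."""
      return len({char_class(c) for c in a + b}) == 1

  print(can_merge("th", "e"))     # True: all lower case letters
  print(can_merge("19", "9"))     # True: all digits
  print(can_merge("the", "1"))    # False: mixes letters and digits
  print(can_merge(">\n  ", "<"))  # True: white space and < / >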

Encoding and decoding enwik9 takes about a minute. I tested the output
with 5 different compression algorithms that don't have text-specific
models. These are:

zip -9 (LZ77 and Huffman coding. Repeated strings are coded as
pointers to the previous occurrence. -9 means max compression).

7zip (LZ77 and arithmetic coding with a larger window).

bsc -b100 -T (BWT: suffix sorting followed by a low order, fast
adapting model, with a block size of 100 MB).

ppmd -m256 -o16 -r1 (PPM byte level prediction and arithmetic coding.
Options for max compression using 256 MB memory, contexts up to 16
bytes, and partially rebuilding the statistics tree when out of
memory).

zpaq -ms7ci1.1.1.1.2am (context mixing with an ICM-ISSE chain of
orders 0, 1, 2, 3, 4, 6, and a match model that predicts the next bit
following the last context match of length 8 or longer, all mixed
using a neural network selected by the order 0 context. An ICM maps
the context hash to a bit history and then to a bit prediction. An
ISSE mixes the previous prediction with a constant, using the context
hash to select a pair of weights).

Here are the results.

Baseline enwik9
 178,051,852 enwik9.zpaq
 181,583,242 enwik9.bsc
 184,909,536 enwik9.pmd
 230,135,777 enwik9.7z
 322,592,218 enwik9.zip
1,000,000,000 enwik9

Byte pair encoded.
 166,471,095 x.zpaq
 176,420,366 x.bsc
 178,037,227 x.pmd
 214,109,644 x.7z
 298,750,012 x.zip
 657,627,700 x
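
The compression ratios follow by dividing by the original 10^9 bytes;
a quick check of the gain from byte pair encoding (sizes copied from
the tables above):

  # Compression ratio = compressed size / original enwik9 size (10^9 bytes).
  baseline = {"zpaq": 178_051_852, "bsc": 181_583_242, "ppmd": 184_909_536,
              "7z": 230_135_777, "zip": 322_592_218}
  bpe      = {"zpaq": 166_471_095, "bsc": 176_420_366, "ppmd": 178_037_227,
              "7z": 214_109_644, "zip": 298_750_012}

  for name in baseline:
      r0 = baseline[name] / 1_000_000_000
      r1 = bpe[name] / 1_000_000_000
      print(f"{name}: {r0:.4f} -> {r1:.4f} ({100 * (1 - r1 / r0):.1f}% smaller)")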

Here is the dictionary it built. Each number is the number of bytes
encoded by the symbol. This is after applying the capitalization
model. Encoding "A" as "@a" etc. expands the input by 3.8% but reduces
the overall size.

0 5211047 "
"
1 4373548 "

"
2 1167495 "
  "
3 117733608 " "
4 3606580 "  "
5 2931025 "&amp;"
6 4118380 "&gt;"
7 4096424 "&lt;"
8 1203660 "&lt;td&gt;"
9 11245188 "&quot;"
10 4976940 "''"
11 2875488 "'''"
12 2521806 "("
13 2523537 ")"
14 2490470 "*"
15 7978922 ","
16 3542288 "-"
17 9578832 "."
18 2689903 "/"
19 4211470 "0"
20 4036090 "1"
21 1301752 "18"
22 2700718 "19"
23 3950090 "2"
24 2313888 "200"
25 2736971 "3"
26 2579245 "4"
27 2596182 "5"
28 2372995 "6"
29 2005136 "7"
30 1792449 "8"
31 1889329 "9"
32 3536470 ":"
33 3714616 "</"
34 1533437 "="
35 2719548 "=="
36 2100736 ">"
37 4937064 ">
        <"
38 11365812 ">
      <"
39 2434260 ">
      </"
40 5119065 ">
    <"
41 1947408 ">
    </"
42 1460556 ">
  </"
43 41512171 "@"
44 19790976 "[["
45 19793156 "]]"
46 7868460 "a"
47 2253310 "ac"
48 1917990 "ad"
49 2411025 "age"
50 6215072 "al"
51 1146796 "also"
52 1408732 "am"
53 2096912 "american"
54 5741346 "an"
55 9907014 "and"
56 4131752 "ar"
57 3551010 "are"
58 3657708 "as"
59 3654828 "at"
60 2307250 "ation"
61 2068732 "b"
62 2246596 "ba"
63 4096522 "be"
64 1856768 "bl"
65 2250030 "bo"
66 1784450 "br"
67 1696670 "bu"
68 1661308 "by"
69 4447960 "c"
70 4783624 "ca"
71 3377360 "category"
72 3413858 "ce"
73 2556450 "census"
74 1331328 "cent"
75 4422008 "ch"
76 2755628 "ci"
77 4664290 "co"
78 3465348 "com"
79 2752603 "comment"
80 1073664 "cont"
81 5374974 "contributor"
82 1448808 "county"
83 2159750 "ct"
84 1639685 "ction"
85 7228135 "d"
86 1698436 "da"
87 5322652 "de"
88 1425969 "der"
89 3671270 "di"
90 1295300 "ding"
91 8313950 "e"
92 2033368 "ea"
93 4077406 "ed"
94 1506072 "el"
95 1415286 "em"
96 3160132 "en"
97 2125554 "ent"
98 4009852 "er"
99 4033320 "es"
100 3449798 "f"
101 1679906 "fa"
102 2191754 "fe"
103 2722322 "fi"
104 5207685 "for"
105 2664300 "from"
106 4533981 "g"
107 1814576 "ga"
108 2976462 "ge"
109 1615233 "ght"
110 2039118 "gr"
111 2393602 "h"
112 2643402 "ha"
113 1485180 "have"
114 2985598 "he"
115 1823524 "hi"
116 1982520 "his"
117 2019822 "ho"
118 1008135 "household"
119 1102920 "households"
120 1653396 "http"
121 5612158 "i"
122 1911778 "ia"
123 2482484 "ic"
124 3937786 "id"
125 2061248 "il"
126 1289110 "image"
127 11788674 "in"
128 3959475 "ing"
129 1121580 "inter"
130 1706220 "ion"
131 5791098 "is"
132 4533192 "it"
133 1693526 "j"
134 3443538 "k"
135 1935802 "ke"
136 1074780 "king"
137 1676720 "km&amp;sup"
138 4688188 "l"
139 3642740 "la"
140 1879684 "land"
141 4808666 "le"
142 5253150 "li"
143 3271514 "ll"
144 3282854 "lo"
145 2545420 "ly"
146 4316302 "m"
147 4111580 "ma"
148 1119480 "males"
149 2319318 "man"
150 1586886 "mar"
151 1228936 "mber"
152 4692548 "me"
153 1968364 "ment"
154 3559026 "mi"
155 1624740 "mi&amp;sup"
156 3325900 "mo"
157 6079102 "n"
158 2871918 "na"
159 1174776 "national"
160 1646913 "nce"
161 2328648 "nd"
162 4591032 "ne"
163 3307664 "ng"
164 2304930 "ni"
165 3850648 "no"
166 1935778 "ns"
167 2174184 "nt"
168 6369810 "o"
169 10290952 "of"
170 1882854 "ol"
171 4176046 "on"
172 1353369 "one"
173 4345328 "or"
174 1680510 "other"
175 2924346 "ou"
176 1277936 "over"
177 5716530 "p"
178 2438090 "pa"
179 2225092 "page"
180 1268168 "part"
181 3636874 "pe"
182 2404846 "pl"
183 2431580 "po"
184 2702850 "population"
185 1053200 "port"
186 1249800 "pres"
187 1996080 "preserve"
188 2405145 "pro"
189 7254014 "r"
190 4395294 "ra"
191 9660304 "re"
192 3926368 "revision"
193 5185568 "ri"
194 3029684 "ro"
195 1910144 "rs"
196 1231892 "rt"
197 11133241 "s"
198 1587710 "sa"
199 1517732 "sc"
200 5074948 "se"
201 2202958 "sh"
202 3748654 "si"
203 2194206 "so"
204 1044084 "some"
205 1752954 "sp"
206 1472240 "space"
207 2045926 "ss"
208 7298282 "st"
209 1140720 "states"
210 2392062 "su"
211 6344108 "t"
212 2786192 "ta"
213 3528818 "te"
214 1681107 "ted"
215 3591453 "ter"
216 2127476 "text"
217 3669638 "th"
218 2487472 "that"
219 26846655 "the"
220 1700620 "there"
221 1412172 "this"
222 4162324 "ti"
223 4382991 "timestamp"
224 3851636 "tion"
225 2792300 "title"
226 7878548 "to"
227 2647202 "tr"
228 1168294 "tt"
229 1495056 "ty"
230 6851023 "u"
231 1791650 "ul"
232 2941346 "un"
233 1188445 "under"
234 1237356 "united"
235 3364592 "ur"
236 2702458 "us"
237 1763349 "use"
238 3288768 "username"
239 3614588 "ve"
240 3622539 "ver"
241 2596844 "vi"
242 1088152 "ving"
243 2913531 "w"
244 1962994 "wa"
245 2598747 "was"
246 2613848 "we"
247 1694874 "wh"
248 1790215 "which"
249 1656796 "wi"
250 3236908 "with"
251 1244163 "wor"
252 663000 "ww"
253 6813940 "y"
254 4907106 "|"
255 18301782 - literal bytes.
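
The capitalization model mentioned above is a simple reversible
transform. A minimal sketch of the idea (my own illustration, not the
exact code used; it ignores escaping any literal "@" in the input):

  def encode_caps(text: str) -> str:
      """Replace each upper case letter with '@' plus its lower case
      form, so the dictionary only has to model lower case words."""
      return "".join("@" + c.lower() if c.isupper() else c for c in text)

  def decode_caps(text: str) -> str:
      out, i = [], 0
      while i < len(text):
          if text[i] == "@" and i + 1 < len(text):
              out.append(text[i + 1].upper())   # undo the @ marker
              i += 2
          else:
              out.append(text[i])
              i += 1
      return "".join(out)

  s = "The United States"
  print(encode_caps(s))                    # "@the @united @states"
  assert decode_caps(encode_caps(s)) == s  # round trip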

Byte pair encoding learns the XML and Wikipedia markup quite well. For
example, "&lt;td&gt;" represents <td>, the tag that marks a table data
cell. The file is in XML format as outlined below, with the article
text in the <text> tag. The characters < " > are written as &lt;
&quot; &gt; so the XML parser is not confused. In addition to learning
an efficient way to encode words with fewer symbols, byte pair
encoding finds an efficient encoding of the XML and of markup like
[[link to article]], ===Level 3 Heading===, '''bold text''', and
&lt;ref&gt; for a numbered reference.

  <page>
    <title>...</title>
    <id>...</id>                        (decimal)
    <revision>
      <id>...</id>                      (decimal)
      <restrictions>...</restrictions>  (optional)
      <timestamp>...</timestamp>        (format yyyy-mm-ddThh:mm:ssZ)
      <contributor>
        <ip>...</ip>                    (optional)
        <username>...</username>        (if no <ip>)
        <id>...</id>                    (if no <ip>, 1-6 digits)
      </contributor>
      <minor />                         (optional)
      <comment>...</comment>            (optional, may span lines)
      <text xml:space="preserve">...</text>  (may span lines)
    </revision>
  </page>


-- 
-- Matt Mahoney, [email protected]
