I separated the effects of XML parsing and article reordering in cmix for the Hutter prize. Recall that I first found a baseline for enwik9 compression using generic compression models without specializations for text.
Baseline enwik9:

  178,051,852 enwik9.zpaq -ms7ci1.1.1.1.2am (context mixing)
  181,583,242 enwik9.bsc -b100 -T (Burrows-Wheeler transform)
  184,909,536 enwik9.pmd -m256 -o16 -r1 (PPM)
  230,135,777 enwik9.7z (LZ77 + arithmetic)
  322,592,218 enwik9.zip -9 (LZ77 + Huffman)
1,000,000,000 enwik9

Next, enwik9 preprocessed with my small dictionary encoder, built using byte pair encoding. Words are restricted to characters of the same type: either lowercase letters (plus & and ;), digits, the @ which marks the next letter as uppercase, white space (plus < / >), or single punctuation characters, possibly repeated. Dictionary codes are single bytes, except that 255 is used to encode a literal of length 1-64 or any UTF-8 character. This improved compression in every case.

166,044,757 x.zpaq (-12 MB from baseline)
176,252,128 x.bsc (-5 MB)
177,765,678 x.pmd (-7 MB)
213,991,893 x.7z (-17 MB)
298,426,680 x.zip (-24 MB)
651,316,380 x

The cmix v20 preprocessor used in the last two Hutter prize winners parses the ~243,426 XML article headers and encodes them separately from the text. The articles are reordered according to a list of numbers in the file .new_article_order. The list is not needed for decompression because the headers contain an <id> field with numbers in ascending order, allowing them to be sorted. The preprocessor also does some Wikipedia-specific text substitutions to make the file more like ordinary text. It decodes &amp; &lt; &quot; &gt; to & < " > and swaps [[ ]] (for article links) with [ ] (for website links, which are less common). This by itself improves compression. The preprocessor output is .ready4cmix.
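The substitutions above can be sketched as follows. This is a minimal illustration, not cmix's actual code; the function name and placeholder bytes are my own. The one subtlety is that swapping [[ with [ (and ]] with ]) must be a true exchange, done through a placeholder so the two replacement directions do not clobber each other, which keeps the transform invertible.

```python
# Sketch of the cmix-style Wikipedia text substitutions described above
# (hypothetical helper, not the actual cmix preprocessor).
def wiki_substitute(text: str) -> str:
    # Decode entities to literal characters. &amp; is decoded last so that
    # a sequence like "&amp;lt;" correctly becomes the literal text "&lt;".
    for entity, ch in (("&lt;", "<"), ("&gt;", ">"),
                       ("&quot;", '"'), ("&amp;", "&")):
        text = text.replace(entity, ch)
    # Exchange [[ ]] (article links) with [ ] (website links) via
    # placeholder bytes so the two directions do not interfere.
    text = (text.replace("[[", "\x00").replace("[", "[[").replace("\x00", "[")
                .replace("]]", "\x01").replace("]", "]]").replace("\x01", "]"))
    return text
```

Applying the same function again does not undo it, but the inverse is the same idea with the replacement pairs reversed.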
162,899,592 ready4cmix.zpaq (-16 MB from baseline)
169,660,387 ready4cmix.pmd (-15 MB)
171,141,389 ready4cmix.bsc (-10 MB)
208,166,180 ready4cmix.7z (-22 MB)
296,085,159 ready4cmix.zip (-26 MB)
934,188,796 ready4cmix

After my small dictionary encoder:

153,927,354 x.zpaq (-25 MB from baseline)
164,584,921 x.pmd (-20 MB)
168,483,120 x.bsc (-13 MB)
196,925,172 x.7z (-34 MB)
277,211,683 x.zip (-45 MB)
646,368,233 x

My next experiment was to separate the effects of article reordering from the rest of the XML processing. I wrote a program that reorders the articles by matching the titles in .ready4cmix but makes no other changes. My small dictionary encoder already finds symbols for &amp; &quot; &lt; &gt; [[ ]] [ ], so any difference should be mainly due to the header preprocessing. Also, the output of my reordering program includes the offset and length of each article as decimal numbers at the beginning of the article, which makes the output 3.5 MB larger (probably 2 MB compressed). Here are the results of reordering only:

166,309,461 x.zpaq (+3.5 MB from cmix)
172,802,959 x.pmd (+3.2 MB)
173,870,086 x.bsc (+2.7 MB)
215,778,398 x.7z (+7.7 MB)
309,881,543 x.zip (+13 MB)
1,003,566,215 x (+3.5 MB from baseline, +69 MB from cmix)

After my small dictionary encoding:

156,431,185 x1.zpaq (+2.5 MB from cmix + dictionary)
167,210,773 x1.pmd (+2.7 MB)
170,550,586 x1.bsc (+2.1 MB)
201,639,890 x1.7z (+4.7 MB)
285,309,381 x1.zip (+8.1 MB)
654,944,554 x1 (+8.6 MB)

The reason all of this works is that these compressors are memory constrained: they forget older statistics, so moving related sections closer together helps.

-- Matt Mahoney, [email protected]

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/Tdc371ce11a040352-Mc3567e16f3aa048c236c3ce0
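The reason the reordering permutation costs nothing to transmit can be sketched as below. This is my own illustration under the assumption stated in the post: each article header carries an <id> element whose values ascend in the original file order, so the decompressor can recover that order simply by sorting, with no stored list of numbers.

```python
import re

# Hypothetical sketch: restore the original article order by sorting on the
# <id> field in each header, as described above. Not the actual cmix code.
def restore_order(articles: list[str]) -> list[str]:
    def page_id(article: str) -> int:
        m = re.search(r"<id>(\d+)</id>", article)
        return int(m.group(1))  # ids ascend in the original file order
    return sorted(articles, key=page_id)
```

Because the ids are already in the data, the compressor is free to pick any ordering it likes (here, one that groups related articles) and the permutation travels for free.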
