I separated the effects of XML parsing and article reordering in cmix for the Hutter prize. Recall that I first found a baseline for enwik9 compression using generic compression models without specializations for text.
Baseline enwik9:

  178,051,852 enwik9.zpaq -ms7ci1.1.1.1.2am (context mixing)
  181,583,242 enwik9.bsc -b100 -T (Burrows-Wheeler transform)
  184,909,536 enwik9.pmd -m256 -o16 -r1 (PPM)
  230,135,777 enwik9.7z (LZ77 + arithmetic)
  322,592,218 enwik9.zip -9 (LZ77 + Huffman)
1,000,000,000 enwik9

Next, enwik9 preprocessed with my small dictionary encoder, built using byte pair encoding. Words are restricted to characters of the same type: either lowercase letters (plus & and ;), digits, the @ which marks the next letter as uppercase, white space (plus < / >), or single punctuation characters, possibly repeated. Dictionary codes are single bytes, except that 255 is used to encode a literal of length 1-64 or any UTF-8 character. This improved compression in every case.

166,044,757 x.zpaq (-12 MB from baseline)
176,252,128 x.bsc (-5 MB)
177,765,678 x.pmd (-7 MB)
213,991,893 x.7z (-17 MB)
298,426,680 x.zip (-24 MB)
651,316,380 x

The cmix v20 preprocessor used in the last two Hutter prize winners parses the ~243,426 XML article headers and encodes them separately from the text. The articles are reordered according to a list of numbers in the file .new_article_order. The list is not needed for decompression because the headers contain an <id> field with numbers in ascending order, allowing them to be sorted. The preprocessor also does some Wikipedia-specific text substitutions to make the file more like ordinary text. It decodes &amp; &lt; &quot; &gt; to & < " > and swaps [[ ]] (for article links) with [ ] (for website links, which are less common). This by itself improves compression. The preprocessor output is .ready4cmix.
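The substitutions above can be sketched as follows. This is a minimal illustration, not cmix's actual code; the function name and placeholder bytes are my own. The one subtlety is that swapping [[ with [ (and ]] with ]) must be a true exchange, done through a placeholder so the two replacement directions do not clobber each other, which keeps the transform invertible.

```python
# Sketch of the cmix-style Wikipedia text substitutions described above
# (hypothetical helper, not the actual cmix preprocessor).
def wiki_substitute(text: str) -> str:
    # Decode entities to literal characters. &amp; is decoded last so that
    # a sequence like "&amp;lt;" correctly becomes the literal text "&lt;".
    for entity, ch in (("&lt;", "<"), ("&gt;", ">"),
                       ("&quot;", '"'), ("&amp;", "&")):
        text = text.replace(entity, ch)
    # Exchange [[ ]] (article links) with [ ] (website links) via
    # placeholder bytes so the two directions do not interfere.
    text = (text.replace("[[", "\x00").replace("[", "[[").replace("\x00", "[")
                .replace("]]", "\x01").replace("]", "]]").replace("\x01", "]"))
    return text
```

Applying the same function again does not undo it, but the inverse is the same idea with the replacement pairs reversed.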
162,899,592 ready4cmix.zpaq (-16 MB from baseline)
169,660,387 ready4cmix.pmd (-15 MB)
171,141,389 ready4cmix.bsc (-10 MB)
208,166,180 ready4cmix.7z (-22 MB)
296,085,159 ready4cmix.zip (-26 MB)
934,188,796 ready4cmix

After my small dictionary encoder:

153,927,354 x.zpaq (-25 MB from baseline)
164,584,921 x.pmd (-20 MB)
168,483,120 x.bsc (-13 MB)
196,925,172 x.7z (-34 MB)
277,211,683 x.zip (-45 MB)
646,368,233 x

My next experiment was to separate the effects of article reordering from the rest of the XML processing. I wrote a program that reorders the articles by matching the titles in .ready4cmix but makes no other changes. My small dictionary encoder already finds symbols for &amp; &quot; &lt; &gt; [[ ]] [ ], so any difference should be mainly due to the header preprocessing. Also, the output of my reordering program includes the offset and length of each article as decimal numbers at the beginning of the article, which makes the output 3.5 MB larger (probably 2 MB compressed). Here are the results of reordering only:

166,309,461 x.zpaq (+3.5 MB from cmix)
172,802,959 x.pmd (+3.2 MB)
173,870,086 x.bsc (+2.7 MB)
215,778,398 x.7z (+7.7 MB)
309,881,543 x.zip (+13 MB)
1,003,566,215 x (+3.5 MB from baseline, +69 MB from cmix)

After my small dictionary encoding:

156,431,185 x1.zpaq (+2.5 MB from cmix + dictionary)
167,210,773 x1.pmd (+2.7 MB)
170,550,586 x1.bsc (+2.1 MB)
201,639,890 x1.7z (+4.7 MB)
285,309,381 x1.zip (+8.1 MB)
654,944,554 x1 (+8.6 MB)

The reason all of this works is that these compressors are memory constrained: they forget older statistics, so moving related sections closer together helps.

-- Matt Mahoney, [email protected]

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/Tdc371ce11a040352-Mc3567e16f3aa048c236c3ce0
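The reason the reordering permutation costs nothing to transmit can be sketched as below. This is my own illustration under the assumption stated in the post: each article header carries an <id> element whose values ascend in the original file order, so the decompressor can recover that order simply by sorting, with no stored list of numbers.

```python
import re

# Hypothetical sketch: restore the original article order by sorting on the
# <id> field in each header, as described above. Not the actual cmix code.
def restore_order(articles: list[str]) -> list[str]:
    def page_id(article: str) -> int:
        m = re.search(r"<id>(\d+)</id>", article)
        return int(m.group(1))  # ids ascend in the original file order
    return sorted(articles, key=page_id)
```

Because the ids are already in the data, the compressor is free to pick any ordering it likes (here, one that groups related articles) and the permutation travels for free.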
