> On 18 Feb 2022, at 21:25, Guillermo Polito <guillermopol...@gmail.com> wrote:
> 
> Thanks Sven, great stuff :) 

Thanks!

This allows you to easily play/explore/experiment with certain ideas.

In the past we discussed the option of bringing source code inside the 
image. What if we applied compression?

The total size of all Object methods is about 100k:

Object allMethods sum: [ :each | each sourceCode size ].

"104633"

We can compress each individual method as an LZ4 block and see what that gives 
us.

LZ4Compressor new in: [ :compressor |
  Object allMethods sum: [ :each | | compressed |
    compressed := compressor compressBlock: each sourceCode utf8Encoded.
    compressed size ] ].

"81584"

(104633/81584) reciprocal asFloat.

"0.7797157684478129"

That is about 22% smaller, which is not a very good result. But that is to be 
expected: methods are small, and a short block compressed on its own contains 
few repeated sequences for the compressor to exploit.
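
To put a number on "methods are small", here is a rough sketch that reuses 
only the calls from the snippets above. The variable names (methods, 
averageSize, notSmaller) are mine, and the results depend on your image, so I 
am not quoting any.

```smalltalk
"Playground sketch; assumes LZ4Compressor from pharo-lz4-tools is loaded."
methods := Object allMethods.

"Average source size per method, in characters."
averageSize := ((methods sum: [ :each | each sourceCode size ]) / methods size) asFloat.

"How many methods do not get smaller at all as an individual LZ4 block."
LZ4Compressor new in: [ :compressor |
  notSmaller := methods count: [ :each | | bytes |
    bytes := each sourceCode utf8Encoded.
    (compressor compressBlock: bytes) size >= bytes size ] ].
```

Inspect averageSize and notSmaller afterwards. A high notSmaller count would 
confirm that many methods are simply too short to benefit.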

If we concatenate all source code and feed that as one big chunk to the 
compressor we get much better results.

(LZ4Compressor new compress:
  (String streamContents: [ :out |
    Object allMethods do: [ :each | out nextPutAll: each sourceCode ] ])
      utf8Encoded) size.
 
"53544" 

(104633/53544) reciprocal asFloat.

"0.5117314805080615"

Now we get an almost 50% reduction in size. But methods are independent: each 
one has to be decompressible on its own, so that is not an option. What if we 
used a dictionary, a predefined set of words/substrings that are common in 
source code?

I found a list of the 500 most common English words. Let's add some common 
selectors and globals.

IdentityBag new in: [ :bag |
  SystemNavigation default allMethods do: [ :each |
    each literals select: [ :x | x isSymbol ] thenDo: [ :x | bag add: x ] ].
  bag sortedCounts select: [ :x | x key > 100 ] ].

IdentityBag new in: [ :bag |
  SystemNavigation default allMethods do: [ :each |
    each literals select: [ :x | x isVariableBinding ] thenDo: [ :x | bag add: x key ] ].
  bag sortedCounts select: [ :x | x key > 100 ] ].

The smallest possible match in LZ4 is 4 bytes (3 letters and a space).
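
As an illustration of how the counts above could be turned into a 
space-separated string, here is a minimal sketch. The minimumSize threshold 
and the variable names are assumptions of mine, and the result need not match 
the selector list given further down.

```smalltalk
"Playground sketch: build a space-separated string of frequent selectors.
 minimumSize is a hypothetical cut-off: 3 characters plus the separating
 space give the 4 bytes that LZ4 needs for a match."
minimumSize := 3.
bag := IdentityBag new.
SystemNavigation default allMethods do: [ :each |
  each literals select: [ :x | x isSymbol ] thenDo: [ :x | bag add: x ] ].

"sortedCounts answers count -> element associations, most frequent first."
candidates := (bag sortedCounts select: [ :x | x key > 100 ]) collect: [ :x | x value ].

selectorString := String streamContents: [ :out |
  candidates
    select: [ :each | each size >= minimumSize ]
    thenDo: [ :each | out space; nextPutAll: each ].
  out space ].
```

The same approach works for the globals by swapping isSymbol for 
isVariableBinding and adding x key to the bag.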

words := Character space join:
  (((FileLocator desktop / 'en-500.csv'
      readStreamDo: [ :in |
        (NeoCSVReader on: in) addIgnoredField; addField; upToEnd ])
    collect: #first)
    select: [ :each | each size > 2 ]).

That gives 473 words. Next are 137 selectors.

selectors := ' ifTrue: assert: class assert:equals: ifTrue:ifFalse: ifNil: 
ifFalse: yourself name and: ifNotNil: traitComposition add: first deny: 
includes: isEmpty asString nextPutAll: with: isNil collect: initialize 
subclassResponsibility to:do: selector localMethodDict should:raise: theme 
notNil printString on:do: streamContents: at:ifAbsent: copy contents error: 
last model default ifNil:ifNotNil: organization skipOrReturnWith:ifSkippable: 
current parserExceptions nonEmpty select: asSymbol name: readStream 
includesKey: basicNew title: empty reject: whileTrue: keys space class: extent: 
close anySatisfy: parse:documentURI: isLocalSelector: traitSource second 
position print: whileFalse: asArray format: printOn: selectors isKindOf: 
copyFrom:to: color: shouldnt:raise: width height max: named: signal 
hasProperty: anyOne text detect:ifNone: label: ensure: ifEmpty: extent text: 
entity addAll: negated includesLocalSelector: addSelector:withMethod: 
traitDefining:ifNone: hash addSelector:on: asInteger min: translated iconNamed: 
method arguments position: withIndexDo: perform: methods delete url: 
occurrencesOf: selector: hResizing: with:with: pass notEmpty flag: values 
removeKey: fromString: classNamed: reset changed removeKey:ifAbsent: width: 
announce: repository: signal: setUp addLast: session uniqueInstance 
assert:description: asOrderedCollection compiledMethod assert:gives: '.

Finally 73 globals.

globals := ' String OrderedCollection Array Smalltalk Color Error Character 
Dictionary TraitChange ByteArray Object UIManager DateAndTime Set Form Time 
ZTimestamp Date RBParser World Protocol Duration Processor SAXHandler Display 
MetaLink HelpTopic ReflectivityExamples IdentitySet OCOpalExamples GLMTabulator 
Float SpecLayout UUID WAMimeType SystemAnnouncer STON XMLDOMParser ZnMimeType 
Transcript ZnEntity ExternalType CompiledMethod GRPlatform Semaphore FileSystem 
ReadWriteStream ZnClient WriteStream Delay CmdContextMenuActivation ZnResponse 
FileLocator IdentityDictionary Morph MCSnapshot ReflectiveMethod 
XMLValidationException Integer MCMethodDefinition Path ClyClassScope 
MCClassDefinition RBCondition MCVersionInfo MCVersion SmallInteger Cursor 
TraitedClass GoferVersionReference SortedCollection MCOrganizationDefinition 
XMLWellFormednessException '.

dictionary := (globals , words  , selectors) utf8Encoded.

This dictionary is less than 5K.

(LZ4Compressor new dictionary: dictionary) in: [ :compressor |
  Object allMethods sum: [ :each | | compressed |
    compressed := compressor compressBlock: each sourceCode utf8Encoded.
    compressed size ] ].

"69146"

(104633/69146) reciprocal asFloat.
 
"0.6608431374422983"

Now we get a reduction in size of about 34%, which is better.
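
For reference, the three measurements can be put side by side as percentage 
reductions. This is plain arithmetic on the sizes reported above; the labels 
and variable names are just for illustration.

```smalltalk
"Playground sketch: percentage reduction for each experiment,
 relative to the uncompressed total of 104633."
originalSize := 104633.
results := {
  'per method' -> 81584.
  'concatenated' -> 53544.
  'per method with dictionary' -> 69146 }.

reductions := results collect: [ :each |
  each key -> ((1 - (each value / originalSize)) * 100.0
    printShowingDecimalPlaces: 1) ].
```

That works out to roughly 22%, 49% and 34% respectively, so the small 
dictionary recovers a bit less than half of the gap between per-method and 
whole-file compression.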

I am sure that with a more carefully tuned dictionary the compression rate 
could be improved by a couple of percent. There also exist tools that can 
compute an optimal dictionary from a given input set.

Sorry for the long post, I hope at least someone found this interesting.

Sven


> Sent from my Huawei phone
> 
> 
> -------- Original message --------
> From: Sven Van Caekenberghe <s...@stfx.eu>
> Date: Fri, 18 Feb 2022 at 21:13
> To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
> Subject: [Pharo-users] [ANN] Pharo LZ4 Tools
> Hi,
> 
> Pharo LZ4 Tools (https://github.com/svenvc/pharo-lz4-tools) is an 
> implementation of LZ4 compression and decompression in pure Pharo.
> 
> LZ4 is a lossless compression algorithm that is focused on speed. It belongs 
> to the LZ77 family of byte-oriented compression schemes.
> 
> - https://en.wikipedia.org/wiki/LZ4_(compression_algorithm)
> - https://lz4.github.io/lz4/
> - https://github.com/lz4/lz4
> 
> Both the frame format 
> (https://github.com/lz4/lz4/blob/dev/doc/lz4_Frame_format.md) as well as the 
> block format (https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md) 
> are implemented. Dictionary based compression/decompression is available too. 
> The XXHash32 algorithm is also implemented.
> 
> Of course this implementation is not as fast as highly optimised native 
> implementations, but it works quite well and is readable/understandable, if 
> you like this kind of stuff. It can be useful to interact with other systems 
> using LZ4.
> 
> Sven
