Yes, this is a nice conversation.
I know that Marcus is working on slim binaries.
The idea is to compress, but I forget whether it was bytecode or trees.

Now I wonder whether the code text is worth compressing.

>> On 18 Feb 2022, at 21:25, Guillermo Polito <guillermopol...@gmail.com> wrote:
>> 
>> Thanks Sven, great stuff :) 
> 
> Thanks!
> 
> This allows you to easily play/explore/experiment with certain ideas.
> 
> In the past we discussed the option of bringing source code inside the 
> image. What if we applied compression?
> 
> The total size of all Object methods is about 100k:
> 
> Object allMethods sum: [ :each | each sourceCode size ].
> 
> "104633"
> 
> We can compress each individual method as an LZ4 block and see what that 
> gives us.
> 
> LZ4Compressor new in: [ :compressor |
>  Object allMethods sum: [ :each | | compressed |
>    compressed := compressor compressBlock: each sourceCode utf8Encoded.
>    compressed size ] ].
> 
> "81584"
> 
> (104633/81584) reciprocal asFloat.
> 
> "0.7797157684478129"
> 
> That is about 22% smaller. This is not a very good result. But that is to be 
> expected because methods are small and there is often not much to compress.
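To see how small the methods really are, here is a minimal playground sketch that looks at the size distribution. It only uses standard collection messages; the 200-character threshold is an arbitrary assumption of mine, not something from the measurements above.

```smalltalk
"Copy into an Array first: if allMethods answers a set-like collection,
 collecting sizes into it would merge equal values."
sizes := Object allMethods asArray collect: [ :each | each sourceCode size ].

"Number of methods, smallest, largest and mean source size (in characters)."
{ sizes size.
  sizes min.
  sizes max.
  ((sizes inject: 0 into: [ :sum :each | sum + each ]) / sizes size) asFloat }.

"How many methods are shorter than the (arbitrary) 200 character threshold."
sizes count: [ :each | each < 200 ].
```

If most methods fall under the threshold, there is little repeated text inside a single method for an LZ77-style compressor to find.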
> 
> If we concatenate all source code and feed that as one big chunk to the 
> compressor we get much better results.
> 
> (LZ4Compressor new compress:
>  (String streamContents: [ :out |
>    Object allMethods do: [ :each | out nextPutAll: each sourceCode ] ])
>      utf8Encoded) size.
> 
> "53544"       
> 
> (104633/53544) reciprocal asFloat.
> 
> "0.5117314805080615"
> 
> Now we get an almost 50% reduction in size. But methods are independent, so 
> that is not an option. What if we used a dictionary, a predefined set of 
> words/substrings that are common in source code?
> 
> I found a list of the 500 most common English words. Let's add some common 
> selectors and globals.
> 
> IdentityBag new in: [ :bag |
>  SystemNavigation default allMethods do: [ :each |
>    each literals select: [ :x | x isSymbol ] thenDo: [ :x | bag add: x ] ].
>  bag sortedCounts select: [ :x | x key > 100 ] ].
> 
> IdentityBag new in: [ :bag |
>  SystemNavigation default allMethods do: [ :each |
>    each literals
>      select: [ :x | x isVariableBinding ]
>      thenDo: [ :x | bag add: x key ] ].
>  bag sortedCounts select: [ :x | x key > 100 ] ].
> 
> The smallest possible match in LZ4 is 4 bytes (3 letters and a space).
> 
> words := Character space join:
>   (((FileLocator desktop / 'en-500.csv' readStreamDo: [ :in |
>       (NeoCSVReader on: in) addIgnoredField; addField; upToEnd ])
>     collect: #first)
>     select: [ :each | each size > 2 ]).
> 
> That gives 473 words. Next come 137 selectors.
> 
> selectors := ' ifTrue: assert: class assert:equals: ifTrue:ifFalse: ifNil: 
> ifFalse: yourself name and: ifNotNil: traitComposition add: first deny: 
> includes: isEmpty asString nextPutAll: with: isNil collect: initialize 
> subclassResponsibility to:do: selector localMethodDict should:raise: theme 
> notNil printString on:do: streamContents: at:ifAbsent: copy contents error: 
> last model default ifNil:ifNotNil: organization skipOrReturnWith:ifSkippable: 
> current parserExceptions nonEmpty select: asSymbol name: readStream 
> includesKey: basicNew title: empty reject: whileTrue: keys space class: 
> extent: close anySatisfy: parse:documentURI: isLocalSelector: traitSource 
> second position print: whileFalse: asArray format: printOn: selectors 
> isKindOf: copyFrom:to: color: shouldnt:raise: width height max: named: signal 
> hasProperty: anyOne text detect:ifNone: label: ensure: ifEmpty: extent text: 
> entity addAll: negated includesLocalSelector: addSelector:withMethod: 
> traitDefining:ifNone: hash addSelector:on: asInteger min: translated 
> iconNamed: method arguments position: withIndexDo: perform: methods delete 
> url: occurrencesOf: selector: hResizing: with:with: pass notEmpty flag: 
> values removeKey: fromString: classNamed: reset changed removeKey:ifAbsent: 
> width: announce: repository: signal: setUp addLast: session uniqueInstance 
> assert:description: asOrderedCollection compiledMethod assert:gives: '.
> 
> Finally, 73 globals.
> 
> globals := ' String OrderedCollection Array Smalltalk Color Error Character 
> Dictionary TraitChange ByteArray Object UIManager DateAndTime Set Form Time 
> ZTimestamp Date RBParser World Protocol Duration Processor SAXHandler Display 
> MetaLink HelpTopic ReflectivityExamples IdentitySet OCOpalExamples 
> GLMTabulator Float SpecLayout UUID WAMimeType SystemAnnouncer STON 
> XMLDOMParser ZnMimeType Transcript ZnEntity ExternalType CompiledMethod 
> GRPlatform Semaphore FileSystem ReadWriteStream ZnClient WriteStream Delay 
> CmdContextMenuActivation ZnResponse FileLocator IdentityDictionary Morph 
> MCSnapshot ReflectiveMethod XMLValidationException Integer MCMethodDefinition 
> Path ClyClassScope MCClassDefinition RBCondition MCVersionInfo MCVersion 
> SmallInteger Cursor TraitedClass GoferVersionReference SortedCollection 
> MCOrganizationDefinition XMLWellFormednessException '.
> 
> dictionary := (globals , words  , selectors) utf8Encoded.
> 
> This dictionary is less than 5K.
> 
> (LZ4Compressor new dictionary: dictionary) in: [ :compressor |
>  Object allMethods sum: [ :each | | compressed |
>    compressed := compressor compressBlock: each sourceCode utf8Encoded.
>    compressed size ] ].
> 
> "69146"
> 
> (104633/69146) reciprocal asFloat.
> 
> "0.6608431374422983"
> 
> Now we get a reduction of about 34% in size, which is better.
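The total hides how individual methods behave, so here is a rough sketch that counts how many methods actually come out smaller than their raw UTF-8 bytes when compressed with the dictionary. It assumes the `dictionary` variable from the snippet above is still bound in the same playground, and it only uses the `dictionary:` and `compressBlock:` messages already shown in this thread.

```smalltalk
"Assumption: dictionary is the ByteArray built above, still bound in this playground."
compressor := LZ4Compressor new dictionary: dictionary; yourself.
methods := Object allMethods asArray.

"Count the methods whose compressed block is smaller than the raw bytes."
shrunk := methods count: [ :each | | bytes |
  bytes := each sourceCode utf8Encoded.
  (compressor compressBlock: bytes) size < bytes size ].

"Number of methods that shrink, out of the total."
{ shrunk. methods size }.
```

Methods that do not shrink could be stored uncompressed, at the cost of a flag per method; whether that is worth it is a design choice I have not measured.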
> 
> I am sure that with a more carefully tuned dictionary the compression 
> ratio could be improved by a couple of percent. There also exist tools that 
> can compute an optimal dictionary from a given input set.
> 
> Sorry for the long post, I hope at least someone found this interesting.
> 
> Sven
> 
> 
>> -------- Original message --------
>> From: Sven Van Caekenberghe <s...@stfx.eu>
>> Date: Fri, 18 Feb 2022 at 21:13
>> To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
>> Subject: [Pharo-users] [ANN] Pharo LZ4 Tools
>> Hi,
>> 
>> Pharo LZ4 Tools (https://github.com/svenvc/pharo-lz4-tools) is an 
>> implementation of LZ4 compression and decompression in pure Pharo.
>> 
>> LZ4 is a lossless compression algorithm that is focused on speed. It belongs 
>> to the LZ77 family of byte-oriented compression schemes.
>> 
>> - https://en.wikipedia.org/wiki/LZ4_(compression_algorithm)
>> - https://lz4.github.io/lz4/
>> - https://github.com/lz4/lz4
>> 
>> Both the frame format 
>> (https://github.com/lz4/lz4/blob/dev/doc/lz4_Frame_format.md) as well as the 
>> block format (https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md) 
>> are implemented. Dictionary based compression/decompression is available 
>> too. The XXHash32 algorithm is also implemented.
>> 
>> Of course this implementation is not as fast as highly optimised native 
>> implementations, but it works quite well and is readable/understandable, if 
>> you like this kind of stuff. It can be useful to interact with other systems 
>> using LZ4.
>> 
>> Sven

