Yes, this is a nice conversation. I know that Marcus is working on slim binaries. The idea is to compress, but I forget whether it was the bytecode or the trees.
Now I wonder whether the source text is worth compressing.

>> On 18 Feb 2022, at 21:25, Guillermo Polito <guillermopol...@gmail.com> wrote:
>>
>> Thanks Sven, great stuff :)
>
> Thanks!
>
> This allows you to easily play/explore/experiment with certain ideas.
>
> In the past we discussed the option of bringing source code inside the
> image. What if we applied compression?
>
> The total size of all Object methods is about 100k:
>
> Object allMethods sum: [ :each | each sourceCode size ].
>
> "104633"
>
> We can compress each individual method as an LZ4 block and see what that
> gives us.
>
> LZ4Compressor new in: [ :compressor |
>   Object allMethods sum: [ :each | | compressed |
>     compressed := compressor compressBlock: each sourceCode utf8Encoded.
>     compressed size ] ].
>
> "81584"
>
> (104633/81584) reciprocal asFloat.
>
> "0.7797157684478129"
>
> That is about 22% smaller. This is not a very good result. But that is to be
> expected because methods are small and there is often not much to compress.
>
> If we concatenate all source code and feed that as one big chunk to the
> compressor we get much better results.
>
> (LZ4Compressor new compress:
>   (String streamContents: [ :out |
>     Object allMethods do: [ :each | out nextPutAll: each sourceCode ] ])
>       utf8Encoded) size.
>
> "53544"
>
> (104633/53544) reciprocal asFloat.
>
> "0.5117314805080615"
>
> Now we get an almost 50% reduction in size. But methods are independent and
> each one must be decompressible on its own, so that is not an option. What
> if we used a dictionary, a predefined set of words/substrings that are
> common in source code?
>
> I found a list of the 500 most common English words. Let's add some common
> selectors and globals.
>
> IdentityBag new in: [ :bag |
>   SystemNavigation default allMethods do: [ :each |
>     each literals select: [ :x | x isSymbol ] thenDo: [ :x | bag add: x ] ].
>   bag sortedCounts select: [ :x | x key > 100 ] ].
>
> IdentityBag new in: [ :bag |
>   SystemNavigation default allMethods do: [ :each |
>     each literals
>       select: [ :x | x isVariableBinding ]
>       thenDo: [ :x | bag add: x key ] ].
>   bag sortedCounts select: [ :x | x key > 100 ] ].
>
> The smallest possible match in LZ4 is 4 bytes (3 letters and a space).
>
> words := Character space join: (((FileLocator desktop / 'en-500.csv'
>   readStreamDo: [ :in | (NeoCSVReader on: in) addIgnoredField; addField;
>     upToEnd ]) collect: #first) select: [ :each | each size > 2 ]).
>
> That gives 473 words. Next are 137 selectors.
>
> selectors := ' ifTrue: assert: class assert:equals: ifTrue:ifFalse: ifNil:
> ifFalse: yourself name and: ifNotNil: traitComposition add: first deny:
> includes: isEmpty asString nextPutAll: with: isNil collect: initialize
> subclassResponsibility to:do: selector localMethodDict should:raise: theme
> notNil printString on:do: streamContents: at:ifAbsent: copy contents error:
> last model default ifNil:ifNotNil: organization skipOrReturnWith:ifSkippable:
> current parserExceptions nonEmpty select: asSymbol name: readStream
> includesKey: basicNew title: empty reject: whileTrue: keys space class:
> extent: close anySatisfy: parse:documentURI: isLocalSelector: traitSource
> second position print: whileFalse: asArray format: printOn: selectors
> isKindOf: copyFrom:to: color: shouldnt:raise: width height max: named: signal
> hasProperty: anyOne text detect:ifNone: label: ensure: ifEmpty: extent text:
> entity addAll: negated includesLocalSelector: addSelector:withMethod:
> traitDefining:ifNone: hash addSelector:on: asInteger min: translated
> iconNamed: method arguments position: withIndexDo: perform: methods delete
> url: occurrencesOf: selector: hResizing: with:with: pass notEmpty flag:
> values removeKey: fromString: classNamed: reset changed removeKey:ifAbsent:
> width: announce: repository: signal: setUp addLast: session uniqueInstance
> assert:description: asOrderedCollection compiledMethod assert:gives: '.
>
> Finally 73 globals.
>
> globals := ' String OrderedCollection Array Smalltalk Color Error Character
> Dictionary TraitChange ByteArray Object UIManager DateAndTime Set Form Time
> ZTimestamp Date RBParser World Protocol Duration Processor SAXHandler Display
> MetaLink HelpTopic ReflectivityExamples IdentitySet OCOpalExamples
> GLMTabulator Float SpecLayout UUID WAMimeType SystemAnnouncer STON
> XMLDOMParser ZnMimeType Transcript ZnEntity ExternalType CompiledMethod
> GRPlatform Semaphore FileSystem ReadWriteStream ZnClient WriteStream Delay
> CmdContextMenuActivation ZnResponse FileLocator IdentityDictionary Morph
> MCSnapshot ReflectiveMethod XMLValidationException Integer MCMethodDefinition
> Path ClyClassScope MCClassDefinition RBCondition MCVersionInfo MCVersion
> SmallInteger Cursor TraitedClass GoferVersionReference SortedCollection
> MCOrganizationDefinition XMLWellFormednessException '.
>
> dictionary := (globals , words , selectors) utf8Encoded.
>
> This dictionary is less than 5K.
>
> (LZ4Compressor new dictionary: dictionary) in: [ :compressor |
>   Object allMethods sum: [ :each | | compressed |
>     compressed := compressor compressBlock: each sourceCode utf8Encoded.
>     compressed size ] ].
>
> "69146"
>
> (104633/69146) reciprocal asFloat.
>
> "0.6608431374422983"
>
> Now we get a reduction of about 34% in size, which is better.
>
> I am sure that with a more carefully tuned dictionary the compression
> ratio could be improved by a couple of percent. There also exist tools that
> can compute an optimal dictionary from a given input set.
>
> Sorry for the long post, I hope at least someone found this interesting.
>
> Sven
>
>
>> Sent from my Huawei phone
>>
>>
>> -------- Original message --------
>> From: Sven Van Caekenberghe <s...@stfx.eu>
>> Date: Fri, 18 Feb 2022 at 21:13
>> To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
>> Subject: [Pharo-users] [ANN] Pharo LZ4 Tools
>> Hi,
>>
>> Pharo LZ4 Tools (https://github.com/svenvc/pharo-lz4-tools) is an
>> implementation of LZ4 compression and decompression in pure Pharo.
>>
>> LZ4 is a lossless compression algorithm that is focused on speed. It belongs
>> to the LZ77 family of byte-oriented compression schemes.
>>
>> - https://en.wikipedia.org/wiki/LZ4_(compression_algorithm)
>> - https://lz4.github.io/lz4/
>> - https://github.com/lz4/lz4
>>
>> Both the frame format
>> (https://github.com/lz4/lz4/blob/dev/doc/lz4_Frame_format.md) and the
>> block format (https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md)
>> are implemented. Dictionary-based compression/decompression is available
>> too. The XXHash32 algorithm is also implemented.
>>
>> Of course this implementation is not as fast as highly optimised native
>> implementations, but it works quite well and is readable/understandable, if
>> you like this kind of stuff. It can be useful to interact with other systems
>> using LZ4.
>>
>> Sven
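
For readers without a Pharo image at hand, the three measurements discussed in this thread (each method compressed on its own, all sources concatenated, and each method compressed against a preset dictionary) can be tried out in a few lines of Python. This is only a sketch under stated assumptions, not the code from the thread: Python's standard zlib module (DEFLATE) is swapped in for LZ4 because it supports preset dictionaries out of the box, and both the sample method sources and the small dictionary are invented for illustration. The absolute numbers will therefore not match the ones quoted above; only the general tendency is expected to carry over.

```python
import zlib

# Hypothetical sample "method sources", invented for this sketch.
# They are not taken from a Pharo image.
METHODS = [
    b"isEmptyOrNil\n\t^ self isNil or: [ self isEmpty ]",
    b"printOn: aStream\n\taStream nextPutAll: self printString",
    b"ifEmpty: aBlock\n\t^ self isEmpty ifTrue: [ aBlock value ] ifFalse: [ self ]",
    b"asOrderedCollection\n\t^ OrderedCollection new addAll: self; yourself",
    b"displayString\n\t^ String streamContents: [ :out | out nextPutAll: self printString ]",
]

# A tiny preset dictionary made of a few selectors and globals that are
# named in the thread. A real dictionary would be much larger.
DICTIONARY = (
    b" String OrderedCollection ifTrue: ifFalse: isEmpty isNil yourself "
    b"nextPutAll: printString streamContents: addAll: self "
)


def raw_deflate(data, zdict=None):
    """Compress data as one independent raw DEFLATE stream (no header)."""
    if zdict is None:
        compressor = zlib.compressobj(level=9, wbits=-15)
    else:
        compressor = zlib.compressobj(level=9, wbits=-15, zdict=zdict)
    return compressor.compress(data) + compressor.flush()


def raw_inflate(blob, zdict=None):
    """Inverse of raw_deflate; needs the same dictionary, if one was used."""
    if zdict is None:
        decompressor = zlib.decompressobj(-15)
    else:
        decompressor = zlib.decompressobj(-15, zdict)
    return decompressor.decompress(blob) + decompressor.flush()


def compare_sizes(methods, dictionary):
    """Return total byte counts for the three strategies plus the original."""
    return {
        "original": sum(len(m) for m in methods),
        "per_method": sum(len(raw_deflate(m)) for m in methods),
        "concatenated": len(raw_deflate(b"".join(methods))),
        "with_dictionary": sum(len(raw_deflate(m, dictionary)) for m in methods),
    }


sizes = compare_sizes(METHODS, DICTIONARY)
for name, size in sizes.items():
    # Ratio relative to the uncompressed total, as in the thread.
    print(f"{name:16} {size:5d} bytes  ratio {size / sizes['original']:.3f}")
```

The dictionary variant keeps every method independently decompressible, which is the property the concatenated variant gives up. The price is that the same dictionary has to be available when decompressing, as `raw_inflate` shows.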