On 30 May 2014, at 10:59, Clément Bera wrote:
> Hello,
>
> I like the idea, but it is not that simple.
>
> Some frameworks use different strings with the same contents as markers
> that are equal but not identical.
>
> Typically:
>
> Object>>#string1
>     ^ 'string'
>
> Object>>#string2
>     ^ 'string'
>
> Object>>#test
>     self assert: self string1 == self string1. "Answers true"
>     self assert: self string2 == self string2. "Answers true"
>     self assert: self string1 == self string2 "Answers false"
>
> Frameworks relying on that will no longer work.
>
> This kind of bug is not easy to spot: it typically crashes identity
> collections in a non-deterministic fashion.
>
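The marker pattern quoted above can be reproduced in a playground. This is only a minimal sketch, assuming a Pharo image; the variable names are made up for illustration, and `copy` is used just to guarantee two distinct instances.

```smalltalk
| marker1 marker2 registry |
"Two strings with equal contents that are nevertheless distinct objects."
marker1 := 'string' copy.
marker2 := 'string' copy.

"An identity collection keeps them apart, an equality collection does not."
registry := IdentitySet new.
registry add: marker1; add: marker2.

marker1 = marker2.                           "true, same contents"
marker1 == marker2.                          "false, different objects"
registry size.                               "2, one entry per marker"
(Set withAll: { marker1. marker2 }) size.    "1, equal strings collapse"
```

Once the two markers are merged into one instance, code that expects two entries in `registry` no longer sees what it registered.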
With an indirection (a kind of reference) that
 -> points to the string
 -> forwards everything, but does a copy-on-write on state change
 -> implements == to return false
it would work. Of course you then have the same number of objects (+1), but
they would all be very small, which saves space for large objects, especially
when applied to subgraphs.
Marcus
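
The indirection described above could be sketched roughly as follows. This is an assumption-laden illustration, not a worked-out design: the class name `StringReference`, its category, the selector `setTarget:` and the (incomplete) list of mutating selectors are all hypothetical. The sketch does not override `==`; it relies on each holder getting its own proxy instance, so identity differs anyway.

```smalltalk
| proxyClass shared a b |
"Hypothetical proxy class; name and category are made up for this sketch."
proxyClass := ProtoObject subclass: #StringReference
    instanceVariableNames: 'target owned'
    classVariableNames: ''
    poolDictionaries: ''
    category: 'Sketch-StringDeduplication'.

proxyClass compile: 'setTarget: aString
    target := aString.
    owned := false'.

"Messages ProtoObject does not understand are forwarded to the string.
 The selectors listed are treated as state changes and trigger one copy."
proxyClass compile: 'doesNotUnderstand: aMessage
    (owned not and: [ #(at:put: replaceFrom:to:with:startingAt:) includes: aMessage selector ])
        ifTrue: [ target := target copy. owned := true ].
    ^ aMessage sendTo: target'.

"Two proxies sharing one underlying string."
shared := 'hello' copy.
a := proxyClass basicNew setTarget: shared.
b := proxyClass basicNew setTarget: shared.

a == b.              "false, the proxies are distinct objects"
a at: 1 put: $j.     "first write: a switches to a private copy"
a at: 1.             "$j"
b at: 1.             "$h, b still reads the shared string"
```

Each proxy has only two slots, so the saving comes from sharing the large string behind it. A real version would need a complete list of mutating selectors, or another way to detect state changes.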
> Regards
>
>
> 2014-05-30 9:39 GMT+02:00 Philippe Marschall:
> Hi
>
> This is an idea I stole from somebody else. The assumption is that you have a
> lot of Strings in the image that are equal. We could therefore remove the
> duplicates and make all the objects refer to the same instance.
>
> However, it's not as simple as that. The main issue is that String has two
> responsibilities. The first is as an immutable value object. The second is as
> a mutable character buffer for building immutable value objects. We must not
> deduplicate the second kind. Unfortunately it's not straightforward to
> figure out which kind a string is. The approach I took is to check whether
> it contains any characters with code point 0. Another option would be to
> check whether any WriteStreams are referring to it.
> Also, since there are behavioral differences between String and Symbol
> besides #=, we must exclude Symbols (e.g. there is #'hello' and 'hello' in the
> heap and they compare #= true, but we must not make anybody who refers to
> 'hello' suddenly refer to #'hello').
>
> Anyway, here's the code; it saves about 2 MB in a fairly stock Pharo 3
> image. Sorry for the bad variable names.
>
> | b d m |
> b := Bag new.
> d := OrderedCollection new.
> m := Dictionary new.
> "count all string instances"
> String allSubInstancesDo: [ :s |
>     s isSymbol ifFalse: [
>         b add: s ] ].
> "find the ones that have no duplicates or are likely buffers"
> b doWithOccurrences: [ :s :i |
>     (i = 1 or: [ s anySatisfy: [ :c | c codePoint = 0 ] ]) ifTrue: [
>         d add: s -> i ] ].
> "remove the ones that have no duplicates or are likely buffers"
> d do: [ :a |
>     a value timesRepeat: [
>         b remove: a key ] ].
> "map all duplicate strings to their duplicates"
> String allSubInstancesDo: [ :s |
>     s isSymbol ifFalse: [
>         (b includes: s) ifTrue: [
>             | l |
>             l := m at: s ifAbsentPut: [ OrderedCollection new ].
>             l add: s ] ] ].
> "remove the duplicates"
> m keysAndValuesDo: [ :k :v |
>     | f |
>     f := v at: 1.
>     2 to: v size do: [ :i |
>         (v at: i) becomeForward: f ] ]
>
> Cheers
> Philippe
>
>
>