Re: Separate memory init file enlarges overall code size

Soeren Balko Tue, 23 Dec 2014 19:14:15 -0800

I just submitted a pull request, which extends the "allocate" function to 
accept static memory defined as a UTF-8 string, where the Unicode 
code points of the characters are the byte values: 
https://github.com/kripken/emscripten/pull/3106
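
For readers who want to see the decoding side, here is a minimal sketch of 
turning such a string back into bytes. It is not the code from the pull 
request; the function name writeBinaryString and its parameters are 
hypothetical, and a plain Uint8Array stands in for the heap.

```javascript
// Hypothetical helper, NOT the code from the pull request.
// Copies a string whose characters all have code points 0..255
// into a byte array, one character per byte.
function writeBinaryString(str, heap, offset) {
  for (var i = 0; i < str.length; i++) {
    var code = str.charCodeAt(i);
    if (code > 0xFF) {
      throw new RangeError('character at index ' + i + ' is not a byte value');
    }
    heap[offset + i] = code;
  }
  return str.length; // number of bytes written
}

// Stand-in for the heap; the real heap is much larger.
var heap = new Uint8Array(16);
writeBinaryString('Hi\0\xFF', heap, 4);
// heap[4..7] now holds 72, 105, 0, 255
```

The range check is only there to catch a mis-encoded input file early; it 
could be dropped if startup time matters more.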


In order to replace the current representation of static memory as 
Javascript arrays with compact UTF-8 strings (see my previous post), I 
created a "poor man's solution": a simple node script that runs a regular 
expression over the emscripten-generated Javascript "binary" and replaces 
all "allocate([...], ...)" calls with "allocate("...", ...)" calls. The 
resulting reduction in code size is quite noticeable, though I did not 
measure the impact on parsing times: 
https://gist.github.com/anonymous/74196a36efbb4733a6f5
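
To illustrate the idea (this is a sketch, not the script from the gist), a 
regex-based rewrite could look roughly like the following. The function name 
arraysToStrings is hypothetical, the trailing allocate arguments in the 
example are only illustrative, and for brevity every byte is written as a 
fixed-length \xYZ escape rather than the shortest possible form. It also 
assumes the array holds only non-negative decimal integers.

```javascript
// Hypothetical sketch, NOT the script from the gist.
// Rewrites allocate([72,105,0], ...) into allocate("\x48\x69\x00", ...).
function arraysToStrings(source) {
  // Match "allocate([" followed by digits, commas and whitespace up to "]".
  return source.replace(/allocate\(\[([\d,\s]*)\]/g, function (match, body) {
    var literal = body.split(',')
      .map(function (s) { return s.trim(); })
      .filter(function (s) { return s.length > 0; })
      .map(function (s) {
        // Always emit two hex digits so a following digit cannot be misread.
        var hex = (Number(s) & 0xFF).toString(16);
        return '\\x' + (hex.length < 2 ? '0' + hex : hex);
      })
      .join('');
    return 'allocate("' + literal + '"';
  });
}

// Illustrative input; the remaining arguments are passed through untouched.
var input = 'allocate([72,105,0], "i8", ALLOC_NONE, Runtime.GLOBAL_BASE);';
console.log(arraysToStrings(input));
```

Everything after the array literal is left as it was, so the rewritten call 
only works with an "allocate" that accepts a string as its first argument.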

@Alon: Obviously, that functionality should be integrated into emscripten 
itself. However, after the change to the LLVM backend, I haven't bothered 
finding my way in there. Can you please suggest where to look (or simply 
incorporate the functionality yourself, if that's a quick addition)?

Happy holidays everyone,
Soeren

On Tuesday, December 23, 2014 9:26:50 AM UTC+10, Soeren Balko wrote:
>
> Another (minor) optimization is to use the standard Javascript escapes \t, 
> \b, \f, \n, \r, and \v (2 bytes each) instead of octal sequences (3 bytes, 
> and only usable if the next character is not a digit; otherwise the 
> fixed-length, 4-byte hex encoding \xYZ must be used). 
>
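
A minimal sketch of such an escaping routine, under a few assumptions: the 
names encodeByte and encodeBytes are hypothetical, the literal is wrapped in 
double quotes (so the quote and the backslash are escaped as well, which the 
text above does not cover), and the output is parsed as non-strict code, 
since octal escapes other than a lone \0 are rejected in strict mode.

```javascript
// Hypothetical encoder, a sketch of the escaping rules discussed above.
var NAMED = { 8: '\\b', 9: '\\t', 10: '\\n', 11: '\\v', 12: '\\f', 13: '\\r' };

function isDigitByte(b) { return b >= 48 && b <= 57; } // '0'..'9'

function hex2(b) {
  var h = b.toString(16);
  return '\\x' + (h.length < 2 ? '0' + h : h);
}

// Returns the source text for one byte; "next" is the following byte (or undefined).
function encodeByte(b, next) {
  if (b === 0x22) return '\\"';                              // double quote
  if (b === 0x5C) return '\\\\';                             // backslash
  if (NAMED[b]) return NAMED[b];                             // \b \t \n \v \f \r
  if (b >= 0x20 && b <= 0x7E) return String.fromCharCode(b); // 1 byte in UTF-8
  if (b >= 0xA0) return String.fromCharCode(b);              // 2 bytes in UTF-8
  if (b < 0x20) {
    // Octal is ambiguous when a digit follows, so fall back to hex there.
    return isDigitByte(next) ? hex2(b) : '\\' + b.toString(8);
  }
  return hex2(b);                                            // 0x7F..0x9F
}

function encodeBytes(bytes) {
  var out = '';
  for (var i = 0; i < bytes.length; i++) {
    out += encodeByte(bytes[i], bytes[i + 1]);
  }
  return '"' + out + '"';
}

console.log(encodeBytes([72, 105, 0, 1, 50])); // the 1 is followed by '2', so it uses hex
```

Falling back to hex before any digit, including 8 and 9, is slightly more 
conservative than strictly needed but keeps the rule simple.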
> Generally though, I cannot confirm that the "ministr" memory 
> representation is smaller than base64. In my case, it is, in fact, larger. 
> Assuming a uniform distribution of byte values, the ministr representation 
> in UTF-8 uses:
>
> 1 byte for the 95 printable "Basic Latin" (ASCII) characters with a Unicode 
> code point between U+0020...U+007E - 37.1%
> 2 bytes for the 96 "Latin-1 Supplement" characters with a Unicode code 
> point between U+00A0...U+00FF - 37.5%
> 2 bytes for the 7 Javascript escape sequences \0, \t, \b, \f, \n, \r, \v 
> (ignoring the \0 followed by digit case) - 2.7%
> 3 bytes for the remaining 25 characters in octal representation between 
> U+0001...U+001F - 9.8%
> 4 bytes for the 33 characters in hex representation between 
> U+007F...U+009F - 12.9%
>
> So on average, we get some 1.984 bytes per character. In turn, base64 uses 
> 1.333 bytes per character (it only uses characters that take one byte in 
> UTF-8), but produces a non-human-readable memory representation. For the 
> existing int8-array representation, we get the following:
>
> 2 bytes for 10 characters in U+0000...U+0009 (one digit and one comma) - 
> 3.9%
> 3 bytes for 90 characters in U+000A...U+0063 (two digits and one comma) - 
> 35.2%
> 4 bytes for the remaining 156 characters in U+0064...U+00FF (three digits 
> and one comma) - 60.9%
>
> On average, that yields 3.57 bytes per character. 
>
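
The averages above can be rechecked with a short script. This is only a 
sketch: the function names are hypothetical, and it follows the same 
simplifications as the breakdown (uniformly distributed byte values, octal 
escapes counted as 3 bytes, no extra cost for quotes or backslashes).

```javascript
// Source bytes needed per heap byte for the escaped UTF-8 string form.
function ministrCost(b) {
  if (b >= 0x20 && b <= 0x7E) return 1;         // printable ASCII
  if (b >= 0xA0) return 2;                      // 2-byte UTF-8 sequence
  if (b === 0 || (b >= 8 && b <= 13)) return 2; // \0 \b \t \n \v \f \r
  if (b < 0x20) return 3;                       // octal escape
  return 4;                                     // \xYZ for 0x7F..0x9F
}

// Source bytes needed per heap byte for a decimal array: digits plus a comma.
function arrayCost(b) {
  return String(b).length + 1;
}

// Mean cost over all 256 byte values (uniform distribution assumed).
function average(cost) {
  var total = 0;
  for (var b = 0; b < 256; b++) total += cost(b);
  return total / 256;
}

console.log(average(ministrCost).toFixed(3)); // 1.984 (508 / 256)
console.log(average(arrayCost).toFixed(3));   // 3.570 (914 / 256)
console.log((4 / 3).toFixed(3));              // 1.333 for base64
```

Swapping the uniform loop for a histogram of a real memory image would show 
how far actual data moves these averages.
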
> Of course, real-world static memory content is often skewed towards 
> certain byte values, e.g. \0 and Latin-1 text characters. In those cases, 
> the ministr approach may yield a more compact representation than base64. 
> Other baseX approaches (notably basE91) may be worth a try, but would 
> need a potentially slow, pure Javascript-based implementation. 
>
> In the program that I looked at (ffmpeg), the static memory content also 
> seems to exhibit runs of identical byte values (often \0), which are 
> amenable to a simple RLE encoding scheme that could be layered on top of 
> the ministr encoding. Not sure if this is worth doing, as this is 
> essentially what gzip does anyway and it comes with a small runtime 
> overhead to expand the RLE-encoded sequences.
>
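
As a rough illustration of the RLE idea, here is a sketch that stores runs 
as [value, runLength] pairs. The pair format and the names rleEncode and 
rleDecode are assumptions made for this example; nothing in the thread 
fixes a concrete format.

```javascript
// Hypothetical RLE sketch: collapse runs of identical bytes into pairs.
function rleEncode(bytes) {
  var runs = [];
  for (var i = 0; i < bytes.length; ) {
    var j = i;
    while (j < bytes.length && bytes[j] === bytes[i]) j++;
    runs.push([bytes[i], j - i]); // [value, runLength]
    i = j;
  }
  return runs;
}

// Expands the pairs into "heap" starting at "offset"; returns bytes written.
function rleDecode(runs, heap, offset) {
  var p = offset;
  runs.forEach(function (run) {
    for (var k = 0; k < run[1]; k++) heap[p++] = run[0];
  });
  return p - offset;
}

var runs = rleEncode([0, 0, 0, 0, 7, 7, 65]);
// runs is [[0, 4], [7, 2], [65, 1]]
var heap = new Uint8Array(8);
rleDecode(runs, heap, 0);
```

Isolated bytes cost more in this form than in a plain string, so a real 
scheme would probably only encode runs above some minimum length.
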
> Soeren
>
> On Monday, December 22, 2014 3:13:10 PM UTC+10, Sören Balko wrote:
>>
>> I think the patch is here: https://gist.github.com/evanw/11339324
>>
>>
>> On 22 Dec 2014, at 15:11, Chad Austin <[email protected]> wrote:
>>
>> On Sun, Dec 21, 2014 at 10:22 PM, Soeren Balko <[email protected]> wrote:
>>
>>> So far, my tryout implementation is based on a script that I run using 
>>> --js-transform. It uses regular expressions to find integer arrays and 
>>> replaces them with some base64 string and a function wrapper around them to 
>>> turn them into an int8 array. I like the ministr approach as it preserves 
>>> the (printable) byte sequences (thus benefitting readability of string 
>>> literals) and apparently speeds up parsing time. If only they had provided 
>>> their escaping code for non-printable characters. 
>>>
>>
>> Here is the code I wrote for my tests: 
>> https://github.com/chadaustin/Web-Benchmarks/blob/master/meminit/meminit.py
>>
>> Evan pointed out that my code is incorrect in the case of an octal escape 
>> followed by numeric digits, but I don't think he posted his code.
>>  
>>
>>> Also, I still need to figure out where exactly the "allocate([....], ...)" 
>>> calls are generated and change the code in there.
>>>
>>> If only for the sake of speeding up the JS parser, I wonder if some 
>>> basic inline RLE compression could be done as well. It would most probably 
>>> not help with the gzipped file, but would keep the uncompressed JS file 
>>> smaller and potentially speed up parsing at the expense of a small runtime 
>>> overhead to expand the RLE-encoded byte sequences into a region on the heap. 
>>>
>>
>> Hm, I wonder if the improved JS parse time would be offset by the more 
>> complex decoding / startup JITting.  Probably worth measuring.
>>
>> Either way, a straight up string literal would be a huge improvement over 
>> the status quo for people who can't or don't want to use a separate meminit 
>> binary file.
>>
>> Thanks for investigating this.  :)
>>
>>
>>> Soeren
>>>
>>>
>>> On Monday, December 22, 2014 7:58:26 AM UTC+10, Chad Austin wrote:
>>>>
>>>> Hi Soeren,
>>>>
>>>> @evanw and I have done similar research in this issue: 
>>>> https://github.com/kripken/emscripten/issues/2188
>>>>
>>>> If we represent the meminit block as a large string literal rather than 
>>>> an array of 8-bit numbers, it would reduce code size by about 50%, improve 
>>>> JavaScript parse time, AND make it more readable, as C string literals 
>>>> would be visible in the output.
>>>>
>>>> Fixing this has been on our wishlist for some time and if you want to 
>>>> take a crack at it, we would be thrilled!
>>>>
>>>> Let me know if there's anything we can do to help,
>>>> Chad
>>>>
>>>>
>>>> On Sat, Dec 20, 2014 at 11:48 PM, Soeren Balko <[email protected]> 
>>>> wrote:
>>>>
>>>>> I played around with the separate memory init file and was surprised 
>>>>> to see that it actually increases the total code size. The numbers I 
>>>>> got are:
>>>>>
>>>>> * JS with inline memory initialization: 23186642 bytes
>>>>> * JS and separate memory init file:  15250276+8988744 = 24239020 bytes
>>>>>
>>>>> That's a bit surprising to me, as I would expect the binary memory init 
>>>>> file to spend one byte per, well, byte in HEAP8. Also, the inline memory 
>>>>> initializer is a plain JS array, which is unnecessarily large (each value 
>>>>> takes 1-3 bytes for its digits plus 1 byte for the comma). If the 
>>>>> initial memory values were encoded as a UTF-8 string (and retrieved at 
>>>>> runtime using String.charCodeAt), there would only be 1-2 bytes per 
>>>>> "entry" (= byte on the heap); on average 1.5 bytes if the memory init 
>>>>> values are uniformly distributed. Of course, that would produce 
>>>>> non-printable characters in the generated JS file. Not sure if all JS 
>>>>> interpreters would like that. If not, base64 (or basE91 for less 
>>>>> overhead - see http://base91.sourceforge.net/) would still use up less 
>>>>> space in the JS file. 
>>>>>
>>>>> If no one objects, I would work on implementing the latter.
>>>>>
>>>>> Soeren
>>>>>
>>>>> -- 
>>>>> You received this message because you are subscribed to the Google 
>>>>> Groups "emscripten-discuss" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>>> an email to [email protected].
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>>>
>>>>
>>>> -- 
>>>> Chad Austin
>>>> Technical Director, IMVU
>>>> http://engineering.imvu.com <http://www.imvu.com/members/Chad/>
>>>> http://chadaustin.me
>>>>
>>>>
>>>>  
>>>
>>
>>
>>
>>
>>
>>
>>
>>
>> Soeren Balko, PhD
>> Founder & Director
>> zfaas Pty Ltd
>> Brisbane, QLD
>> Australia
>>
>>
>>  
>>

