Re: Separate memory init file enlarges overall code size

Soeren Balko Fri, 26 Dec 2014 21:53:16 -0800

I just opened two pull requests against the incoming branches of the 
emscripten-fastcomp and emscripten repositories: 
https://github.com/kripken/emscripten-fastcomp/pull/57, 
https://github.com/kripken/emscripten/pull/3106. These patches make the 
backend render statically allocated memory as an (escaped) UTF-8 string. 
To control the functionality, I added the configuration option 
"UTF8_STATIC_MEMORY" (see settings.js). It is on by default. When set to 0, 
the static memory is generated as before (i.e., as JS arrays of integers 
representing byte values).
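
As a rough illustration of the two output forms described above, here is a
minimal Python sketch. It is not the actual JSBackend.cpp code; the function
names and the exact escaping rules are assumptions made for this example. It
renders the same bytes once as a JS array of integers and once as an escaped
string literal. Remaining control bytes are written as \xNN so that no octal
escape can run into a following digit.

```python
# Hypothetical helpers for illustration only; not part of emscripten.

# Two-character JavaScript escapes, plus the quote and backslash themselves.
SHORT_ESCAPES = {
    0x08: "\\b", 0x09: "\\t", 0x0A: "\\n", 0x0B: "\\v",
    0x0C: "\\f", 0x0D: "\\r", 0x22: '\\"', 0x5C: "\\\\",
}


def to_js_array(data: bytes) -> str:
    """Render bytes as a JS array literal of decimal integers."""
    return "[" + ",".join(str(b) for b in data) + "]"


def to_js_string(data: bytes) -> str:
    """Render bytes as a JS string literal whose code points are the byte values."""
    out = []
    for b in data:
        if b in SHORT_ESCAPES:
            out.append(SHORT_ESCAPES[b])
        elif 0x20 <= b <= 0x7E or b >= 0xA0:
            out.append(chr(b))          # emitted literally (1 or 2 bytes in UTF-8)
        else:
            out.append("\\x%02x" % b)   # fixed-length hex escape
    return '"' + "".join(out) + '"'


if __name__ == "__main__":
    data = b"Hello, world!\n\x00\x00\x00\x00"
    print(to_js_array(data), len(to_js_array(data).encode("utf-8")))
    print(to_js_string(data), len(to_js_string(data).encode("utf-8")))
```

The printed lengths are the UTF-8 file sizes of each literal, which is the
quantity this thread is concerned with.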


Enjoy,
Soeren

On Thursday, December 25, 2014 2:57:53 PM UTC+10, Soeren Balko wrote:
>
> @Alon: Found it (emscripten-fastcomp/lib/Target/JSBackend/JSBackend.cpp) 
> and will add the feature myself. I would suggest hiding it behind a flag 
> like "-s UTF8_MEMORY=1" or so. 
>
> Soeren
>
> On Wednesday, December 24, 2014 1:14:01 PM UTC+10, Soeren Balko wrote:
>>
>> I just submitted a pull request that extends the "allocate" function to 
>> accept static memory defined as a UTF-8 string in which the Unicode 
>> code points of the characters are the byte values: 
>> https://github.com/kripken/emscripten/pull/3106
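
The decoding side of that idea can be sketched in a few lines of Python. This
is only an illustration under stated assumptions: the function name is
hypothetical, and the real change lives in emscripten's JavaScript "allocate"
function, which is not reproduced here. Each character's code point is taken
as one byte value, the way String.charCodeAt would be used in JavaScript.

```python
def bytes_from_code_points(text: str) -> bytearray:
    """Turn a string into bytes by treating each code point as a byte value."""
    heap = bytearray(len(text))
    for i, ch in enumerate(text):
        code = ord(ch)  # plays the role of text.charCodeAt(i) in JavaScript
        if code > 0xFF:
            raise ValueError("code point U+%04X does not fit in one byte" % code)
        heap[i] = code
    return heap


if __name__ == "__main__":
    print(list(bytes_from_code_points("Hi\x00\xe9")))
```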
>>
>> To replace the current representation of static memory as JavaScript 
>> arrays with compact UTF-8 strings (see my previous post), I created a 
>> "poor man's solution": a simple node script that uses regular expressions 
>> to find all "allocate([...], ...)" calls in the emscripten-generated 
>> JavaScript "binary" and replaces them with "allocate("...", ...)". The 
>> resulting reduction in code size is quite noticeable. I did not measure 
>> the impact on parsing times, though: 
>> https://gist.github.com/anonymous/74196a36efbb4733a6f5
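
For readers who do not want to open the gist, the general shape of such a
rewrite can be sketched as follows. This is a Python approximation, not the
node script itself. The regular expression, the helper names and the sample
"allocate" call are assumptions for illustration, and the sketch assumes that
every array element is an integer in the range 0..255.

```python
import re

# Matches the start of a call such as: allocate([72,105,0], ...
ALLOCATE_ARRAY = re.compile(r"allocate\(\[([0-9,\s]*)\]")


def escape_byte(b: int) -> str:
    """Escape one byte value for use inside a double-quoted JS string."""
    if b in (0x22, 0x5C):               # double quote and backslash
        return "\\" + chr(b)
    if 0x20 <= b <= 0x7E or b >= 0xA0:  # emitted literally
        return chr(b)
    return "\\x%02x" % b                # everything else as a hex escape


def rewrite_allocate_calls(js_source: str) -> str:
    """Replace integer-array arguments of allocate() with string literals."""
    def replace(match):
        values = [int(v) for v in match.group(1).split(",") if v.strip()]
        return 'allocate("' + "".join(escape_byte(v) for v in values) + '"'
    return ALLOCATE_ARRAY.sub(replace, js_source)


if __name__ == "__main__":
    print(rewrite_allocate_calls('allocate([72,105,0], "i8", ALLOC_STATIC);'))
```

The remaining arguments of the call are left untouched; only the array
literal is swapped for a string.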
>>
>> @Alon: Obviously, that functionality should be integrated into emscripten 
>> itself. However, after the change to the LLVM backend, I haven't bothered 
>> finding my way in there. Can you please suggest where to look (or simply 
>> incorporate the functionality yourself, if that's a quick addition)?
>>
>> Happy holidays everyone,
>> Soeren
>>
>> On Tuesday, December 23, 2014 9:26:50 AM UTC+10, Soeren Balko wrote:
>>>
>>> Another (minor) optimization is to use the standard JavaScript escapes 
>>> \t, \b, \f, \n, \r, and \v (2 bytes each) instead of octal sequences (3 
>>> bytes, unless the next character is a digit, in which case the 
>>> fixed-length [4-byte] hex \xYZ encoding must be used). 
>>>
>>> Generally though, I cannot confirm that the "ministr" memory 
>>> representation is smaller than base64. In my case it is, in fact, larger. 
>>> Assuming a uniform distribution of byte values, the ministr representation 
>>> in UTF-8 uses:
>>>
>>> 1 byte for the 95 "Basic Latin" characters with a Unicode code point 
>>> between U+0020...U+007E - 37.1%
>>> 2 bytes for the 96 "Latin-1 Supplement" characters with a Unicode code 
>>> point between U+00A0...U+00FF - 37.5%
>>> 2 bytes for the 7 JavaScript escape sequences \0, \t, \b, \f, \n, \r, \v 
>>> (ignoring the \0 followed by digit case) - 2.7%
>>> 3 bytes for the remaining 25 characters in octal representation between 
>>> U+0001...U+001F - 9.8%
>>> 4 bytes for the remaining 33 characters in hex representation between 
>>> U+007F...U+009F - 12.9%
>>>
>>> So on average, we get about 1.98 bytes per character. By comparison, 
>>> base64 uses 1.333 bytes per character (it only uses characters that take 
>>> one byte in UTF-8), but produces a non-human-readable memory 
>>> representation. For the existing int8-array representation, we get the 
>>> following:
>>>
>>> 2 bytes for 10 characters in U+0000...U+0009 (one digit and one comma) - 
>>> 3.9%
>>> 3 bytes for 90 characters in U+000A...U+0063 (two digits and one comma) 
>>> - 35.2%
>>> 4 bytes for the remaining 156 characters in U+0064...U+00FF (three 
>>> digits and one comma) - 60.9%
>>>
>>> On average, that yields 3.57 bytes per character. 
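
The two averages can be re-derived with a short Python sketch. The
(cost, count) pairs are copied from the lists above and the uniform
distribution of byte values is the same assumption made there; the constant
and function names are made up for this example.

```python
# (UTF-8 bytes per byte value, number of byte values in that class)
MINISTR_CLASSES = [(1, 95), (2, 96), (2, 7), (3, 25), (4, 33)]
INT_ARRAY_CLASSES = [(2, 10), (3, 90), (4, 156)]


def average_cost(classes):
    """Expected output bytes per input byte under a uniform distribution."""
    total = sum(count for _, count in classes)
    if total != 256:
        raise ValueError("classes must cover all 256 byte values")
    return sum(cost * count for cost, count in classes) / total


if __name__ == "__main__":
    print(average_cost(MINISTR_CLASSES))    # 1.984375
    print(average_cost(INT_ARRAY_CLASSES))  # 3.5703125
    print(4 / 3)                            # base64: 4 output bytes per 3 input bytes
```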
>>>
>>> Of course, real-world static memory content is often skewed towards 
>>> certain byte values, e.g. \0 and Latin-1 text characters. In those cases, 
>>> the ministr approach may yield a more compact representation than base64. 
>>> Other baseX approaches (notably basE91) may be worth a try, but would 
>>> need a potentially slow, pure JavaScript-based implementation. 
>>>
>>> In the program that I looked at (ffmpeg), the static memory content 
>>> also seems to exhibit runs of identical byte values (often \0). These 
>>> are amenable to a simple RLE encoding scheme, which could be overlaid 
>>> on the ministr encoding. I am not sure this is worthwhile, as it is 
>>> essentially what gzip does anyway, and it comes with a small runtime 
>>> overhead to expand the RLE-encoded sequences.
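
To make the RLE idea concrete, here is a minimal Python sketch of run-length
encoding over bytes. The (value, run length) pair format and the function
names are assumptions for illustration; the thread does not fix a format, and
how the runs would be embedded in the string literal is left open.

```python
from typing import List, Tuple


def rle_encode(data: bytes) -> List[Tuple[int, int]]:
    """Collapse consecutive identical bytes into (value, run_length) pairs."""
    runs: List[Tuple[int, int]] = []
    for b in data:
        if runs and runs[-1][0] == b:
            runs[-1] = (b, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((b, 1))              # start a new run
    return runs


def rle_decode(runs: List[Tuple[int, int]]) -> bytes:
    """Expand (value, run_length) pairs back into the original bytes."""
    return b"".join(bytes([value]) * count for value, count in runs)


if __name__ == "__main__":
    sample = b"\x00\x00\x00\x00ab\x00\x00"
    encoded = rle_encode(sample)
    print(encoded)
    print(rle_decode(encoded) == sample)
```

The decode step is the small runtime overhead mentioned above: it has to run
once at startup before the heap contents are usable.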
>>>
>>> Soeren
>>>
>>> On Monday, December 22, 2014 3:13:10 PM UTC+10, Sören Balko wrote:
>>>>
>>>> I think the patch is here: https://gist.github.com/evanw/11339324
>>>>
>>>>
>>>> On 22 Dec 2014, at 15:11, Chad Austin <[email protected]> wrote:
>>>>
>>>> On Sun, Dec 21, 2014 at 10:22 PM, Soeren Balko <[email protected]> 
>>>> wrote:
>>>>
>>>>> So far, my tryout implementation is based on a script that I run using 
>>>>> --js-transform. It uses regular expressions to find integer arrays and 
>>>>> replaces them with some base64 string and a function wrapper around them 
>>>>> to turn them into an int8 array. I like the ministr approach as it 
>>>>> preserves the (printable) byte sequences (thus benefitting readability 
>>>>> of string literals) and apparently speeds up parsing time. If only they 
>>>>> had provided their escaping code for non-printable characters. 
>>>>>
>>>>
>>>> Here is the code I wrote for my tests: 
>>>> https://github.com/chadaustin/Web-Benchmarks/blob/master/meminit/meminit.py
>>>>
>>>> Evan pointed out that my code is incorrect in the case of an octal 
>>>> escape followed by numeric digits, but I don't think he posted his code.
>>>>  
>>>>
>>>>> Also, I still need to figure where exactly the "allocate([....], ...)" 
>>>>> calls are generated and change the code in there.
>>>>>
>>>>> If only for the sake of speeding up the JS parser, I wonder if some 
>>>>> basic inline RLE compression could be done as well. It would most 
>>>>> probably not help with the gzipped file, but it would keep the 
>>>>> uncompressed JS file smaller and potentially speed up parsing, at the 
>>>>> expense of a small runtime overhead to expand the RLE-encoded byte 
>>>>> sequences into a region on the heap. 
>>>>>
>>>>
>>>> Hm, I wonder if the improved JS parse time would be offset by the more 
>>>> complex decoding / startup JITting.  Probably worth measuring.
>>>>
>>>> Either way, a straight up string literal would be a huge improvement 
>>>> over the status quo for people who can't or don't want to use a separate 
>>>> meminit binary file.
>>>>
>>>> Thanks for investigating this.  :)
>>>>
>>>>
>>>>> Soeren
>>>>>
>>>>>
>>>>> On Monday, December 22, 2014 7:58:26 AM UTC+10, Chad Austin wrote:
>>>>>>
>>>>>> Hi Soeren,
>>>>>>
>>>>>> @evanw and I have done similar research in this issue: 
>>>>>> https://github.com/kripken/emscripten/issues/2188
>>>>>>
>>>>>> If we represent the meminit block as a large string literal rather 
>>>>>> than an array of 8-bit numbers, it would reduce code size by about 50%, 
>>>>>> improve JavaScript parse time, AND make it more readable, as C string 
>>>>>> literals would be visible in the output.
>>>>>>
>>>>>> Fixing this has been on our wishlist for some time and if you want to 
>>>>>> take a crack at it, we would be thrilled!
>>>>>>
>>>>>> Let me know if there's anything we can do to help,
>>>>>> Chad
>>>>>>
>>>>>>
>>>>>> On Sat, Dec 20, 2014 at 11:48 PM, Soeren Balko <[email protected]> 
>>>>>> wrote:
>>>>>>
>>>>>>> I played around with the separate memory init file and was surprised 
>>>>>>> to see that it does, in fact, increase the total code size. The 
>>>>>>> numbers I got are:
>>>>>>>
>>>>>>> * JS with inline memory initialization: 23186642 bytes
>>>>>>> * JS and separate memory init file: 15250276+8988744 = 24239020 bytes
>>>>>>>
>>>>>>> That's a bit surprising to me, as I would expect the binary memory 
>>>>>>> init file to spend one byte per, well, byte in HEAP8. Also, the inline 
>>>>>>> memory initializer is a plain JS array, which is unnecessarily large 
>>>>>>> (each value takes 1-3 bytes plus 1 byte for the comma). If the initial 
>>>>>>> memory values were encoded as a UTF-8 string (and retrieved at runtime 
>>>>>>> using String.charCodeAt), there would be only 1-2 bytes per "entry" 
>>>>>>> (= byte on the heap), or 1.5 bytes on average if the memory init 
>>>>>>> values are uniformly distributed. Of course, that would produce 
>>>>>>> non-printable characters in the generated JS file. Not sure if all JS 
>>>>>>> interpreters would like that. If not, base64 (or basE91 for less 
>>>>>>> overhead - see http://base91.sourceforge.net/) would still use up less 
>>>>>>> space in the JS file. 
>>>>>>>
>>>>>>> If no one objects, I would work on implementing the latter.
>>>>>>>
>>>>>>> Soeren
>>>>>>>
>>>>>>> -- 
>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>> Groups "emscripten-discuss" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>> send an email to [email protected].
>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> -- 
>>>>>> Chad Austin
>>>>>> Technical Director, IMVU
>>>>>> http://engineering.imvu.com <http://www.imvu.com/members/Chad/>
>>>>>> http://chadaustin.me
>>>>>>
>>>>>>
>>>>>>  
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Soeren Balko, PhD
>>>> Founder & Director
>>>> zfaas Pty Ltd
>>>> Brisbane, QLD
>>>> Australia
>>>>
>>>>
>>>>  
>>>>

