Re: Separate memory init file enlarges overall code size

Alon Zakai Mon, 29 Dec 2014 11:23:13 -0800

Thanks, I'll take a look at those pulls now. Let's move the discussion
there.


- Alon


On Fri, Dec 26, 2014 at 9:52 PM, Soeren Balko <[email protected]> wrote:

> I just opened two pull requests for the incoming branches of the
> emscripten-fastcomp and emscripten repositories:
> https://github.com/kripken/emscripten-fastcomp/pull/57,
> https://github.com/kripken/emscripten/pull/3106. These patches take care
> of rendering statically allocated memory as an (escaped) UTF8 string in the
> backend. In order to enable the functionality, I added the configuration
> option "UTF8_STATIC_MEMORY" (see settings.js). It's on by default. When set
> to 0, it will generate the static memory as before (i.e., as JS arrays of
> integers, representing byte values).
>
> Enjoy,
> Soeren
>
>
> On Thursday, December 25, 2014 2:57:53 PM UTC+10, Soeren Balko wrote:
>>
>> @Alon: Found it (emscripten-fastcomp/lib/Target/JSBackend/JSBackend.cpp)
>> and will add the feature myself. I would suggest hiding it behind a flag
>> like "-s UTF8_MEMORY=1" or so.
>>
>> Soeren
>>
>> On Wednesday, December 24, 2014 1:14:01 PM UTC+10, Soeren Balko wrote:
>>>
>>> I just submitted a pull request, which extends the "allocate" function
>>> to accept static memory defined as a UTF-8 string, where the Unicode
>>> character code points are the byte values:
>>> https://github.com/kripken/emscripten/pull/3106
>>>
>>> In order to replace the current representation of static memory as
>>> Javascript arrays with compact UTF-8 strings (see my previous post), I
>>> created a "poor man's solution": a simple node script that runs regexps
>>> over the emscripten-generated Javascript "binary" and replaces all
>>> "allocate([...], ...)" calls with "allocate("...", ...)". The resulting
>>> reduction in code size is quite noticeable, though I did not measure the
>>> impact on parsing times:
>>> https://gist.github.com/anonymous/74196a36efbb4733a6f5
>>>
>>> @Alon: Obviously, that functionality should be integrated into
>>> emscripten itself. However, after the change to the LLVM backend, I haven't
>>> bothered finding my way in there. Can you please suggest where to look (or
>>> simply incorporate the functionality yourself, if that's a quick addition)?
>>>
>>> Happy holidays everyone,
>>> Soeren
>>>
>>> On Tuesday, December 23, 2014 9:26:50 AM UTC+10, Soeren Balko wrote:
>>>>
>>>> Another (minor) optimization is to use the standard Javascript escapes
>>>> \t, \b, \f, \n, \r, and \v (2 bytes each) instead of octal sequences (3
>>>> bytes, unless the next character is a digit, in which case the
>>>> fixed-length 4-byte hex encoding \xYZ must be used).
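
The escaping rules described above can be sketched in a few lines of Python. This is only an illustration, not the code from either pull request: the function name `escape_bytes` and the table `SHORT_ESCAPES` are hypothetical, and escaping `"` and `\` is an added assumption needed to keep a double-quoted literal valid. Any following digit (0-9) forces the hex form, which is slightly more conservative than strictly necessary.

```python
# Hypothetical helper: maps byte values to the short Javascript escapes.
SHORT_ESCAPES = {
    0x08: "\\b", 0x09: "\\t", 0x0A: "\\n", 0x0B: "\\v",
    0x0C: "\\f", 0x0D: "\\r",
    0x22: '\\"', 0x5C: "\\\\",  # assumption: needed inside a "..." literal
}

def escape_bytes(data: bytes) -> str:
    """Return the body of a double-quoted JS string literal whose
    character codes equal the given byte values."""
    out = []
    for i, b in enumerate(data):
        # An octal escape directly followed by a digit would be misread.
        next_is_digit = i + 1 < len(data) and 0x30 <= data[i + 1] <= 0x39
        if b in SHORT_ESCAPES:
            out.append(SHORT_ESCAPES[b])
        elif 0x20 <= b <= 0x7E or b >= 0xA0:
            out.append(chr(b))            # emitted raw (1 or 2 bytes in UTF-8)
        elif b < 0x20 and not next_is_digit:
            out.append("\\%o" % b)        # short octal form, e.g. \0 or \37
        else:
            out.append("\\x%02x" % b)     # fixed-length hex form
    return "".join(out)

print(escape_bytes(b"hi\x00\x01" + b"2\n"))  # prints hi\0\x012\n
```

Note that octal escapes other than a lone \0 are rejected in strict-mode Javascript, so a generator targeting strict mode would have to use the hex form throughout.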
>>>>
>>>> Generally though, I cannot confirm that the "ministr" memory
>>>> representation is smaller than base64. In my case it is, in fact, larger.
>>>> Assuming a uniform distribution of byte values, the ministr representation
>>>> in UTF-8 uses:
>>>>
>>>> 1 byte for the 95 printable "Basic Latin" (ASCII) characters with a
>>>> Unicode code point between U+0020...U+007E - 37.1%
>>>> 2 bytes for the 96 "Latin 1 Supplement" characters with a Unicode code
>>>> point between U+00A0...U+00FF - 37.5%
>>>> 2 bytes for the 7 Javascript escape sequences \0, \t, \b, \f, \n, \r,
>>>> \v (ignoring the \0 followed by digit case) - 2.7%
>>>> 3 bytes for the remaining 25 characters in octal representation between
>>>> U+0001...U+001F - 9.8%
>>>> 4 bytes for the remaining 33 characters in hex representation between
>>>> U+007F...U+009F - 12.9%
>>>>
>>>> So on average, we get some 1.984 bytes per character. By comparison,
>>>> base64 uses 1.333 bytes per character (it only uses characters that take
>>>> one byte in UTF-8), but produces a non-human-readable memory
>>>> representation. For the existing int8-array representation, we get the
>>>> following:
>>>>
>>>> 2 bytes for 10 characters in U+0000...U+0009 (one digit and one comma)
>>>> - 3.9%
>>>> 3 bytes for 90 characters in U+000A...U+0063 (two digits and one comma)
>>>> - 35.2%
>>>> 4 bytes for the remaining 156 characters in U+0064...U+00FF (three
>>>> digits and one comma) - 60.9%
>>>>
>>>> On average, that yields 3.57 bytes per character.
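
The averages quoted above can be re-derived with a short calculation. The following is a minimal sketch that only restates the (count, size) groups listed in the message, assuming uniformly distributed byte values; the names `average_cost`, `MINISTR`, and `INT_ARRAY` are made up for this illustration.

```python
def average_cost(groups):
    """Average encoded size per heap byte, given (count, size) groups
    that together cover all 256 byte values."""
    total = sum(count for count, _ in groups)
    assert total == 256, "groups must cover every byte value exactly once"
    return sum(count * size for count, size in groups) / total

# (number of byte values, encoded size in bytes), as listed above
MINISTR = [(95, 1), (96, 2), (7, 2), (25, 3), (33, 4)]
INT_ARRAY = [(10, 2), (90, 3), (156, 4)]

ministr_avg = average_cost(MINISTR)      # 508 / 256
int_array_avg = average_cost(INT_ARRAY)  # 914 / 256
base64_avg = 4 / 3                       # 4 output characters per 3 input bytes

print(f"ministr:   {ministr_avg:.3f}")    # 1.984
print(f"int array: {int_array_avg:.3f}")  # 3.570
print(f"base64:    {base64_avg:.3f}")     # 1.333
```

Feeding in a measured byte histogram instead of uniform counts would show how far real static memory (for example, data dominated by \0 and ASCII text) shifts these numbers.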
>>>>
>>>> Of course, real-world static memory content is often skewed towards
>>>> certain byte values, e.g. \0 and Latin-1 text characters. In those cases,
>>>> the ministr approach may yield a more compact representation than base64.
>>>> Other baseX approaches (notably basE91) may be worth a try, but would
>>>> need a potentially slow, pure Javascript-based implementation.
>>>>
>>>> In the program that I looked at (ffmpeg), the static memory content
>>>> also seems to exhibit runs of identical byte values (often \0), which
>>>> are amenable to a simple RLE encoding scheme that could be overlaid on
>>>> the ministr encoding. Not sure if this is worth doing, as it is
>>>> essentially what gzip does anyway, and it comes with a small runtime
>>>> overhead to expand the RLE-encoded sequences.
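
As a rough illustration of the RLE idea, here is a minimal Python sketch. It is not taken from any patch in this thread: the chunk format, the function names `rle_encode` and `rle_decode`, and the `min_run` threshold are all assumptions chosen for clarity rather than compactness.

```python
def rle_encode(data: bytes, min_run: int = 4):
    """Split data into ('run', value, length) and ('raw', bytes) chunks.
    Runs shorter than min_run stay in the surrounding raw chunk."""
    chunks = []
    literal = bytearray()
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1                      # extend the run of identical bytes
        if j - i >= min_run:
            if literal:
                chunks.append(("raw", bytes(literal)))
                literal = bytearray()
            chunks.append(("run", data[i], j - i))
        else:
            literal += data[i:j]
        i = j
    if literal:
        chunks.append(("raw", bytes(literal)))
    return chunks

def rle_decode(chunks) -> bytes:
    """Expand the chunks back into the original byte sequence."""
    out = bytearray()
    for chunk in chunks:
        if chunk[0] == "run":
            out += bytes([chunk[1]]) * chunk[2]
        else:
            out += chunk[1]
    return bytes(out)

sample = b"abc" + bytes(1000) + b"xyz"
encoded = rle_encode(sample)
assert rle_decode(encoded) == sample
```

The decode loop is the "small runtime overhead" mentioned above; whether it pays for itself against a smaller uncompressed file would have to be measured.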
>>>>
>>>> Soeren
>>>>
>>>> On Monday, December 22, 2014 3:13:10 PM UTC+10, Sören Balko wrote:
>>>>>
>>>>> I think the patch is here: https://gist.github.com/evanw/11339324
>>>>>
>>>>>
>>>>> On 22 Dec 2014, at 15:11, Chad Austin <[email protected]> wrote:
>>>>>
>>>>> On Sun, Dec 21, 2014 at 10:22 PM, Soeren Balko <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> So far, my tryout implementation is based on a script that I run
>>>>>> using --js-transform. It uses regular expressions to find integer arrays
>>>>>> and replaces them with some base64 string and a function wrapper around
>>>>>> them to turn them into an int8 array. I like the ministr approach as it
>>>>>> preserves the (printable) byte sequences (thus benefitting readability of
>>>>>> string literals) and apparently speeds up parsing time. If only they had
>>>>>> provided their escaping code for non-printable characters.
>>>>>>
>>>>>
>>>>> Here is the code I wrote for my tests:
>>>>> https://github.com/chadaustin/Web-Benchmarks/blob/master/meminit/meminit.py
>>>>>
>>>>> Evan pointed out that my code is incorrect in the case of an octal
>>>>> escape followed by numeric digits, but I don't think he posted his code.
>>>>>
>>>>>
>>>>>> Also, I still need to figure where exactly the "allocate([....],
>>>>>> ...)" calls are generated and change the code in there.
>>>>>>
>>>>>> If only for the sake of speeding up the JS parser, I wonder if some
>>>>>> basic inline RLE compression could be done as well. It would most
>>>>>> probably not help with the gzipped file, but it would keep the
>>>>>> uncompressed JS file smaller and potentially speed up parsing, at the
>>>>>> expense of a small runtime overhead to expand the RLE-encoded byte
>>>>>> sequences into a region on the heap.
>>>>>>
>>>>>
>>>>> Hm, I wonder if the improved JS parse time would be offset by the more
>>>>> complex decoding / startup JITting.  Probably worth measuring.
>>>>>
>>>>> Either way, a straight up string literal would be a huge improvement
>>>>> over the status quo for people who can't or don't want to use a separate
>>>>> meminit binary file.
>>>>>
>>>>> Thanks for investigating this.  :)
>>>>>
>>>>>
>>>>>> Soeren
>>>>>>
>>>>>>
>>>>>> On Monday, December 22, 2014 7:58:26 AM UTC+10, Chad Austin wrote:
>>>>>>>
>>>>>>> Hi Soeren,
>>>>>>>
>>>>>>> @evanw and I have done similar research in this issue:
>>>>>>> https://github.com/kripken/emscripten/issues/2188
>>>>>>>
>>>>>>> If we represent the meminit block as a large string literal rather
>>>>>>> than an array of 8-bit numbers, it would reduce code size by about 50%,
>>>>>>> improve JavaScript parse time, AND make it more readable, as C string
>>>>>>> literals would be visible in the output.
>>>>>>>
>>>>>>> Fixing this has been on our wishlist for some time and if you want
>>>>>>> to take a crack at it, we would be thrilled!
>>>>>>>
>>>>>>> Let me know if there's anything we can do to help,
>>>>>>> Chad
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Dec 20, 2014 at 11:48 PM, Soeren Balko <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I played around with the separate memory init file and was
>>>>>>>> surprised to see that it does, in fact, increase the total code size.
>>>>>>>> The numbers I got are:
>>>>>>>>
>>>>>>>> * JS with inline memory initialization: 23186642 bytes
>>>>>>>> * JS and separate memory init file:  15250276+8988744 = 24239020
>>>>>>>> bytes
>>>>>>>>
>>>>>>>> That's a bit surprising to me, as I would expect the binary memory
>>>>>>>> init file to spend one byte per, well, byte in HEAP8. Also, the inline
>>>>>>>> memory initializer is a plain JS array, which is unnecessarily large
>>>>>>>> (each value takes 1-3 bytes per byte plus 1 byte for the comma). If the
>>>>>>>> initial memory values were encoded as a UTF-8 string (and retrieved at
>>>>>>>> runtime using String.charCodeAt), there would be only 1-2 bytes per
>>>>>>>> "entry" (= byte on the heap), or 1.5 bytes on average if memory init
>>>>>>>> values are uniformly distributed. Of course, that would produce
>>>>>>>> non-printable characters in the generated JS file. Not sure if all JS
>>>>>>>> interpreters would like that. If not, base64 (or basE91 for less
>>>>>>>> overhead - see http://base91.sourceforge.net/) would still use up less
>>>>>>>> space in the JS file.
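
The 1-2 byte estimate for raw (unescaped) UTF-8 can be checked directly. The snippet below is a small sketch under the same uniform-distribution assumption; Python's `ord` stands in for Javascript's String.charCodeAt, and the variable names are chosen only for this example.

```python
# Raw UTF-8 size of each character whose code point is a byte value 0..255.
sizes = [len(chr(b).encode("utf-8")) for b in range(256)]
average = sum(sizes) / len(sizes)
print(min(sizes), max(sizes), average)  # 1 2 1.5

# Round trip: build the string, then recover the bytes from the code points.
heap = bytes(range(256))
as_string = "".join(chr(b) for b in heap)
assert bytes(ord(c) for c in as_string) == heap
```

Values 0..127 take one byte each and values 128..255 take two, which gives the 1.5 average before any escaping of control characters is taken into account.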
>>>>>>>>
>>>>>>>> If no one objects, I would work on implementing the latter.
>>>>>>>>
>>>>>>>> Soeren
>>>>>>>>
>>>>>>>> --
>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>> Groups "emscripten-discuss" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>> send an email to [email protected].
>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Chad Austin
>>>>>>> Technical Director, IMVU
>>>>>>> http://engineering.imvu.com <http://www.imvu.com/members/Chad/>
>>>>>>> http://chadaustin.me
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Chad Austin
>>>>> Technical Director, IMVU
>>>>> http://engineering.imvu.com <http://www.imvu.com/members/Chad/>
>>>>> http://chadaustin.me
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Soeren Balko, PhD
>>>>> Founder & Director
>>>>> zfaas Pty Ltd
>>>>> Brisbane, QLD
>>>>> Australia
>>>>>
>>>>>
>>>>>
>

