I just submitted a pull request that extends the "allocate" function to
accept static memory defined as a UTF-8 string, where the Unicode
code points of the characters are the byte values:
https://github.com/kripken/emscripten/pull/3106
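
To illustrate the idea (this is only a minimal sketch, not the code from the pull request; the helper name bytesFromString is made up for this example), decoding such a string back into bytes could look roughly like this:

```javascript
// Hypothetical helper, not part of emscripten: turns a string whose code
// points are byte values (0..255) into a typed array.
function bytesFromString(str) {
  var bytes = new Uint8Array(str.length);
  for (var i = 0; i < str.length; i++) {
    var code = str.charCodeAt(i);
    if (code > 0xFF) {
      // Anything above 255 cannot be a byte value, so reject it loudly.
      throw new RangeError('code point ' + code + ' at index ' + i + ' is not a byte');
    }
    bytes[i] = code;
  }
  return bytes;
}

var mem = bytesFromString('Hi\0\xFF');
// mem now holds the bytes 72, 105, 0, 255
```

The range check is a design choice of this sketch: silently truncating larger code points would hide encoding mistakes.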
In order to replace the current representation of static memory as
Javascript arrays with compact UTF-8 strings (see my previous post), I
created a "poor man's solution": a simple node script that uses regular
expressions to find all "allocate([...], ...)" calls in the
emscripten-generated Javascript "binary" and replaces them with
"allocate("...", ...)" calls. The resulting reduction in code size is quite
noticeable. I did not measure the impact on parsing times, though:
https://gist.github.com/anonymous/74196a36efbb4733a6f5
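
For readers who just want the gist of the approach, here is a simplified sketch of such a transformation. It is not the script from the gist: the function names are made up, the pattern only handles flat arrays of decimal integers, and for simplicity it uses the fixed-length \xHH escape instead of the shorter octal sequences:

```javascript
// Named escapes for a few control bytes; every other non-printable byte
// falls back to the fixed-length \xHH form (no octal, to keep this simple).
var NAMED = { 8: '\\b', 9: '\\t', 10: '\\n', 11: '\\v', 12: '\\f', 13: '\\r' };

function byteToLiteralChar(b) {
  if (NAMED[b]) return NAMED[b];
  if (b === 0x22) return '\\"';   // double quote
  if (b === 0x5C) return '\\\\';  // backslash
  // Printable ASCII and U+00A0...U+00FF are emitted as-is.
  if ((b >= 0x20 && b <= 0x7E) || b >= 0xA0) return String.fromCharCode(b);
  var hex = b.toString(16);
  return '\\x' + (hex.length < 2 ? '0' + hex : hex);
}

function bytesToStringLiteral(bytes) {
  var out = '"';
  for (var i = 0; i < bytes.length; i++) out += byteToLiteralChar(bytes[i]);
  return out + '"';
}

// Rewrites allocate([1,2,3], ...) into allocate("...", ...).
function transformSource(src) {
  return src.replace(/allocate\(\[([\d,\s]*)\]/g, function (match, list) {
    var bytes = list.split(',')
      .map(function (s) { return s.trim(); })
      .filter(function (s) { return s.length > 0; })
      .map(function (s) { return parseInt(s, 10); });
    // Leave arrays alone that contain values which are not bytes.
    if (bytes.some(function (b) { return b > 255; })) return match;
    return 'allocate(' + bytesToStringLiteral(bytes);
  });
}

// The trailing arguments here are only illustrative.
var result = transformSource('allocate([72,105,0,10,255], "i8", ALLOC_NONE)');
```

Arrays containing values above 255 are left untouched, so the sketch never changes the meaning of a call it does not understand.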
@Alon: Obviously, that functionality should be integrated into emscripten
itself. However, after the change to the LLVM backend, I haven't bothered
finding my way in there. Can you please suggest where to look (or simply
incorporate the functionality yourself, if that's a quick addition)?
Happy holidays everyone,
Soeren
On Tuesday, December 23, 2014 9:26:50 AM UTC+10, Soeren Balko wrote:
>
> Another (minor) optimization is to use the standard Javascript escapes \t,
> \b, \f, \n, \r, and \v (2 bytes each) instead of octal sequences (3 bytes,
> unless the sequence is followed by a digit, in which case the fixed-length
> [4-byte] hex \xYZ encoding must be used).
>
> Generally though, I cannot confirm that the "ministr" memory
> representation is smaller than base64. In my case, it is, in fact, larger.
> Assuming a uniform distribution of byte values, the ministr representation
> in UTF-8 uses:
>
> 1 byte for the 95 printable "Basic Latin" (ASCII) characters with a Unicode
> code point between U+0020...U+007E - 37.1%
> 2 bytes for the 96 "Latin-1 Supplement" characters with a Unicode code
> point between U+00A0...U+00FF - 37.5%
> 2 bytes for the 7 Javascript escape sequences \0, \t, \b, \f, \n, \r, \v
> (ignoring the \0 followed by digit case) - 2.7%
> 3 bytes for the remaining 25 characters in octal representation between
> U+0001...U+001F - 9.8%
> 4 bytes for the remaining 33 characters in hex representation between
> U+007F...U+009F - 12.9%
>
> So on average, we get some 1.984 bytes per character. In turn, base64 uses
> 1.333 bytes per character (it only uses characters that use one byte in
> UTF-8), but produces a non-human-readable memory representation. For the
> existing int8-array representation, we get the following:
>
> 2 bytes for 10 characters in U+0000...U+0009 (one digit and one comma) -
> 3.9%
> 3 bytes for 90 characters in U+000A...U+0063 (two digits and one comma) -
> 35.2%
> 4 bytes for the remaining 156 characters in U+0064...U+00FF (three digits
> and one comma) - 60.9%
>
> On average, that yields 3.57 bytes per character.
>
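As a quick sanity check of the averages quoted above, the per-class counts can simply be fed into a few lines of Javascript. This is only a back-of-the-envelope sketch; the variable and function names are made up, and the counts are copied from the two lists above:

```javascript
// Each entry is [number of byte values in the class, bytes of JS source per value].
var ministr = [[95, 1], [96, 2], [7, 2], [25, 3], [33, 4]];
var intArray = [[10, 2], [90, 3], [156, 4]];

function averageBytes(classes) {
  var count = 0, total = 0;
  classes.forEach(function (c) {
    count += c[0];
    total += c[0] * c[1];
  });
  // Guard against a class list that does not cover every byte value once.
  if (count !== 256) throw new Error('classes must cover all 256 byte values');
  return total / count;
}

console.log(averageBytes(ministr).toFixed(2));  // 1.98
console.log(averageBytes(intArray).toFixed(2)); // 3.57
console.log((4 / 3).toFixed(2));                // 1.33 (base64: 4 characters per 3 bytes)
```

All three figures assume uniformly distributed byte values, as stated above.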
> Of course, real-world static memory content is often skewed towards
> certain byte values, e.g. \0 and Latin-1 text characters. In those cases,
> the ministr approach may yield a more compact representation than base64.
> Other baseX approaches (notably: basE91) may be worth a try, but would
> need a potentially slow, pure Javascript-based implementation.
>
> In the program that I looked at (ffmpeg), the static memory content also
> seems to exhibit runs of identical byte values (often \0), which are
> amenable to a simple RLE encoding scheme that could be overlaid on
> the ministr encoding. Not sure if this is worth doing, as this is
> essentially what gzip does anyway, and it comes with a small runtime
> overhead to expand the RLE-encoded sequences.
>
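To make the RLE idea concrete, here is a minimal sketch. The [count, value] pair format and the function names are assumptions chosen for this example only; they are not an existing emscripten format, and an actual overlay on the ministr encoding would need its own escape scheme:

```javascript
// Hypothetical run-length scheme: a list of [count, value] pairs.
function rleEncode(bytes) {
  var runs = [];
  for (var i = 0; i < bytes.length; ) {
    var j = i;
    // Extend the run while the byte value stays the same.
    while (j < bytes.length && bytes[j] === bytes[i]) j++;
    runs.push([j - i, bytes[i]]);
    i = j;
  }
  return runs;
}

// Expands the pairs back into a flat byte array (the runtime overhead
// mentioned above).
function rleDecode(runs) {
  var length = 0;
  runs.forEach(function (r) { length += r[0]; });
  var out = new Uint8Array(length);
  var pos = 0;
  runs.forEach(function (r) {
    for (var k = 0; k < r[0]; k++) out[pos++] = r[1];
  });
  return out;
}

var runs = rleEncode([0, 0, 0, 7, 7, 1]);
// runs is [[3, 0], [2, 7], [1, 1]]
```

Whether this pays off would have to be measured, since short runs cost more in this format than the plain bytes do.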
> Soeren
>
> On Monday, December 22, 2014 3:13:10 PM UTC+10, Sören Balko wrote:
>>
>> I think the patch is here: https://gist.github.com/evanw/11339324
>>
>>
>> On 22 Dec 2014, at 15:11, Chad Austin <[email protected]> wrote:
>>
>> On Sun, Dec 21, 2014 at 10:22 PM, Soeren Balko <[email protected]> wrote:
>>
>>> So far, my tryout implementation is based on a script that I run using
>>> --js-transform. It uses regular expressions to find integer arrays and
>>> replaces them with some base64 string and a function wrapper around them to
>>> turn them into an int8 array. I like the ministr approach as it preserves
>>> the (printable) byte sequences (thus benefitting readability of string
>>> literals) and apparently speeds up parsing time. If only they had provided
>>> their escaping code for non-printable characters.
>>>
>>
>> Here is the code I wrote for my tests:
>> https://github.com/chadaustin/Web-Benchmarks/blob/master/meminit/meminit.py
>>
>> Evan pointed out that my code is incorrect in the case of an octal escape
>> followed by numeric digits, but I don't think he posted his code.
>>
>>
>>> Also, I still need to figure out where exactly the "allocate([....], ...)"
>>> calls are generated and change the code in there.
>>>
>>> If only for the sake of speeding up the JS parser, I wonder if some
>>> basic inline RLE compression could be done as well. It would most probably
>>> not help with the gzipped file, but it would keep the uncompressed JS file
>>> smaller and potentially speed up parsing at the expense of a small runtime
>>> overhead to expand the RLE-encoded byte sequences into a region on the heap.
>>>
>>
>> Hm, I wonder if the improved JS parse time would be offset by the more
>> complex decoding / startup JITting. Probably worth measuring.
>>
>> Either way, a straight up string literal would be a huge improvement over
>> the status quo for people who can't or don't want to use a separate meminit
>> binary file.
>>
>> Thanks for investigating this. :)
>>
>>
>>> Soeren
>>>
>>>
>>> On Monday, December 22, 2014 7:58:26 AM UTC+10, Chad Austin wrote:
>>>>
>>>> Hi Soeren,
>>>>
>>>> @evanw and I have done similar research in this issue:
>>>> https://github.com/kripken/emscripten/issues/2188
>>>>
>>>> If we represent the meminit block as a large string literal rather than
>>>> an array of 8-bit numbers, it would reduce code size by about 50%, improve
>>>> JavaScript parse time, AND make it more readable, as C string literals
>>>> would be visible in the output.
>>>>
>>>> Fixing this has been on our wishlist for some time and if you want to
>>>> take a crack at it, we would be thrilled!
>>>>
>>>> Let me know if there's anything we can do to help,
>>>> Chad
>>>>
>>>>
>>>> On Sat, Dec 20, 2014 at 11:48 PM, Soeren Balko <[email protected]>
>>>> wrote:
>>>>
>>>>> I played around with the separate memory init file and was surprised
>>>>> to see that it does, in fact, increase the total code size. The
>>>>> numbers I got are:
>>>>>
>>>>> * JS with inline memory initialization: 23186642 bytes
>>>>> * JS and separate memory init file: 15250276+8988744 = 24239020 bytes
>>>>>
>>>>> That's a bit surprising to me as I would expect the binary memory init
>>>>> file to spend one byte per, well, byte in HEAP8. Also, the inline memory
>>>>> initializer is a plain JS array, which is unnecessarily large (each value
>>>>> takes 1-3 bytes plus 1 byte for the comma). If the initial memory values
>>>>> were encoded as a UTF-8 string (and retrieved at runtime using
>>>>> String.charCodeAt), there would be only 1-2 bytes per "entry" (= byte on
>>>>> the heap); on average, if memory init values are uniformly distributed,
>>>>> 1.5 bytes. Of course, that would produce non-printable characters in the
>>>>> generated JS file. Not sure if all JS interpreters would like that. If
>>>>> not, base64 (or basE91 for less overhead - see
>>>>> http://base91.sourceforge.net/) would still use up less space in the
>>>>> JS file.
>>>>>
>>>>> If no one objects, I would work on implementing the latter.
>>>>>
>>>>> Soeren
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "emscripten-discuss" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Chad Austin
>>>> Technical Director, IMVU
>>>> http://engineering.imvu.com <http://www.imvu.com/members/Chad/>
>>>> http://chadaustin.me
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>>
>>
>>
>>
>>
>> Soeren Balko, PhD
>> Founder & Director
>> zfaas Pty Ltd
>> Brisbane, QLD
>> Australia
>>
>>
>>
>>