Another (minor) optimization is to use the standard JavaScript escapes \t, \b, \f, \n, \r, and \v (2 bytes each) instead of octal sequences (3 bytes each, and only usable if the next character is not a digit; otherwise the fixed-length, 4-byte hex encoding \xYZ must be used).
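To make the escaping rules concrete, here is a minimal Python sketch of an escaper along these lines. It is not the code from the gist or from meminit.py; the name escape_ministr and the table SHORT_ESCAPES are hypothetical, and the sketch assumes the result is placed inside a double-quoted JavaScript string literal in a UTF-8 file. Note that octal escapes other than \0 are rejected in strict-mode JavaScript, so that branch would have to fall back to \xYZ there.

```python
# Hypothetical helper, for illustration only (not taken from the thread).
SHORT_ESCAPES = {
    0x08: "\\b", 0x09: "\\t", 0x0A: "\\n",
    0x0B: "\\v", 0x0C: "\\f", 0x0D: "\\r",
    0x22: '\\"', 0x5C: "\\\\",  # quote and backslash must be escaped too
}

def escape_ministr(data: bytes) -> str:
    """Return the body of a double-quoted JS string literal encoding data."""
    out = []
    for i, b in enumerate(data):
        # An octal escape followed by a digit would be misread, so check ahead.
        next_is_digit = i + 1 < len(data) and 0x30 <= data[i + 1] <= 0x39
        if b in SHORT_ESCAPES:
            out.append(SHORT_ESCAPES[b])
        elif b < 0x20:
            # Shortest octal form if safe, fixed-length hex otherwise.
            out.append("\\x%02x" % b if next_is_digit else "\\%o" % b)
        elif 0x7F <= b <= 0x9F:
            out.append("\\x%02x" % b)
        else:
            # Printable ASCII and U+00A0...U+00FF are emitted literally.
            out.append(chr(b))
    return "".join(out)
```

A caller would wrap the result in quotes, e.g. 'var s = "' + escape_ministr(data) + '";', and recover the bytes at runtime with String.charCodeAt.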
Generally though, I cannot confirm that the "ministr" memory representation is smaller than base64. In my case, it is in fact larger. Assuming a uniform distribution of byte values, the ministr representation in UTF-8 uses:

* 1 byte for the 95 printable ASCII ("Basic Latin") characters with a Unicode code point in U+0020...U+007E - 37.1%
* 2 bytes for the 96 "Latin-1 Supplement" characters with a Unicode code point in U+00A0...U+00FF - 37.5%
* 2 bytes for the 7 JavaScript escape sequences \0, \t, \b, \f, \n, \r, \v (ignoring the case of \0 followed by a digit) - 2.7%
* 3 bytes for the remaining 25 characters in U+0001...U+001F, in octal representation - 9.8%
* 4 bytes for the remaining 33 characters in U+007F...U+009F, in hex representation - 12.9%

So on average, we get (95*1 + 96*2 + 7*2 + 25*3 + 33*4) / 256 = 508 / 256, or about 1.98 bytes per character. In turn, base64 uses 1.333 bytes per character (4 output characters for every 3 input bytes, all of which take one byte in UTF-8), but produces a non-human-readable memory representation.

For the existing int8-array representation, we get the following:

* 2 bytes for 10 characters in U+0000...U+0009 (one digit and one comma) - 3.9%
* 3 bytes for 90 characters in U+000A...U+0063 (two digits and one comma) - 35.2%
* 4 bytes for the remaining 156 characters in U+0064...U+00FF (three digits and one comma) - 60.9%

On average, that yields (10*2 + 90*3 + 156*4) / 256 = 914 / 256, or about 3.57 bytes per character.

Of course, real-world static memory content is often skewed towards certain byte values, e.g. \0 and Latin-1 text characters. In those cases, the ministr approach may yield a more compact representation than base64. Other baseX approaches (notably basE91) may be worth a try, but would need a potentially slow, pure JavaScript implementation.

In the program that I looked at (ffmpeg), the static memory content also seems to exhibit runs of identical byte values (often \0), which are amenable to a simple RLE encoding scheme that could be overlaid on the ministr encoding.
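As a rough illustration of what such an RLE layer might look like, here is a minimal Python sketch. The token format (literal byte chunks interleaved with (count, value) pairs), the min_run threshold, and the names rle_encode and rle_decode are all assumptions made for this example; the thread does not specify a format, and a real implementation would decode in JavaScript at startup.

```python
from itertools import groupby

def rle_encode(data: bytes, min_run: int = 4):
    """Split data into literal chunks (bytes) and (count, value) run tokens."""
    tokens = []
    literal = bytearray()
    for value, group in groupby(data):
        count = sum(1 for _ in group)
        if count >= min_run:
            # Flush pending literal bytes before emitting the run token.
            if literal:
                tokens.append(bytes(literal))
                literal = bytearray()
            tokens.append((count, value))
        else:
            # Short runs are cheaper to keep as literal bytes.
            literal.extend([value] * count)
    if literal:
        tokens.append(bytes(literal))
    return tokens

def rle_decode(tokens) -> bytes:
    """Expand the tokens produced by rle_encode back into the original bytes."""
    out = bytearray()
    for token in tokens:
        if isinstance(token, tuple):
            count, value = token
            out.extend(bytes([value]) * count)
        else:
            out.extend(token)
    return bytes(out)
```

The literal chunks could then be passed through the string escaping, while the run tokens would be expanded directly into the heap.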
Not sure if this is worthwhile doing, as this is essentially what gzip does anyway, and it comes with a small runtime overhead to expand the RLE-encoded sequences.

Soeren

On Monday, December 22, 2014 3:13:10 PM UTC+10, Sören Balko wrote:
>
> I think the patch is here: https://gist.github.com/evanw/11339324
>
>
> On 22 Dec 2014, at 15:11, Chad Austin wrote:
>
> On Sun, Dec 21, 2014 at 10:22 PM, Soeren Balko wrote:
>
>> So far, my tryout implementation is based on a script that I run using
>> --js-transform. It uses regular expressions to find integer arrays and
>> replaces them with a base64 string and a function wrapper that turns
>> it into an int8 array. I like the ministr approach as it preserves
>> the (printable) byte sequences (thus benefitting readability of string
>> literals) and apparently speeds up parsing time. If only they had
>> provided their escaping code for non-printable characters.
>>
>
> Here is the code I wrote for my tests:
> https://github.com/chadaustin/Web-Benchmarks/blob/master/meminit/meminit.py
>
> Evan pointed out that my code is incorrect in the case of an octal escape
> followed by numeric digits, but I don't think he posted his code.
>
>
>> Also, I still need to figure out where exactly the
>> "allocate([....], ...)" calls are generated and change the code there.
>>
>> If only for the sake of speeding up the JS parser, I wonder if some basic
>> inline RLE compression could be done as well. It would most probably not
>> help with the gzipped file, but it would keep the uncompressed JS file
>> smaller and potentially speed up parsing at the expense of a small
>> runtime overhead to expand the RLE-encoded byte sequences into a region
>> on the heap.
>>
>
> Hm, I wonder if the improved JS parse time would be offset by the more
> complex decoding / startup JITting. Probably worth measuring.
>
> Either way, a straight up string literal would be a huge improvement over
> the status quo for people who can't or don't want to use a separate
> meminit binary file.
>
> Thanks for investigating this. :)
>
>
>> Soeren
>>
>>
>> On Monday, December 22, 2014 7:58:26 AM UTC+10, Chad Austin wrote:
>>>
>>> Hi Soeren,
>>>
>>> @evanw and I have done similar research in this issue:
>>> https://github.com/kripken/emscripten/issues/2188
>>>
>>> If we represent the meminit block as a large string literal rather than
>>> an array of 8-bit numbers, it would reduce code size by about 50%,
>>> improve JavaScript parse time, AND make it more readable, as C string
>>> literals would be visible in the output.
>>>
>>> Fixing this has been on our wishlist for some time and if you want to
>>> take a crack at it, we would be thrilled!
>>>
>>> Let me know if there's anything we can do to help,
>>> Chad
>>>
>>>
>>> On Sat, Dec 20, 2014 at 11:48 PM, Soeren Balko wrote:
>>>
>>>> I played around with the separate memory init file and was surprised to
>>>> see that it does, in fact, increase the total code size. The numbers I
>>>> got are:
>>>>
>>>> * JS with inline memory initialization: 23186642 bytes
>>>> * JS and separate memory init file: 15250276+8988744 = 24239020 bytes
>>>>
>>>> That's a bit surprising to me as I would expect the binary memory init
>>>> file to spend one byte per, well, byte in HEAP8. Also, the inline memory
>>>> initializer is a plain JS array, which is unnecessarily large (each
>>>> value takes 1-3 bytes per byte plus 1 byte for the comma). If the
>>>> initial memory values were encoded as a UTF-8 string (and retrieved at
>>>> runtime using String.charCodeAt), there would be only 1-2 bytes per
>>>> "entry" (= byte on the heap); on average, if memory init values are
>>>> uniformly distributed, 1.5 bytes. Of course, that would produce
>>>> non-printable characters in the generated JS file. Not sure if all JS
>>>> interpreters would like that. If not, base64 (or basE91 for less
>>>> overhead - see http://base91.sourceforge.net/) would still use up less
>>>> space in the JS file.
>>>>
>>>> If no one objects, I would work on implementing the latter.
>>>>
>>>> Soeren
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "emscripten-discuss" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>>
>>> --
>>> Chad Austin
>>> Technical Director, IMVU
>>> http://engineering.imvu.com <http://www.imvu.com/members/Chad/>
>>> http://chadaustin.me
>>>
>
> Soeren Balko, PhD
> Founder & Director
> zfaas Pty Ltd
> Brisbane, QLD
> Australia
