@Alon: Found it (emscripten-fastcomp/lib/Target/JSBackend/JSBackend.cpp) and will add the feature myself. I would suggest hiding it behind a flag like "-s UTF8_MEMORY=1" or so.
Soeren

On Wednesday, December 24, 2014 1:14:01 PM UTC+10, Soeren Balko wrote:
>
> I just submitted a pull request, which extends the "allocate" function to
> accept static memory defined as a UTF-8 string, where the Unicode
> character code points are the byte values:
> https://github.com/kripken/emscripten/pull/3106
>
> In order to replace the current representation of static memory as
> JavaScript arrays with compact UTF-8 strings (see my previous post), I
> created a "poor man's solution": a simple node script that runs regular
> expressions over the emscripten-generated JavaScript "binary" and replaces
> all "allocate([...], ...)" calls with "allocate("...", ...)". The resulting
> reduction in code size is quite noticeable - I did not measure the impact
> on parsing times, though:
> https://gist.github.com/anonymous/74196a36efbb4733a6f5
>
> @Alon: Obviously, that functionality should be integrated into emscripten
> itself. However, after the change to the LLVM backend, I haven't bothered
> finding my way in there. Can you please suggest where to look (or simply
> incorporate the functionality yourself, if that's a quick addition)?
>
> Happy holidays everyone,
> Soeren
>
> On Tuesday, December 23, 2014 9:26:50 AM UTC+10, Soeren Balko wrote:
>>
>> Another (minor) optimization is to use the standard JavaScript escapes
>> \t, \b, \f, \n, \r, and \v (2 bytes each) instead of octal sequences (3
>> bytes, unless followed by a digit, in which case the fixed-length [4-byte]
>> hex \xYZ encoding must be used).
>>
>> Generally though, I cannot confirm that the "ministr" memory
>> representation is smaller than base64. In my case, it is, in fact, larger.
>> Assuming a uniform distribution of byte values, the ministr representation
>> in UTF-8 uses:
>>
>> 1 byte for the 95 "Latin 1" characters with a Unicode code point between U+0020...U+007E - 37.1%
>> 2 bytes for the 96 "Latin-1 Supplement" characters with a Unicode code point between U+00A0...U+00FF - 37.5%
>> 2 bytes for the 7 JavaScript escape sequences \0, \t, \b, \f, \n, \r, \v (ignoring the \0 followed by digit case) - 2.7%
>> 3 bytes for the remaining 25 characters in octal representation between U+0001...U+001F - 9.8%
>> 4 bytes for the remaining 33 characters in hex representation between U+007F...U+009F - 12.9%
>>
>> So on average, we get about 1.98 bytes per character. By contrast, base64
>> uses 1.333 bytes per character (it only uses characters that take one byte
>> in UTF-8), but produces a non-human-readable memory representation. For the
>> existing int8-array representation, we get the following:
>>
>> 2 bytes for 10 characters in U+0000...U+0009 (one digit and one comma) - 3.9%
>> 3 bytes for 90 characters in U+000A...U+0063 (two digits and one comma) - 35.2%
>> 4 bytes for the remaining 156 characters in U+0064...U+00FF (three digits and one comma) - 60.9%
>>
>> On average, that yields 3.57 bytes per character.
>>
>> Of course, real-world static memory content is often skewed towards
>> certain byte values, e.g. \0 and Latin-1 text characters. In those cases,
>> the ministr approach may yield a more compact representation than base64.
>> Other baseX approaches (notably: basE91) may be worth a try, but would
>> need a potentially slow, pure JavaScript-based implementation.
>>
>> In the program that I looked at (ffmpeg), the static memory content also
>> seems to exhibit runs of identical byte values (often \0), which are
>> amenable to a simple RLE encoding scheme that could be overlaid on
>> the ministr encoding.
>> Not sure if this is worthwhile doing, as this is
>> essentially what gzip is doing anyway and it comes with a small runtime
>> overhead to expand the RLE-encoded sequences.
>>
>> Soeren
>>
>> On Monday, December 22, 2014 3:13:10 PM UTC+10, Sören Balko wrote:
>>>
>>> I think the patch is here: https://gist.github.com/evanw/11339324
>>>
>>>
>>> On 22 Dec 2014, at 15:11, Chad Austin <[email protected]> wrote:
>>>
>>> On Sun, Dec 21, 2014 at 10:22 PM, Soeren Balko <[email protected]> wrote:
>>>
>>>> So far, my tryout implementation is based on a script that I run using
>>>> --js-transform. It uses regular expressions to find integer arrays and
>>>> replaces each of them with a base64 string and a function wrapper around
>>>> it that turns it into an int8 array. I like the ministr approach as it
>>>> preserves the (printable) byte sequences (thus benefitting readability of
>>>> string literals) and apparently speeds up parsing. If only they had
>>>> provided their escaping code for non-printable characters.
>>>>
>>>
>>> Here is the code I wrote for my tests:
>>> https://github.com/chadaustin/Web-Benchmarks/blob/master/meminit/meminit.py
>>>
>>> Evan pointed out that my code is incorrect in the case of an octal
>>> escape followed by numeric digits, but I don't think he posted his code.
>>>
>>>
>>>> Also, I still need to figure out where exactly the "allocate([....], ...)"
>>>> calls are generated and change the code in there.
>>>>
>>>> If only for the sake of speeding up the JS parser, I wonder if some
>>>> basic inline RLE compression could be done as well. It would most probably
>>>> not help with the gzipped file, but it would keep the uncompressed JS file
>>>> smaller and potentially speed up parsing at the expense of a small runtime
>>>> overhead to expand the RLE-encoded byte sequences into a region on the heap.
>>>>
>>>
>>> Hm, I wonder if the improved JS parse time would be offset by the more
>>> complex decoding / startup JITting. Probably worth measuring.
>>>
>>> Either way, a straight up string literal would be a huge improvement
>>> over the status quo for people who can't or don't want to use a separate
>>> meminit binary file.
>>>
>>> Thanks for investigating this. :)
>>>
>>>
>>>> Soeren
>>>>
>>>>
>>>> On Monday, December 22, 2014 7:58:26 AM UTC+10, Chad Austin wrote:
>>>>>
>>>>> Hi Soeren,
>>>>>
>>>>> @evanw and I have done similar research in this issue:
>>>>> https://github.com/kripken/emscripten/issues/2188
>>>>>
>>>>> If we represent the meminit block as a large string literal rather
>>>>> than an array of 8-bit numbers, it would reduce code size by about 50%,
>>>>> improve JavaScript parse time, AND make it more readable, as C string
>>>>> literals would be visible in the output.
>>>>>
>>>>> Fixing this has been on our wishlist for some time and if you want to
>>>>> take a crack at it, we would be thrilled!
>>>>>
>>>>> Let me know if there's anything we can do to help,
>>>>> Chad
>>>>>
>>>>>
>>>>> On Sat, Dec 20, 2014 at 11:48 PM, Soeren Balko <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I played around with the separate memory init file and was surprised
>>>>>> to see that it does, in fact, increase the total code size. The
>>>>>> numbers I got are:
>>>>>>
>>>>>> * JS with inline memory initialization: 23186642 bytes
>>>>>> * JS and separate memory init file: 15250276+8988744 = 24239020 bytes
>>>>>>
>>>>>> That's a bit surprising to me as I would expect the binary memory
>>>>>> init file to spend one byte per, well, byte in HEAP8. Also, the inline
>>>>>> memory initializer is a plain JS array, which is unnecessarily large
>>>>>> (each value takes 1-3 bytes per byte plus 1 byte for the comma). If the
>>>>>> initial memory values were encoded as a UTF-8 string (and at runtime
>>>>>> retrieved using String.charCodeAt), there would be only 1-2 bytes per
>>>>>> "entry" (=byte on the heap); on average, if memory init values are
>>>>>> uniformly distributed: 1.5 bytes.
>>>>>> Of course, that would produce non-printable
>>>>>> characters in the generated JS file. Not sure if all JS interpreters
>>>>>> would like that. If not, base64 (or basE91 for less overhead - see
>>>>>> http://base91.sourceforge.net/) would still use up less space in
>>>>>> the JS file.
>>>>>>
>>>>>> If no one objects, I would work on implementing the latter.
>>>>>>
>>>>>> Soeren
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "emscripten-discuss" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to [email protected].
>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>
>>>>>
>>>>> --
>>>>> Chad Austin
>>>>> Technical Director, IMVU
>>>>> http://engineering.imvu.com <http://www.imvu.com/members/Chad/>
>>>>> http://chadaustin.me
>>>>>
>>>
>>> --
>>> Chad Austin
>>> Technical Director, IMVU
>>> http://engineering.imvu.com <http://www.imvu.com/members/Chad/>
>>> http://chadaustin.me
>>>
>>> --
>>> You received this message because you are subscribed to a topic in the
>>> Google Groups "emscripten-discuss" group.
>>> To unsubscribe from this topic, visit
>>> https://groups.google.com/d/topic/emscripten-discuss/ZmEdtOXH3QQ/unsubscribe.
>>> To unsubscribe from this group and all its topics, send an email to
>>> [email protected].
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>> Soeren Balko, PhD
>>> Founder & Director
>>> zfaas Pty Ltd
>>> Brisbane, QLD
>>> Australia
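
The per-byte size estimates quoted in this thread (ministr, base64, the int8 array, and the 1.5-byte figure for a raw UTF-8 string) can be re-derived in a few lines. The following is only a sketch under the thread's own assumptions: byte values are uniformly distributed, and the class tallies are copied from the messages above rather than measured on real output. The names `average_cost`, `MINISTR`, `INT_ARRAY` and `RAW_UTF8` are made up for this illustration.

```python
# Each entry is (number of byte values in the class, bytes used in the JS source),
# taken from the tallies quoted in the thread. Uniform byte distribution is assumed.
MINISTR = [
    (95, 1),   # U+0020...U+007E, emitted as-is
    (96, 2),   # U+00A0...U+00FF, two bytes each in UTF-8
    (7, 2),    # \0 \t \b \f \n \r \v
    (25, 3),   # remaining control characters as octal escapes (thread's count)
    (33, 4),   # U+007F...U+009F as \xYZ
]
INT_ARRAY = [
    (10, 2),   # one digit plus comma
    (90, 3),   # two digits plus comma
    (156, 4),  # three digits plus comma
]
RAW_UTF8 = [
    (128, 1),  # U+0000...U+007F, unescaped
    (128, 2),  # U+0080...U+00FF, unescaped
]


def average_cost(classes):
    """Average source bytes per heap byte over all 256 byte values."""
    total = sum(count for count, _ in classes)
    if total != 256:
        raise ValueError("classes must cover all 256 byte values")
    return sum(count * cost for count, cost in classes) / total


ministr_avg = average_cost(MINISTR)
int_array_avg = average_cost(INT_ARRAY)
raw_utf8_avg = average_cost(RAW_UTF8)
base64_avg = 4 / 3  # base64 emits 4 ASCII characters per 3 input bytes

if __name__ == "__main__":
    for name, value in [("ministr", ministr_avg), ("base64", base64_avg),
                        ("int8 array", int_array_avg), ("raw UTF-8", raw_utf8_avg)]:
        print(f"{name}: {value:.3f} bytes per byte")
```

With these tallies the ministr average comes out at 508/256, roughly 1.98, and the int8 array at 914/256, roughly 3.57. Real memory images are not uniform, so the numbers only bound what to expect.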

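The escaping rules discussed in the thread, including the octal-escape-followed-by-a-digit case that Evan flagged, could look roughly like the sketch below. It is not the code from meminit.py, the gist, or the pull request; `js_string_literal` and `NAMED` are hypothetical names, and the choice to fall back to \xNN before any ASCII digit is an assumption made to stay on the safe side. Note that octal escapes are a legacy feature that strict-mode JavaScript rejects, so a generator targeting strict code would need \xNN throughout.

```python
# Two-character escapes accepted in JavaScript string literals, plus the
# quote and backslash, which must always be escaped inside "...".
NAMED = {
    0x08: "\\b", 0x09: "\\t", 0x0A: "\\n", 0x0B: "\\v",
    0x0C: "\\f", 0x0D: "\\r", 0x22: '\\"', 0x5C: "\\\\",
}


def js_string_literal(data: bytes) -> str:
    """Return a double-quoted JS literal whose char codes equal the bytes in data."""
    out = []
    for i, byte in enumerate(data):
        # An octal escape directly followed by a digit would be misread,
        # so the fixed-length \xNN form is used in that case.
        next_is_digit = i + 1 < len(data) and 0x30 <= data[i + 1] <= 0x39
        if byte in NAMED:
            out.append(NAMED[byte])
        elif byte < 0x20:
            if next_is_digit:
                out.append("\\x%02x" % byte)
            else:
                out.append("\\%o" % byte)  # shortest octal form, e.g. \0 or \37
        elif byte < 0x7F or byte >= 0xA0:
            out.append(chr(byte))  # printable; stored as 1 or 2 UTF-8 bytes
        else:
            out.append("\\x%02x" % byte)  # U+007F...U+009F
    return '"' + "".join(out) + '"'


if __name__ == "__main__":
    print(js_string_literal(b"Hi\x00\x001\n"))
```

On the JavaScript side, the decoder would read each byte back with String.charCodeAt, as suggested in the original post.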