Thanks, I'll take a look at those pull requests now. Let's move the discussion there.
- Alon

On Fri, Dec 26, 2014 at 9:52 PM, Soeren Balko <[email protected]> wrote:

> I just opened two pull requests for the incoming branches of the
> emscripten-fastcomp and emscripten repositories:
> https://github.com/kripken/emscripten-fastcomp/pull/57,
> https://github.com/kripken/emscripten/pull/3106. These patches take care
> of rendering statically allocated memory as an (escaped) UTF8 string in the
> backend. In order to enable the functionality, I added the configuration
> option "UTF8_STATIC_MEMORY" (see settings.js). It's on by default. When set
> to 0, it will generate the static memory as before (i.e., as JS arrays of
> integers, representing byte values).
>
> Enjoy,
> Soeren
>
>
> On Thursday, December 25, 2014 2:57:53 PM UTC+10, Soeren Balko wrote:
>>
>> @Alon: Found it (emscripten-fastcomp/lib/Target/JSBackend/JSBackend.cpp)
>> and will add the feature myself. I would suggest hiding it behind a flag
>> like "-s UTF8_MEMORY=1" or so.
>>
>> Soeren
>>
>> On Wednesday, December 24, 2014 1:14:01 PM UTC+10, Soeren Balko wrote:
>>>
>>> I just submitted a pull request, which extends the "allocate" function
>>> to accept static memory defined as a UTF-8 string, where the Unicode
>>> character code points are the byte values:
>>> https://github.com/kripken/emscripten/pull/3106
>>>
>>> In order to replace the current representation of static memory as
>>> JavaScript arrays with compact UTF-8 strings (see my previous post), I
>>> created a "poor man's solution": a simple node script that runs regexps
>>> over the emscripten-generated JavaScript "binary" and replaces all
>>> "allocate([...], ...)" calls with "allocate("...", ...)". The resulting
>>> reduction in code size is quite noticeable. I did not measure the impact
>>> on parsing times, though:
>>> https://gist.github.com/anonymous/74196a36efbb4733a6f5
>>>
>>> @Alon: Obviously, that functionality should be integrated into
>>> emscripten itself.
>>> However, after the change to the LLVM backend, I haven't
>>> bothered finding my way in there. Can you please suggest where to look (or
>>> simply incorporate the functionality yourself, if that's a quick addition)?
>>>
>>> Happy holidays everyone,
>>> Soeren
>>>
>>> On Tuesday, December 23, 2014 9:26:50 AM UTC+10, Soeren Balko wrote:
>>>>
>>>> Another (minor) optimization is to use the standard JavaScript escapes
>>>> \t, \b, \f, \n, \r, and \v (2 bytes each) instead of octal sequences
>>>> (3 bytes if not followed by a digit; if a digit follows, the fixed-length
>>>> [4 byte] hex \xYZ encoding must be used instead).
>>>>
>>>> Generally though, I cannot confirm that the "ministr" memory
>>>> representation is smaller than base64. In my case, it is, in fact, larger.
>>>> Assuming a uniform distribution of byte values, the ministr representation
>>>> in UTF-8 uses:
>>>>
>>>> 1 byte for the 95 printable "Basic Latin" (ASCII) characters with a Unicode code point between U+0020...U+007E - 37.1%
>>>> 2 bytes for the 96 "Latin-1 Supplement" characters with a Unicode code point between U+00A0...U+00FF - 37.5%
>>>> 2 bytes for the 7 JavaScript escape sequences \0, \t, \b, \f, \n, \r, \v (ignoring the case of \0 followed by a digit) - 2.7%
>>>> 3 bytes for the remaining 25 characters in octal representation between U+0001...U+001F - 9.8%
>>>> 4 bytes for the remaining 33 characters in hex representation between U+007F...U+009F - 12.9%
>>>>
>>>> So on average, we get some 1.98 bytes per character. In turn, base64
>>>> uses 1.333 bytes per character (it only uses characters that take one byte
>>>> in UTF-8), but produces a non-human-readable memory representation.
>>>> For the existing int8-array representation, we get the following:
>>>>
>>>> 2 bytes for 10 characters in U+0000...U+0009 (one digit and one comma) - 3.9%
>>>> 3 bytes for 90 characters in U+000A...U+0063 (two digits and one comma) - 35.2%
>>>> 4 bytes for the remaining 156 characters in U+0064...U+00FF (three digits and one comma) - 60.9%
>>>>
>>>> On average, that yields 3.57 bytes per character.
>>>>
>>>> Of course, real-world static memory content is often skewed towards
>>>> certain byte values, e.g. \0 and Latin-1 text characters. In those cases,
>>>> the ministr approach may yield a more compact representation than base64.
>>>> Other baseX approaches (notably: basE91) may be worth a try, but would
>>>> need a potentially slow, pure JavaScript implementation.
>>>>
>>>> In the program that I looked at (ffmpeg), the static memory content
>>>> also seems to exhibit runs of identical byte values (often \0), which
>>>> are amenable to a simple RLE encoding scheme that could be layered on
>>>> top of the ministr encoding. Not sure if this is worth doing, as this is
>>>> essentially what gzip is doing anyway, and it comes with a small runtime
>>>> overhead to expand the RLE-encoded sequences.
>>>>
>>>> Soeren
>>>>
>>>> On Monday, December 22, 2014 3:13:10 PM UTC+10, Sören Balko wrote:
>>>>>
>>>>> I think the patch is here: https://gist.github.com/evanw/11339324
>>>>>
>>>>>
>>>>> On 22 Dec 2014, at 15:11, Chad Austin <[email protected]> wrote:
>>>>>
>>>>> On Sun, Dec 21, 2014 at 10:22 PM, Soeren Balko <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> So far, my tryout implementation is based on a script that I run
>>>>>> using --js-transform. It uses regular expressions to find integer arrays
>>>>>> and replaces them with a base64 string and a function wrapper around
>>>>>> it that turns it into an int8 array.
>>>>>> I like the ministr approach as it
>>>>>> preserves the (printable) byte sequences (thus benefiting readability of
>>>>>> string literals) and apparently speeds up parsing. If only they had
>>>>>> provided their escaping code for non-printable characters.
>>>>>>
>>>>>
>>>>> Here is the code I wrote for my tests:
>>>>> https://github.com/chadaustin/Web-Benchmarks/blob/master/meminit/meminit.py
>>>>>
>>>>> Evan pointed out that my code is incorrect in the case of an octal
>>>>> escape followed by numeric digits, but I don't think he posted his code.
>>>>>
>>>>>
>>>>>> Also, I still need to figure out where exactly the "allocate([....],
>>>>>> ...)" calls are generated and change the code in there.
>>>>>>
>>>>>> If only for the sake of speeding up the JS parser, I wonder if some
>>>>>> basic inline RLE compression could be done as well. It would most
>>>>>> probably not help with the gzipped file, but it would keep the
>>>>>> uncompressed JS file smaller and potentially speed up parsing at the
>>>>>> expense of a small runtime overhead to expand the RLE-encoded byte
>>>>>> sequences into a region on the heap.
>>>>>>
>>>>>
>>>>> Hm, I wonder if the improved JS parse time would be offset by the more
>>>>> complex decoding / startup JITting. Probably worth measuring.
>>>>>
>>>>> Either way, a straight up string literal would be a huge improvement
>>>>> over the status quo for people who can't or don't want to use a separate
>>>>> meminit binary file.
>>>>>
>>>>> Thanks for investigating this.
>>>>> :)
>>>>>
>>>>>
>>>>>> Soeren
>>>>>>
>>>>>>
>>>>>> On Monday, December 22, 2014 7:58:26 AM UTC+10, Chad Austin wrote:
>>>>>>>
>>>>>>> Hi Soeren,
>>>>>>>
>>>>>>> @evanw and I have done similar research in this issue:
>>>>>>> https://github.com/kripken/emscripten/issues/2188
>>>>>>>
>>>>>>> If we represent the meminit block as a large string literal rather
>>>>>>> than an array of 8-bit numbers, it would reduce code size by about 50%,
>>>>>>> improve JavaScript parse time, AND make it more readable, as C string
>>>>>>> literals would be visible in the output.
>>>>>>>
>>>>>>> Fixing this has been on our wishlist for some time and if you want
>>>>>>> to take a crack at it, we would be thrilled!
>>>>>>>
>>>>>>> Let me know if there's anything we can do to help,
>>>>>>> Chad
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Dec 20, 2014 at 11:48 PM, Soeren Balko <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I played around with the separate memory init file and was
>>>>>>>> surprised to see that it does, in fact, increase the total code size.
>>>>>>>> The numbers I got are:
>>>>>>>>
>>>>>>>> * JS with inline memory initialization: 23186642 bytes
>>>>>>>> * JS and separate memory init file: 15250276+8988744 = 24239020
>>>>>>>> bytes
>>>>>>>>
>>>>>>>> That's a bit surprising to me as I would expect the binary memory
>>>>>>>> init file to spend one byte per, well, byte in HEAP8. Also, the inline
>>>>>>>> memory initializer is a plain JS array, which is unnecessarily large
>>>>>>>> (each value takes 1-3 bytes plus 1 byte for the comma). If the
>>>>>>>> initial memory values were encoded as a UTF-8 string (and retrieved
>>>>>>>> at runtime using String.charCodeAt), there would be only 1-2 bytes
>>>>>>>> per "entry" (= byte on the heap); on average, if memory init values
>>>>>>>> are uniformly distributed, 1.5 bytes. Of course, that would produce
>>>>>>>> non-printable characters in the generated JS file.
>>>>>>>> Not sure if all JS interpreters would like that. If not, base64
>>>>>>>> (or basE91 for less overhead - see http://base91.sourceforge.net/)
>>>>>>>> would still use up less space in the JS file.
>>>>>>>>
>>>>>>>> If no one objects, I would work on implementing the latter.
>>>>>>>>
>>>>>>>> Soeren
>>>>>>>>
>>>>>>>> --
>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>> Groups "emscripten-discuss" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>> send an email to [email protected].
>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Chad Austin
>>>>>>> Technical Director, IMVU
>>>>>>> http://engineering.imvu.com <http://www.imvu.com/members/Chad/>
>>>>>>> http://chadaustin.me
>>>>>>>
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to a topic in the
>>>>> Google Groups "emscripten-discuss" group.
>>>>> To unsubscribe from this topic, visit
>>>>> https://groups.google.com/d/topic/emscripten-discuss/ZmEdtOXH3QQ/unsubscribe.
>>>>> To unsubscribe from this group and all its topics, send an email to
>>>>> [email protected].
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>> Soeren Balko, PhD
>>>>> Founder & Director
>>>>> zfaas Pty Ltd
>>>>> Brisbane, QLD
>>>>> Australia
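
For readers of the archive, the escaping scheme discussed in this thread (printable bytes kept verbatim, short escapes for common control characters, and a hex escape wherever "\0" would be followed by a digit) can be sketched roughly as below. This is only an illustration and not the code from the pull requests: the names encodeBytesAsLiteral and decodeLiteral are made up for this example, and for simplicity it uses \xYZ for all remaining control bytes instead of the octal escapes mentioned above. The decoder relies on eval and is meant for the round-trip demo only.

```javascript
// Hypothetical sketch, NOT the emscripten implementation.
// Two-character escapes for control bytes, plus the quote and backslash
// that must always be escaped inside a double-quoted literal.
const SHORT_ESCAPES = {
  8: '\\b', 9: '\\t', 10: '\\n', 11: '\\v', 12: '\\f', 13: '\\r',
  34: '\\"', 92: '\\\\',
};

function isDigit(byte) {
  return byte >= 48 && byte <= 57; // '0'..'9'
}

// Turn an array of byte values (0..255) into the source text of a JS
// string literal whose character codes equal those byte values.
function encodeBytesAsLiteral(bytes) {
  let out = '"';
  for (let i = 0; i < bytes.length; i++) {
    const b = bytes[i];
    const next = i + 1 < bytes.length ? bytes[i + 1] : -1;
    if (b in SHORT_ESCAPES) {
      out += SHORT_ESCAPES[b];
    } else if (b === 0) {
      // "\0" followed by a digit would be parsed as a longer octal
      // escape, so fall back to the fixed-length hex form in that case.
      out += isDigit(next) ? '\\x00' : '\\0';
    } else if (b < 32 || (b >= 127 && b < 160)) {
      // Remaining control bytes: fixed-length hex escape (a simplification).
      out += '\\x' + b.toString(16).padStart(2, '0');
    } else {
      // Printable ASCII and U+00A0..U+00FF are emitted as-is.
      out += String.fromCharCode(b);
    }
  }
  return out + '"';
}

// Demo-only decoder: evaluate the literal, then read back the char codes.
function decodeLiteral(literal) {
  const s = (0, eval)(literal);
  return Uint8Array.from(s, (ch) => ch.charCodeAt(0));
}

console.log(encodeBytesAsLiteral([72, 105, 0, 49, 10])); // prints "Hi\x001\n"
```

Note that the bytes in U+00A0...U+00FF only cost 2 bytes each if the generated file is actually saved and served as UTF-8.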
