@Alon: Found it (emscripten-fastcomp/lib/Target/JSBackend/JSBackend.cpp) 
and will add the feature myself. I would suggest hiding it behind a flag 
like "-s UTF8_MEMORY=1" or so. 

Soeren

On Wednesday, December 24, 2014 1:14:01 PM UTC+10, Soeren Balko wrote:
>
> I just submitted a pull request, which extends the "allocate" function to 
> accept static memory defined as a UTF-8 string, where the Unicode 
> code points of the characters are the byte values: 
> https://github.com/kripken/emscripten/pull/3106
>
> In order to replace the current representation of static memory as 
> JavaScript arrays with compact UTF-8 strings (see my previous post), I 
> created a "poor man's solution": a simple Node script that runs regular 
> expressions over the emscripten-generated JavaScript "binary" and replaces 
> all "allocate([...], ...)" calls with "allocate("...", ...)". The resulting 
> reduction in code size is quite noticeable, though I did not measure the 
> impact on parsing times: 
> https://gist.github.com/anonymous/74196a36efbb4733a6f5
>
> @Alon: Obviously, that functionality should be integrated into emscripten 
> itself. However, after the change to the LLVM backend, I haven't bothered 
> finding my way in there. Can you please suggest where to look (or simply 
> incorporate the functionality yourself, if that's a quick addition)?
>
> Happy holidays everyone,
> Soeren
>
> On Tuesday, December 23, 2014 9:26:50 AM UTC+10, Soeren Balko wrote:
>>
>> Another (minor) optimization is to use the standard JavaScript escapes 
>> \t, \b, \f, \n, \r, and \v (2 bytes each) instead of octal sequences (3 
>> bytes, unless the next character is a digit, in which case the fixed-length 
>> [4-byte] hex \xYZ encoding must be used). 
>>
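To make the escaping rules above concrete, here is a minimal sketch of an encoder that follows them. The names bytesToLiteral and literalToBytes are hypothetical and are not taken from the pull request or the gist. The sketch assumes a double-quoted literal, so it also escapes the quote and backslash characters, and it uses the shortest octal form (so \1 rather than \001).

```javascript
// Hypothetical helper: turn an array of byte values into a double-quoted
// JavaScript string literal whose character codes equal the byte values.
function bytesToLiteral(bytes) {
  // Two-character escapes for \b \t \n \v \f \r.
  const SHORT = { 8: "\\b", 9: "\\t", 10: "\\n", 11: "\\v", 12: "\\f", 13: "\\r" };
  let out = '"';
  for (let i = 0; i < bytes.length; i++) {
    const b = bytes[i];
    const next = bytes[i + 1]; // undefined past the end, so the test below is false
    const nextIsDigit = next >= 0x30 && next <= 0x39;
    const hex = "\\x" + (b < 16 ? "0" : "") + b.toString(16);
    if (b === 0x22 || b === 0x5c) {
      out += "\\" + String.fromCharCode(b);   // escape " and \
    } else if (SHORT[b] !== undefined) {
      out += SHORT[b];
    } else if (b < 0x20) {
      // Octal escape, unless a following digit would be absorbed into it.
      out += nextIsDigit ? hex : "\\" + b.toString(8);
    } else if (b >= 0x7f && b <= 0x9f) {
      out += hex;                             // DEL and C1 controls as \xYZ
    } else {
      out += String.fromCharCode(b);          // emitted raw
    }
  }
  return out + '"';
}

// Hypothetical decoder for round-trip checks only. Indirect eval runs as
// non-strict code, which is required because octal escapes are rejected
// in strict mode.
function literalToBytes(literal) {
  const s = (0, eval)(literal);
  const bytes = new Uint8Array(s.length);
  for (let i = 0; i < s.length; i++) bytes[i] = s.charCodeAt(i);
  return bytes;
}

console.log(bytesToLiteral([72, 105, 0, 1, 50])); // prints "Hi\0\x012"
```

Note that the octal escapes mean the generated literal cannot live inside strict-mode code; using \xYZ everywhere would avoid that at the cost of a few more bytes.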
>> Generally though, I cannot confirm that the "ministr" memory 
>> representation is smaller than base64. In my case it is, in fact, larger. 
>> Assuming a uniform distribution of byte values, the ministr representation 
>> in UTF-8 uses:
>>
>> 1 byte for the 95 printable ASCII ("Basic Latin") characters with a Unicode 
>> code point between U+0020...U+007E - 37.1%
>> 2 bytes for the 96 "Latin 1 Supplement" characters with a Unicode code 
>> point between U+00A0...U+00FF - 37.5%
>> 2 bytes for the 7 JavaScript escape sequences \0, \t, \b, \f, \n, \r, \v 
>> (ignoring the \0 followed by digit case) - 2.7%
>> 3 bytes for the remaining 25 characters in octal representation between 
>> U+0001...U+001F - 9.8%
>> 4 bytes for the remaining 33 characters in hex representation between 
>> U+007F...U+009F - 12.9%
>>
>> So on average, we get some 1.98 bytes per character. In turn, base64 
>> uses 1.333 bytes per character (it only uses characters that use one byte 
>> in UTF-8), but produces a non-human-readable memory representation. For the 
>> existing int8-array representation, we get the following:
>>
>> 2 bytes for 10 characters in U+0000...U+0009 (one digit and one comma) - 
>> 3.9%
>> 3 bytes for 90 characters in U+000A...U+0063 (two digits and one comma) - 
>> 35.2%
>> 4 bytes for the remaining 156 characters in U+0064...U+00FF (three digits 
>> and one comma) - 60.9%
>>
>> On average, that yields 3.57 bytes per character. 
>>
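The averages quoted above can be re-derived with a few lines of code. This is only a sketch of the cost model as stated in the message (uniformly distributed byte values, every octal escape counted as 3 bytes, the quote and backslash characters not counted as escaped); the function names are made up for illustration.

```javascript
// Bytes of UTF-8 source needed per byte value in the "ministr" string form,
// using the simplified cost model from the message.
function ministrCost(b) {
  if (b >= 0x20 && b <= 0x7e) return 1;          // printable ASCII, emitted raw
  if (b >= 0xa0) return 2;                       // U+00A0..U+00FF, 2-byte UTF-8
  if (b === 0 || (b >= 8 && b <= 13)) return 2;  // \0 \b \t \n \v \f \r
  if (b < 0x20) return 3;                        // remaining controls, octal
  return 4;                                      // U+007F..U+009F as \xYZ
}

// Bytes needed per value in the plain array form: decimal digits plus a comma.
function arrayCost(b) {
  return String(b).length + 1;
}

// Expected cost per byte when all 256 values are equally likely.
function average(costFn) {
  let total = 0;
  for (let b = 0; b < 256; b++) total += costFn(b);
  return total / 256;
}

console.log(average(ministrCost).toFixed(3)); // prints 1.984
console.log(average(arrayCost).toFixed(3));   // prints 3.570
console.log((4 / 3).toFixed(3));              // base64: prints 1.333
```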
>> Of course, real-world static memory content is often skewed towards 
>> certain byte values, e.g. \0 and Latin-1 text characters. In those cases, 
>> the ministr approach may yield a more compact representation than base64. 
>> Other baseX approaches (notably: basE91) may be worth a try, but would 
>> need a potentially slow, pure JavaScript-based implementation. 
>>
>> In the program that I looked at (ffmpeg), the static memory content also 
>> seems to contain runs of identical byte values (often \0), which are 
>> amenable to a simple RLE encoding scheme that could be layered on top of 
>> the ministr encoding. Not sure if this is worthwhile, as this is 
>> essentially what gzip is doing anyway and it comes with a small runtime 
>> overhead to expand the RLE-encoded sequences.
>>
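For illustration, a run-length scheme of the kind described above could look like the following sketch. The (count, value) pair format and the function names are assumptions made for this example; nothing like it exists in emscripten, and a practical format would need a way to avoid doubling the size of non-repeating data.

```javascript
// Hypothetical RLE format: a flat list of (count, value) pairs,
// with count limited to 1..255 so that it fits into one byte.
function rleEncode(bytes) {
  const out = [];
  for (let i = 0; i < bytes.length; ) {
    let run = 1;
    while (i + run < bytes.length && bytes[i + run] === bytes[i] && run < 255) {
      run++;
    }
    out.push(run, bytes[i]);
    i += run;
  }
  return out;
}

// Expand the (count, value) pairs back into the original byte sequence.
function rleDecode(pairs) {
  const out = [];
  for (let i = 0; i < pairs.length; i += 2) {
    for (let n = 0; n < pairs[i]; n++) out.push(pairs[i + 1]);
  }
  return out;
}

console.log(rleEncode([0, 0, 0, 0, 7, 7, 1])); // prints [ 4, 0, 2, 7, 1, 1 ]
```

The decode loop is the "small runtime overhead" in question: it runs once at startup, before the expanded bytes are copied onto the heap.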
>> Soeren
>>
>> On Monday, December 22, 2014 3:13:10 PM UTC+10, Sören Balko wrote:
>>>
>>> I think the patch is here: https://gist.github.com/evanw/11339324
>>>
>>>
>>> On 22 Dec 2014, at 15:11, Chad Austin <[email protected]> wrote:
>>>
>>> On Sun, Dec 21, 2014 at 10:22 PM, Soeren Balko <[email protected]> wrote:
>>>
>>>> So far, my tryout implementation is based on a script that I run using 
>>>> --js-transform. It uses regular expressions to find integer arrays and 
>>>> replaces each of them with a base64 string and a function wrapper that 
>>>> turns it into an int8 array. I like the ministr approach as it preserves 
>>>> the (printable) byte sequences (thus benefiting readability of string 
>>>> literals) and apparently speeds up parsing. If only they had provided 
>>>> their escaping code for non-printable characters. 
>>>>
>>>
>>> Here is the code I wrote for my tests: 
>>> https://github.com/chadaustin/Web-Benchmarks/blob/master/meminit/meminit.py
>>>
>>> Evan pointed out that my code is incorrect in the case of an octal 
>>> escape followed by numeric digits, but I don't think he posted his code.
>>>  
>>>
>>>> Also, I still need to figure where exactly the "allocate([....], ...)" 
>>>> calls are generated and change the code in there.
>>>>
>>>> If only for the sake of speeding up the JS parser, I wonder if some 
>>>> basic inline RLE compression could be done as well. It would most probably 
>>>> not help with the gzipped file, but it would keep the uncompressed JS file 
>>>> smaller and potentially speed up parsing, at the expense of a small runtime 
>>>> overhead to expand the RLE-encoded byte sequences into a region on the heap. 
>>>>
>>>
>>> Hm, I wonder if the improved JS parse time would be offset by the more 
>>> complex decoding / startup JITting.  Probably worth measuring.
>>>
>>> Either way, a straight up string literal would be a huge improvement 
>>> over the status quo for people who can't or don't want to use a separate 
>>> meminit binary file.
>>>
>>> Thanks for investigating this.  :)
>>>
>>>
>>>> Soeren
>>>>
>>>>
>>>> On Monday, December 22, 2014 7:58:26 AM UTC+10, Chad Austin wrote:
>>>>>
>>>>> Hi Soeren,
>>>>>
>>>>> @evanw and I have done similar research in this issue: 
>>>>> https://github.com/kripken/emscripten/issues/2188
>>>>>
>>>>> If we represent the meminit block as a large string literal rather 
>>>>> than an array of 8-bit numbers, it would reduce code size by about 50%, 
>>>>> improve JavaScript parse time, AND make it more readable, as C string 
>>>>> literals would be visible in the output.
>>>>>
>>>>> Fixing this has been on our wishlist for some time and if you want to 
>>>>> take a crack at it, we would be thrilled!
>>>>>
>>>>> Let me know if there's anything we can do to help,
>>>>> Chad
>>>>>
>>>>>
>>>>> On Sat, Dec 20, 2014 at 11:48 PM, Soeren Balko <[email protected]> 
>>>>> wrote:
>>>>>
>>>>>> I played around with the separate memory init file and was surprised 
>>>>>> to see that it does, in fact, increase the total code size. The 
>>>>>> numbers I got are:
>>>>>>
>>>>>> * JS with inline memory initialization: 23186642 bytes
>>>>>> * JS and separate memory init file:  15250276+8988744 = 24239020 bytes
>>>>>>
>>>>>> That's a bit surprising to me as I would expect the binary memory 
>>>>>> init file to spend one byte per, well, byte in HEAP8. Also, the inline 
>>>>>> memory initializer is a plain JS array, which is unnecessarily large 
>>>>>> (each value takes 1-3 bytes per byte plus 1 byte for the comma). If the 
>>>>>> initial memory values were encoded as a UTF-8 string (and retrieved at 
>>>>>> runtime using String.charCodeAt), there would be only 1-2 bytes per 
>>>>>> "entry" (= byte on the heap); on average, if memory init values are 
>>>>>> uniformly distributed, 1.5 bytes. Of course, that would produce 
>>>>>> non-printable characters in the generated JS file. Not sure if all JS 
>>>>>> interpreters would like that. If not, base64 (or basE91 for less 
>>>>>> overhead - see http://base91.sourceforge.net/) would still use up less 
>>>>>> space in the JS file. 
>>>>>>
>>>>>> If no one objects, I would work on implementing the latter.
>>>>>>
>>>>>> Soeren
>>>>>>
>>>>>> -- 
>>>>>> You received this message because you are subscribed to the Google 
>>>>>> Groups "emscripten-discuss" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>> send an email to [email protected].
>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> -- 
>>>>> Chad Austin
>>>>> Technical Director, IMVU
>>>>> http://engineering.imvu.com <http://www.imvu.com/members/Chad/>
>>>>> http://chadaustin.me
>>>>>
>>>>>
>>>>>  
>>>>
>>>
>>>
>>> Soeren Balko, PhD
>>> Founder & Director
>>> zfaas Pty Ltd
>>> Brisbane, QLD
>>> Australia
>>>
>>>
>>>  
>>>
