If you want to do the code-page conversion on the JS side then I think the easiest option is to call a JS function which directly reads the bytes from the global HEAPU8 array view (which is the unsigned-byte-view into the emscripten C heap).
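A minimal sketch of that approach, not code from this thread: the function walks a NUL-terminated byte string in an unsigned byte view and maps each byte through a lookup table. Here `heap` is a plain Uint8Array standing in for emscripten's HEAPU8 (in a real build you would pass the HEAPU8 view, assuming your build exposes it), and `CP437_HIGH` is a deliberately partial, illustrative table with only a handful of entries. Bytes without an entry fall back to U+FFFD.

```javascript
// Partial CP437 table for bytes >= 0x80 (illustrative only; a full table has 128 entries).
const CP437_HIGH = {
  0x81: 0x00fc, // ü
  0x84: 0x00e4, // ä
  0x8e: 0x00c4, // Ä
  0x94: 0x00f6, // ö
  0x99: 0x00d6, // Ö
  0x9a: 0x00dc, // Ü
  0xe1: 0x00df, // ß
};

// heap: a Uint8Array view of the C heap (stand-in for HEAPU8).
// ptr:  byte offset of a NUL-terminated string inside that view.
function cp437ToString(heap, ptr) {
  let str = '';
  for (let i = ptr; i < heap.length; i++) {
    const b = heap[i];   // always 0..255 when read through an unsigned view
    if (b === 0) break;  // C string terminator
    // Simplification: bytes below 0x80 are treated as plain ASCII.
    const codePoint = b < 0x80 ? b : (CP437_HIGH[b] || 0xfffd);
    str += String.fromCharCode(codePoint);
  }
  return str;
}

// Simulated heap: the CP437 bytes for "für" placed at offset 4.
const fakeHeap = new Uint8Array(16);
fakeHeap.set([0x66, 0x81, 0x72, 0x00], 4);
const decoded = cp437ToString(fakeHeap, 4);
console.log(decoded);
```

Because the bytes are read through an unsigned view, no `& 0xff` mask is needed before the table lookup.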
A pointer on the C side is simply a 32-bit index into the HEAPU8 array. I'm doing something similar here:

https://github.com/floooh/sokol/blob/441f952b36f67ce446f1f21c22dcc344b7f21ed8/sokol_audio.h#L1261

...except that instead of accessing the unsigned-byte view I'm accessing the float view (HEAPF32) to copy audio samples from the emscripten heap into a WebAudio buffer.

This way at least you have fewer things that can go wrong, and you can be quite sure that the values you're reading are between 0 and 255. (I'm not sure why you'd be getting out-of-bounds values from the getValue function; if this is a bug it might be worth writing an emscripten ticket.)

Cheers,
-Floh.

On Saturday, 9 March 2019 18:30:39 UTC+1, Juergen Wothke wrote:
>
> I don't see why anyone would want to go back into the stone age and fiddle with legacy C code memory management and non-existent string support when I can handle that stuff easily on the JavaScript side.. provided some emscripten API lets me access the respective raw data without fucking it up beyond recognition (as Pointer_stringify() or UTF8ToString() do).
>
> As I mentioned above, this.Module.getValue(ptr++, 'i8', true); already seems to be a suitable API to deal with this scenario (the only problem is to find it)!
>
> From what you said, the text that is currently in the above docs is outdated anyway, see:
>
>> "Strings in JavaScript must be converted to pointers for compiled code – the relevant function is Pointer_stringify(), which given a pointer returns a JavaScript string"
>
> So when that doc is updated it would be a good idea to add some extra info for those people that DON'T HAVE UTF-8 input. Explain how to use getValue(), e.g.
>
>     -s EXTRA_EXPORTED_RUNTIME_METHODS="['getValue']"
>
> PS: I still don't understand why an "i8" can be > 0xff !
>
> Cheers,
> Jürgen
>
> On Monday, 4 March 2019 14:44:27 UTC+1, Floh wrote:
>>
>> AFAIK Pointer_stringify() has been deprecated in favour of a function called UTF8ToString(), which takes a UTF-8-encoded string in the emscripten HEAP and returns a JS string; maybe the docs haven't been updated yet. But I think (but may be wrong) it's just a renaming, and that Pointer_stringify() could already deal with UTF-8 strings before.
>>
>> Since ASCII is a subset of UTF-8, this would also work for proper (7-bit) ASCII strings.
>>
>> 8-bit characters with code page encoding are a different topic though. Since code pages are pretty much legacy, and completely unknown in the web world, I would personally prefer to not have extra code-page-aware string functions in the emscripten API. Instead I would convert the strings on the C side first from a specific code page encoding into generic UTF-8 before handing them over to JS.
>>
>> Cheers,
>> -Floh.
>>
>> On Monday, 4 March 2019 12:43:33 UTC+1, Juergen Wothke wrote:
>>>
>>> I often have the situation (e.g. see
>>> http://www.wothke.ch/playmod/?file=/modules/Ad%20Lib/AMusic/Admiral/mein%20erster%20versuch%20!!!.amd)
>>> that some legacy C program delivers some char*-based string and that original char buffer may be using all kinds of weird character encoding schemes (ASCII, codepage 437, whatever..).
>>>
>>> What all these text buffers have in common is that Pointer_stringify is completely unsuitable to deal with them. And yet Pointer_stringify seems to be the ONLY API properly advertised in the emscripten docs (see
>>> https://emscripten.org/docs/porting/connecting_cpp_and_javascript/Interacting-with-code.html).
>>>
>>> Even though there actually seem to be undocumented functions available (like AsciiToString, UTF8ToString, UTF16ToString, etc?) that might actually be useful - at least in some of those scenarios - many people are probably unaware that they exist. At one point I had actually started to base64-encode my texts just so that I would be able to retrieve the original uncorrupted data on the JavaScript side ... which is just ridiculous..
>>>
>>> The last hack I used for codepage 437 encoded strings looked like this:
>>>
>>>     this.codeMap = [ // codepage 437 used by PC DOS and MS-DOS
>>>       ....
>>>     ];
>>>
>>>     cp437ToString: function(ptr) { // Pointer_stringify replacement: MS-DOS text to Unicode
>>>       var str = '';
>>>       while (1) {
>>>         var ch = this.Module.getValue(ptr++, 'i8', true);
>>>         if (!ch) return str;
>>>         str += String.fromCharCode(this.codeMap[ch & 0xff]);
>>>       }
>>>     },
>>>
>>> Either I just missed the relevant docs for emscripten functions that would be useful in these kinds of scenarios - in which case the docs should maybe be improved. Or if the functionality is actually not there then I wonder why - since I can hardly be the only person dealing with this kind of scenario.
>>>
>>> PS: I am also surprised by the Module.getValue(ptr++, 'i8', true); function: 'i8' seems to suggest that I should be getting an 8-bit integer and yet the returned values are sometimes bigger than 0xff! ??
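On the signed versus unsigned reads discussed in this thread: the sketch below does not reproduce values above 0xff (the cause of that is left open here). It only shows, with plain typed arrays standing in for the emscripten heap views, that the same byte reads as a negative number through a signed 8-bit view and as 0..255 through an unsigned one. That is why a `& 0xff` mask is needed after a signed read and is unnecessary with a HEAPU8-style read. It also shows that a float view is indexed in 4-byte elements, so a byte address has to be shifted right by 2. All names and the address are made up for the illustration.

```javascript
// One ArrayBuffer with several typed views, mirroring how the heap views share memory.
const buffer = new ArrayBuffer(16);
const signedView = new Int8Array(buffer);    // stand-in for HEAP8
const unsignedView = new Uint8Array(buffer); // stand-in for HEAPU8
const floatView = new Float32Array(buffer);  // stand-in for HEAPF32

// Store the CP437 byte for 'ü' and read it back through both 8-bit views.
unsignedView[0] = 0x81;
const asSigned = signedView[0];     // sign-extended: negative
const asUnsigned = unsignedView[0]; // 0..255
const masked = asSigned & 0xff;     // mask recovers the unsigned value

// A float view counts in 4-byte elements: byte address -> element index is ptr >> 2.
const ptr = 8; // hypothetical 4-byte-aligned byte address
floatView[ptr >> 2] = 0.5;
const sample = new Float32Array(buffer, ptr, 1)[0]; // same float, addressed by byte offset

console.log(asSigned, asUnsigned, masked, sample);
```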
