On Tue, Jun 11, 2019 at 1:44 PM Branko Čibej <br...@apache.org> wrote:
> We either reserve about 2x buffers for file name transliteration in heap
> per thread, or we use the thread stack. As long as we trust that our utf-8
> to ucs-2 logic is rock solid and the allocations and limits are correctly
> coded, this continues to be a safe approach.
>
>
> Apropos of that, for 2.0 we're about to or have already ditched support
> for versions of Windows that do not have native UTF-8/UTF-16 conversions
> (ah, yes ... Windows has finally moved from UCS-2 to UTF-16). Wouldn't this
> be the right time to switch to using Windows' functions instead of staying
> with our own? Especially since, with the transition to UTF-16, we have to
> deal correctly with surrogate pairs, something our current code (IIRC)
> doesn't do.
>

That is a bit of a misnomer. The code is full of references to ucs-2 with surrogate pair support, and the combination of the two is utf-16. The comments can be refreshed to today's utf-16 nomenclature.

Today's logic remains correct and does the right thing with surrogates, because an unpaired surrogate value encoded in utf-8 would be very broken and possibly even a security issue, much as decoding other invalid utf-8 bytestreams proved to be.

If you want to look at the win32 APIs, feel free to benchmark, though I doubt they would outperform the current implementation.
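To make the surrogate point concrete, here is a minimal sketch of a utf-8 to utf-16 decoder in C. It is not our converter: the name sketch_utf8_to_utf16, its signature, and its -1 error convention are made up for illustration. It only shows the two behaviours discussed above, namely splitting code points above U+FFFF into a surrogate pair and refusing surrogate values (and overlong forms) that arrive encoded in utf-8.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for illustration only, not APR's converter.
 * Decodes inlen bytes of UTF-8 into UTF-16 code units.
 * Returns the number of units written, or -1 on malformed input
 * or when the output buffer is too small. */
static long sketch_utf8_to_utf16(const unsigned char *in, size_t inlen,
                                 uint16_t *out, size_t outcap)
{
    size_t i = 0, o = 0;

    while (i < inlen) {
        uint32_t cp = in[i++];
        int extra;      /* continuation bytes still expected */
        uint32_t min;   /* smallest code point legal for this length */

        if (cp < 0x80)                { extra = 0; min = 0; }
        else if ((cp & 0xE0) == 0xC0) { extra = 1; min = 0x80;    cp &= 0x1F; }
        else if ((cp & 0xF0) == 0xE0) { extra = 2; min = 0x800;   cp &= 0x0F; }
        else if ((cp & 0xF8) == 0xF0) { extra = 3; min = 0x10000; cp &= 0x07; }
        else return -1; /* stray continuation byte or invalid lead byte */

        if ((size_t)extra > inlen - i) return -1;   /* truncated sequence */
        while (extra-- > 0) {
            unsigned char c = in[i++];
            if ((c & 0xC0) != 0x80) return -1;      /* not a continuation */
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min) return -1;                     /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF) return -1; /* surrogate in UTF-8 */
        if (cp > 0x10FFFF) return -1;                /* beyond Unicode */

        if (cp < 0x10000) {
            if (o + 1 > outcap) return -1;
            out[o++] = (uint16_t)cp;
        } else {
            /* Supplementary plane: emit a high/low surrogate pair. */
            if (o + 2 > outcap) return -1;
            cp -= 0x10000;
            out[o++] = (uint16_t)(0xD800 | (cp >> 10));
            out[o++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    return (long)o;
}
```

With this sketch, the four bytes F0 9F 98 80 (U+1F600) come out as the pair D83D DE00, while ED A0 80 (a lone U+D800 encoded in utf-8) is rejected. Worst-case sizing also falls out of it: each input byte yields at most one utf-16 unit, which is the kind of bound the buffer reservation above depends on.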