[elixir-core:11550] Re: [Proposal] U+FFFD Substitution of Maximal Subparts

Kip Fri, 06 Oct 2023 16:03:25 -0700

Your implementation is definitely fast and memory efficient, so I retract my 
implementation comments. Now that I've run the benchmarking script and 
tested a few different approaches leveraging the standard library, I 
understand better why you've taken the approach you have. Nice work.
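
A minimal sketch of one standard-library approach, for illustration only. It follows the `String.chunk/2` idea from the quoted message below and is not Cameron's implementation. The module and function names (`ValidateSketch`, `replace_invalid/2`) are hypothetical. It emits one replacement per run of invalid bytes, which is coarser than the per-maximal-subpart substitution the proposal describes.

```elixir
defmodule ValidateSketch do
  # Hypothetical helper, not part of Elixir's String module.
  # Replaces every run of invalid UTF-8 bytes with `replacement`
  # (U+FFFD by default).
  def replace_invalid(binary, replacement \\ "\uFFFD") when is_binary(binary) do
    binary
    # Split into alternating runs of valid and invalid data.
    |> String.chunk(:valid)
    # Keep valid runs unchanged and swap each invalid run for the replacement.
    |> Enum.map_join(fn chunk ->
      if String.valid?(chunk), do: chunk, else: replacement
    end)
  end
end
```

Under these assumptions, `ValidateSketch.replace_invalid(<<"ab", 0xFF, "cd">>)` returns `"ab\uFFFDcd"`, and valid input passes through unchanged.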


On Saturday, October 7, 2023 at 9:26:37 AM UTC+11 Kip wrote:

> Cameron, I think this is a useful proposal.  Elixir has means to check 
> validity (String.valid?/1) and a mechanism to split valid and invalid code 
> points (String.chunk/2 with the :valid trait). But there isn't, to my 
> knowledge, a means to coerce validity.  A couple of thoughts:
>
> 1. Since Elixir strings are, by definition, UTF-8, I don't know that 
> special handling of UTF-16 and UTF-32 code points makes much sense, although 
> I accept this may be more Unicode compliant.
> 2. What would the function be called? Since we have String.valid?/1 maybe 
> String.validate/2 with an option `replace_invalid: utf8_string`. The 
> default `:replace_invalid` could be U+FFFD or it could be `nil`.   If the 
> default is `nil` then there could also be a `String.validate!/2` that 
> raises if there is no `:replace_invalid` option.
> 3. I think the implementation could leverage the code of `String.chunk/2`, 
> which uses `String.next_codepoint/1`. That would simplify the implementation 
> and be more consistent in code style.
>
> On Friday, October 6, 2023 at 12:24:28 PM UTC+11 cameron...@gmail.com 
> wrote:
>
>> As far as I can tell, neither Elixir nor Erlang has a built-in function 
>> for replacing invalid sequences in Unicode. There's a suggested method on 
>> this page 
>> <https://www.unicode.org/versions/Unicode15.0.0/UnicodeStandard-15.0.pdf#page=153>
>>  
>> of the Unicode standard for handling this. Several other languages (Go 
>> <https://pkg.go.dev/bytes#ToValidUTF8>, Python 
>> <https://docs.python.org/3/library/stdtypes.html#bytes.decode>, C# 
>> <https://github.com/dotnet/docs/issues/13547>, etc.) now follow this spec.
>>
>> Invalid Unicode is encountered frequently enough that I think it's worth 
>> incorporating a solution into Elixir itself. 
>>
>> The present alternatives for handling invalid Unicode (and JSON by extension 
>> <https://github.com/michalmuskala/jason/issues/174>) are:
>>
>>    - Crashing (not ideal in many cases) 
>>    - Roll your own (a lot of overhead for accidental complexity)
>>    - Depend on a package (+1 package towards dependency hell)
>>
>> This is my college try <https://github.com/Moosieus/UniRecover/tree/main>, 
>> but I'm certain there's a performant and far cleaner solution to be had in 
>> pure Elixir. If not, perhaps this is a request for OTP.
>>
>
