Re: [elixir-core:11552] Re: [Proposal] U+FFFD Substitution of Maximal Subparts

Kip Sat, 07 Oct 2023 13:51:39 -0700

Cameron, I think this would be a useful addition to the Unicode library 
<https://github.com/elixir-unicode/unicode> that I maintain. If that works for 
you, please open an issue there and we can collaborate. Making it part of 
the Erlang `:unicode` module makes good sense too, as José says, but 
that's a longer "sales" and implementation cycle.
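
For anyone following along, here is a rough sketch of the idea quoted further 
down the thread: a validate function with a `:replace_invalid` option, built 
on String.chunk/2 and String.valid?/1. The module name `ValidateSketch` and 
the function `validate/2` are hypothetical and exist nowhere today. Note that 
this is a simplification: it emits one replacement per run of invalid bytes, 
not one per maximal subpart as the Unicode standard suggests, and it is not 
tuned for speed.

```elixir
defmodule ValidateSketch do
  # Hypothetical module for illustration only; it is not part of Elixir,
  # Erlang/OTP, or the Unicode library.
  @replacement "\uFFFD"

  # Returns `binary` with each run of invalid bytes replaced by the value of
  # the `:replace_invalid` option (assumed default: U+FFFD).
  def validate(binary, opts \\ []) when is_binary(binary) do
    replacement = Keyword.get(opts, :replace_invalid, @replacement)

    binary
    # Split into alternating chunks of valid and invalid bytes.
    |> String.chunk(:valid)
    # Keep valid chunks as they are; swap each invalid chunk for the replacement.
    |> Enum.map(fn chunk ->
      if String.valid?(chunk), do: chunk, else: replacement
    end)
    |> IO.iodata_to_binary()
  end
end
```

With this sketch, `ValidateSketch.validate(<<"a", 0xFF, 0xFE, "b">>)` returns 
"a\uFFFDb": the two invalid bytes form one run and get a single U+FFFD. A 
maximal-subparts implementation would treat some runs differently, for 
example emitting two replacements for `<<0xE2, 0x82, 0xFF>>`.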


On Saturday, October 7, 2023 at 7:40:50 PM UTC+11 José Valim wrote:

> Hi Cameron,
>
> If the goal is to include this handling for UTF-16 and UTF-32, I suggest 
> proposing this to Erlang/OTP as new functions in the "unicode" module. 
> Otherwise, Elixir only has facilities to deal with UTF-8. You could propose 
> such a feature in their issues tracker.
>
> Also note that "rolling your own" or "depending on packages" is usually 
> not reason enough for adding features to Elixir. Otherwise, one could 
> easily argue Decimal and Jason would be more important additions to the 
> language. :) We do describe which features we would consider part of the 
> language here: https://elixir-lang.org/development.html
>
> Other than that, awesome job on the library and benchmarks. :)
>
> On Sat, Oct 7, 2023 at 1:03 AM Kip <kipc...@gmail.com> wrote:
>
>> Your implementation is definitely fast and memory efficient so I retract 
>> my implementation comments. Now that I've run the benchmarking script and 
>> tested out a few different approaches leveraging the std lib I understand 
>> better why you've taken the approach you have. Nice work.
>>
>> On Saturday, October 7, 2023 at 9:26:37 AM UTC+11 Kip wrote:
>>
>>> Cameron, I think this is a useful proposal. Elixir has a means to check 
>>> validity (String.valid?/1) and a mechanism to split a string into valid 
>>> and invalid chunks (String.chunk/2 with the :valid trait). But there isn't, 
>>> to my knowledge, a means to coerce validity. A couple of thoughts:
>>>
>>> 1. Since Elixir strings are, by definition, UTF-8, I don't know that 
>>> special handling of UTF-16 and UTF-32 code points makes much sense, although 
>>> I accept this may be more Unicode compliant.
>>> 2. What would the function be called? Since we have String.valid?/1, 
>>> maybe String.validate/2 with an option `replace_invalid: utf8_string`. The 
>>> default `:replace_invalid` could be U+FFFD or it could be `nil`. If 
>>> the default is `nil` then there could also be a `String.validate!/2` that 
>>> raises if there is no `:replace_invalid` option.
>>> 3. I think the implementation could leverage the code of 
>>> `String.chunk/2` which uses `String.next_codepoint/1`. That would simplify 
>>> implementation and be more consistent in code style.
>>>
>>> On Friday, October 6, 2023 at 12:24:28 PM UTC+11 cameron...@gmail.com 
>>> wrote:
>>>
>>>> As far as I can tell, neither Elixir nor Erlang has a built-in 
>>>> function for replacing invalid sequences in Unicode. There's a suggested 
>>>> method on this page 
>>>> <https://www.unicode.org/versions/Unicode15.0.0/UnicodeStandard-15.0.pdf#page=153>
>>>> of the Unicode standard for handling this. Several other languages (Go 
>>>> <https://pkg.go.dev/bytes#ToValidUTF8>, Python 
>>>> <https://docs.python.org/3/library/stdtypes.html#bytes.decode>, C# 
>>>> <https://github.com/dotnet/docs/issues/13547>, etc.) now follow this 
>>>> spec.
>>>>
>>>> Invalid Unicode is encountered frequently enough that I think it's worth 
>>>> incorporating a solution into Elixir itself. 
>>>>
>>>> Present alternatives for handling invalid Unicode (and JSON by extension 
>>>> <https://github.com/michalmuskala/jason/issues/174>) are:
>>>>
>>>>    - Crashing (not ideal in many cases) 
>>>>    - Rolling your own (a lot of overhead for accidental complexity)
>>>>    - Depending on a package (+1 package towards dependency hell)
>>>>
>>>> This is my college try 
>>>> <https://github.com/Moosieus/UniRecover/tree/main>, but I'm certain 
>>>> there's a performant and far cleaner solution to be had in pure Elixir. If 
>>>> not, perhaps this is a request for OTP.
>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"elixir-lang-core" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elixir-lang-core+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elixir-lang-core/3d19d843-ba6d-4c84-9942-c0ee9129be01n%40googlegroups.com.
