Cameron, I think this would be a useful addition to the Unicode library <https://github.com/elixir-unicode/unicode> that I maintain. If that works for you, please open an issue there and we can collaborate. Making it part of the Erlang `:unicode` module also makes good sense, as José says, but that's a longer "sales" and implementation cycle.

On Saturday, October 7, 2023 at 7:40:50 PM UTC+11 José Valim wrote:

> Hi Cameron,
>
> If the goal is to include this handling for UTF-16 and UTF-32, I suggest
> proposing this to Erlang/OTP as new functions in the "unicode" module.
> Otherwise, Elixir only has facilities to deal with UTF-8. You could propose
> such a feature in their issues tracker.
>
> Also note that "rolling your own" or "depending on packages" is usually
> not enough reasons for adding features to Elixir. Otherwise, one could
> easily argue Decimal and Jason would be more important additions to the
> language. :) We do describe which features we would consider part of the
> language here: https://elixir-lang.org/development.html
>
> Other than that, awesome job on the library and benchmarks. :)
>
> On Sat, Oct 7, 2023 at 1:03 AM Kip <kipc...@gmail.com> wrote:
>
>> Your implementation is definitely fast and memory efficient so I retract
>> my implementation comments. Now that I've run the benchmarking script and
>> tested out a few different approaches leveraging the std lib I understand
>> better why you've taken the approach you have. Nice work.
>>
>> On Saturday, October 7, 2023 at 9:26:37 AM UTC+11 Kip wrote:
>>
>>> Cameron, I think this is a useful proposal. Elixir has means to check
>>> validity (String.valid?/1) and a mechanism to split valid and invalid code
>>> points (String.chunk/2 with the :valid trait). But there isn't, to my
>>> knowledge, a means to coerce validity. A couple of thoughts:
>>>
>>> 1. Since Elixir strings are, by definition, UTF8, I don't know that
>>> special handling of UTF16 and UTF32 code points makes much sense - although
>>> I accept this may be more Unicode compliant.
>>> 2. What would the function be called? Since we have String.valid?/1
>>> maybe String.validate/2 with an option `replace_invalid: utf8_string`. The
>>> default `:replace_invalid` could be U+FFFD or it could be `nil`. If
>>> the default is `nil` then there could also be a `String.validate!/2` that
>>> raises if there is no `:replace_invalid` option.
>>> 3. I think the implementation could leverage the code of
>>> `String.chunk/2` which uses `String.next_codepoint/1`. That would simplify
>>> implementation and be more consistent in code style.
>>>
>>> On Friday, October 6, 2023 at 12:24:28 PM UTC+11 cameron...@gmail.com
>>> wrote:
>>>
>>>> As far as I can tell, neither Elixir nor Erlang have a built in
>>>> function for replacing invalid sequences in Unicode. There's a suggested
>>>> method on this page
>>>> <https://www.unicode.org/versions/Unicode15.0.0/UnicodeStandard-15.0.pdf#page=153>
>>>> of the Unicode standard for handling this. Several other languages (Go
>>>> <https://pkg.go.dev/bytes#ToValidUTF8>, Python
>>>> <https://docs.python.org/3/library/stdtypes.html#bytes.decode>, C#
>>>> <https://github.com/dotnet/docs/issues/13547>, etc) now follow this
>>>> spec.
>>>>
>>>> Invalid Unicode's encountered frequently enough that I think it's worth
>>>> incorporating a solution into Elixir itself.
>>>>
>>>> Present alternatives to handling invalid unicode (and json by extension
>>>> <https://github.com/michalmuskala/jason/issues/174>) are:
>>>>
>>>> - Crashing (not ideal in many cases)
>>>> - Roll your own (lot of overhead for accidental complexity)
>>>> - Depend on a package (+1 package towards dependency hell)
>>>>
>>>> This is my college try
>>>> <https://github.com/Moosieus/UniRecover/tree/main>, but I'm certain
>>>> there's a performant and far cleaner solution to be had in pure Elixir. If
>>>> not, perhaps this is a request for OTP.
>>>>
>>> --
>> You received this message because you are subscribed to the Google Groups
>> "elixir-lang-core" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to elixir-lang-co...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/elixir-lang-core/197620a2-6a96-41c6-a6e7-5da03e351080n%40googlegroups.com
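
For readers following the thread, here is a rough sketch of the `String.validate/2` idea from Kip's quoted message, built on `String.chunk/2` with the `:valid` trait. The module name `ValidateSketch` is hypothetical, and choosing U+FFFD as the default for `:replace_invalid` is an assumption (the thread leaves open whether the default should be `nil`). It is not the code from UniRecover, and it is simpler than the practice the Unicode standard recommends: each run of invalid bytes collapses into a single replacement rather than one replacement per maximal subpart.

```elixir
defmodule ValidateSketch do
  # Hypothetical module name; illustrates the String.validate/2 proposal only.
  # Assumption: U+FFFD is the default replacement.
  @default_replacement "\uFFFD"

  @doc """
  Returns `binary` with every run of invalid UTF-8 bytes replaced by the
  `:replace_invalid` option (default U+FFFD).
  """
  def validate(binary, opts \\ []) when is_binary(binary) do
    replacement = Keyword.get(opts, :replace_invalid, @default_replacement)

    binary
    # Splits the binary into alternating chunks of valid and invalid data.
    |> String.chunk(:valid)
    # Valid chunks pass through; each invalid chunk becomes one replacement.
    |> Enum.map(fn chunk ->
      if String.valid?(chunk), do: chunk, else: replacement
    end)
    |> IO.iodata_to_binary()
  end
end
```

With this sketch, `ValidateSketch.validate("abc" <> <<0xFF>> <> "def") == "abc\uFFFDdef"` holds, and a custom replacement can be passed as `ValidateSketch.validate(bytes, replace_invalid: "?")`. A `validate!/2` variant that raises when no replacement is given, as floated in the thread, would be a small wrapper around the same chunking step.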