Yeah, so I would go with the full line coverage route, with the "brute force"/property test included but commented out. If you would like to send a PR, it will be welcome. Thank you, and sorry for the delay!
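For anyone following along, here is a minimal sketch of what "enough tests for full line coverage" might look like. Everything named here is an assumption made for illustration: `ReplaceInvalidSketch` is a hypothetical stand-in written for this example (it is not the code linked in the thread), it takes a single argument and always substitutes U+FFFD, and it follows the maximal-subpart replacement practice recommended by the Unicode standard. Each test targets one clause of the scanner.

```elixir
defmodule ReplaceInvalidSketch do
  # Hypothetical stand-in, NOT the implementation linked in the thread.
  @replacement "\uFFFD"

  def replace_invalid(binary) when is_binary(binary), do: scan(binary, [])

  # A well-formed code point is copied through unchanged.
  defp scan(<<cp::utf8, rest::binary>>, acc), do: scan(rest, [acc, <<cp::utf8>>])

  # First three bytes of a 4-byte sequence whose fourth byte is missing or
  # invalid: the whole prefix becomes a single U+FFFD.
  defp scan(<<b1, b2, b3, rest::binary>>, acc)
       when b3 in 0x80..0xBF and
              ((b1 == 0xF0 and b2 in 0x90..0xBF) or
                 (b1 in 0xF1..0xF3 and b2 in 0x80..0xBF) or
                 (b1 == 0xF4 and b2 in 0x80..0x8F)),
       do: scan(rest, [acc, @replacement])

  # First two bytes of a 3- or 4-byte sequence that is cut short.
  defp scan(<<b1, b2, rest::binary>>, acc)
       when (b1 == 0xE0 and b2 in 0xA0..0xBF) or
              (b1 in 0xE1..0xEC and b2 in 0x80..0xBF) or
              (b1 == 0xED and b2 in 0x80..0x9F) or
              (b1 in 0xEE..0xEF and b2 in 0x80..0xBF) or
              (b1 == 0xF0 and b2 in 0x90..0xBF) or
              (b1 in 0xF1..0xF3 and b2 in 0x80..0xBF) or
              (b1 == 0xF4 and b2 in 0x80..0x8F),
       do: scan(rest, [acc, @replacement])

  # Any other byte is replaced on its own.
  defp scan(<<_byte, rest::binary>>, acc), do: scan(rest, [acc, @replacement])

  defp scan(<<>>, acc), do: IO.iodata_to_binary(acc)
end

ExUnit.start()

defmodule ReplaceInvalidSketchTest do
  use ExUnit.Case, async: true
  import ReplaceInvalidSketch, only: [replace_invalid: 1]

  test "well-formed input passes through unchanged" do
    assert replace_invalid("héllo ✓") == "héllo ✓"
  end

  test "a stray byte becomes one U+FFFD" do
    assert replace_invalid(<<?a, 0xFF, ?b>>) == "a\uFFFDb"
  end

  test "a truncated 3-byte sequence collapses to one U+FFFD" do
    assert replace_invalid(<<0xE1, 0xBB, ?x>>) == "\uFFFDx"
  end

  test "a truncated 4-byte sequence collapses to one U+FFFD" do
    assert replace_invalid(<<0xF0, 0x9F, 0x98>>) == "\uFFFD"
  end

  test "an encoded surrogate is replaced byte by byte" do
    assert replace_invalid(<<0xED, 0xA0, 0x80>>) == "\uFFFD\uFFFD\uFFFD"
  end
end
```

Saved as a `.exs` script, this should run directly with the `elixir` command. A brute-force or property test would sit alongside these cases, commented out as suggested earlier in the thread.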
On Tue, Oct 31, 2023 at 11:48 PM Cameron Duley <cameron.dule...@gmail.com> wrote:

> I'd originally looked for tests in the spec, browsers, and other languages
> to compare against.
>
> W3C's current test suite doesn't seem comprehensive:
>
> https://github.com/web-platform-tests/wpt/blob/master/encoding/replacement-encodings.any.js
>
> Go has a few token tests that are relatively intelligible:
>
> https://cs.opensource.google/go/go/+/refs/tags/go1.21.3:src/bytes/bytes_test.go;l=1157
>
> On Tue, Oct 31, 2023 at 2:09 PM José Valim <jose.va...@dashbit.co> wrote:
>
>> Does the specification provide tests for us to include? Otherwise we can
>> include enough tests for full line coverage and a “brute force”/property
>> test commented out.
>>
>> I would say the name “replace_invalid” is excellent.
>>
>> On Tue, Oct 31, 2023 at 18:52 Cameron Duley <cameron.dule...@gmail.com>
>> wrote:
>>
>>> This was the final version I'd landed on for UTF-8:
>>>
>>> https://github.com/elixir-unicode/unicode/blob/main/lib/unicode/validation/utf8.ex
>>>
>>> Along with the following modules for testing:
>>>
>>> https://github.com/elixir-unicode/unicode/blob/main/test/support/unicode_validation_helpers.ex
>>>
>>> https://github.com/elixir-unicode/unicode/blob/main/test/unicode_validation_test.exs
>>>
>>> I think it's ideal functionality to have in the String module, and the
>>> implementation's "reasonable enough" until a native solution's available in
>>> OTP.
>>>
>>> Testing is my only uncertainty - How much is prudent and in what manner?
>>>
>>> On Tuesday, October 31, 2023 at 12:35:48 PM UTC-4 José Valim wrote:
>>>
>>>> Folks, I am following up on this, where did we land?
>>>>
>>>> The new implementation is roughly ~70LOC for UTF-8, so at first I don't
>>>> see an issue with adding it to Elixir. However, the Elixir version would be
>>>> UTF-8 only (part of the String module).
>>>>
>>>> Thoughts?
>>>>
>>>> On Saturday, October 7, 2023 at 10:51:34 PM UTC+2 Kip wrote:
>>>>
>>>>> Cameron, I think this would be a useful addition to the Unicode
>>>>> library <https://github.com/elixir-unicode/unicode> I maintain. If
>>>>> that works for you, please open an issue there and we can collaborate. I
>>>>> think it being part of the Erlang `:unicode` module makes good sense too
>>>>> as José says but that's a longer "sales" and implementation cycle.
>>>>>
>>>>> On Saturday, October 7, 2023 at 7:40:50 PM UTC+11 José Valim wrote:
>>>>>
>>>>>> Hi Cameron,
>>>>>>
>>>>>> If the goal is to include this handling for UTF-16 and UTF-32, I
>>>>>> suggest proposing this to Erlang/OTP as new functions in the "unicode"
>>>>>> module. Otherwise, Elixir only has facilities to deal with UTF-8. You
>>>>>> could propose such a feature in their issues tracker.
>>>>>>
>>>>>> Also note that "rolling your own" or "depending on packages" is
>>>>>> usually not enough reasons for adding features to Elixir. Otherwise, one
>>>>>> could easily argue Decimal and Jason would be more important additions to
>>>>>> the language. :) We do describe which features we would consider part of
>>>>>> the language here: https://elixir-lang.org/development.html
>>>>>>
>>>>>> Other than that, awesome job on the library and benchmarks. :)
>>>>>>
>>>>>> On Sat, Oct 7, 2023 at 1:03 AM Kip <kipc...@gmail.com> wrote:
>>>>>>
>>>>>>> Your implementation is definitely fast and memory efficient so I
>>>>>>> retract my implementation comments. Now that I've run the benchmarking
>>>>>>> script and tested out a few different approaches leveraging the std lib
>>>>>>> I understand better why you've taken the approach you have. Nice work.
>>>>>>>
>>>>>>> On Saturday, October 7, 2023 at 9:26:37 AM UTC+11 Kip wrote:
>>>>>>>
>>>>>>>> Cameron, I think this is a useful proposal.
>>>>>>>> Elixir has means to check validity (String.valid?/1) and a mechanism
>>>>>>>> to split valid and invalid code points (String.chunk/2 with the
>>>>>>>> :valid trait). But there isn't, to my knowledge, a means to coerce
>>>>>>>> validity. A couple of thoughts:
>>>>>>>>
>>>>>>>> 1. Since Elixir strings are, by definition, UTF8, I don't know that
>>>>>>>> special handling of UTF16 and UTF32 code points makes much sense -
>>>>>>>> although I accept this may be more Unicode compliant.
>>>>>>>> 2. What would the function be called? Since we have String.valid?/1
>>>>>>>> maybe String.validate/2 with an option `replace_invalid: utf8_string`.
>>>>>>>> The default `:replace_invalid` could be U+FFFD or it could be `nil`.
>>>>>>>> If the default is `nil` then there could also be a `String.validate!/2`
>>>>>>>> that raises if there is no `:replace_invalid` option.
>>>>>>>> 3. I think the implementation could leverage the code of
>>>>>>>> `String.chunk/2` which uses `String.next_codepoint/1`. That would
>>>>>>>> simplify implementation and be more consistent in code style.
>>>>>>>>
>>>>>>>> On Friday, October 6, 2023 at 12:24:28 PM UTC+11 cameron...@gmail.com wrote:
>>>>>>>>
>>>>>>>>> As far as I can tell, neither Elixir nor Erlang have a built in
>>>>>>>>> function for replacing invalid sequences in Unicode. There's a
>>>>>>>>> suggested method on this page
>>>>>>>>> <https://www.unicode.org/versions/Unicode15.0.0/UnicodeStandard-15.0.pdf#page=153>
>>>>>>>>> of the Unicode standard for handling this. Several other languages
>>>>>>>>> (Go <https://pkg.go.dev/bytes#ToValidUTF8>, Python
>>>>>>>>> <https://docs.python.org/3/library/stdtypes.html#bytes.decode>, C#
>>>>>>>>> <https://github.com/dotnet/docs/issues/13547>, etc) now follow
>>>>>>>>> this spec.
>>>>>>>>>
>>>>>>>>> Invalid Unicode's encountered frequently enough that I think it's
>>>>>>>>> worth incorporating a solution into Elixir itself.
>>>>>>>>>
>>>>>>>>> Present alternatives to handling invalid unicode (and json by
>>>>>>>>> extension <https://github.com/michalmuskala/jason/issues/174>)
>>>>>>>>> are:
>>>>>>>>>
>>>>>>>>> - Crashing (not ideal in many cases)
>>>>>>>>> - Roll your own (lot of overhead for accidental complexity)
>>>>>>>>> - Depend on a package (+1 package towards dependency hell)
>>>>>>>>>
>>>>>>>>> This is my college try
>>>>>>>>> <https://github.com/Moosieus/UniRecover/tree/main>, but I'm
>>>>>>>>> certain there's a performant and far cleaner solution to be had in
>>>>>>>>> pure Elixir. If not, perhaps this is a request for OTP.
>>>>>>>>>
>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "elixir-lang-core" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to elixir-lang-co...@googlegroups.com.
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/elixir-lang-core/197620a2-6a96-41c6-a6e7-5da03e351080n%40googlegroups.com
>>>>>>>
> --
> Thanks,
>
> Cameron Duley