Could this be done automatically by the compiler, by replacing `String.length(str) >= 13` with `String.at(str, 12) != nil`?
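
To make the rewrite concrete: `String.at/2` is zero-based and returns `nil` past the end of the string, so a check for "at least `n` graphemes" looks at index `n - 1`. The sketch below shows the manual version of that rewrite. The module and function names are made up for illustration, and nothing here implies the compiler does this today.

```elixir
defmodule AtLeastSketch do
  # Hypothetical helper: true when `str` has at least `n` graphemes (n >= 1).
  # String.at/2 is zero-based and returns nil when the index is out of range.
  def at_least?(str, n) when is_binary(str) and is_integer(n) and n >= 1 do
    String.at(str, n - 1) != nil
  end
end

str = "hello, world!" # 13 graphemes

# Same answer as String.length(str) >= 13
IO.inspect(AtLeastSketch.at_least?(str, 13))
# Same answer as String.length(str) >= 14
IO.inspect(AtLeastSketch.at_least?(str, 14))
```

As far as I can tell, `String.at/2` with a non-negative index only walks the string up to the requested position, which is what would make such a rewrite worthwhile.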
On Saturday, June 25, 2016 at 4:54:26 PM UTC+2, Alvin Lindstam wrote:
>
> Checking the length of a String requires traversing the whole string. A
> fairly common use case is checking whether a String is longer or shorter
> than a given value. I might not care whether the string is 123 456 or
> 654 321 graphemes long if all I want to know is whether it is longer than
> 50 000 characters, yet the full length still has to be calculated to find
> out.
>
> I'd propose adding String.longer?, which calculates the length up to a
> given limit and returns a boolean indicating whether the string is longer
> than the limit. It could of course also be used to see if the string is
> shorter, longer-or-equal-to or shorter-or-equal-to:
>
>> String.length(string) > 10 == String.longer?(string, 10)
>> String.length(string) >= 10 == String.longer?(string, 10 - 1)
>> String.length(string) <= 10 == !String.longer?(string, 10)
>> String.length(string) < 10 == !String.longer?(string, 10 - 1)
>
> I'm not sure about the naming of the function, please suggest a different
> one if you'd like.
>
> The proposal is implemented here:
> https://github.com/elixir-lang/elixir/compare/master...alvinlindstam:string-longer.
>
> I'd be happy to send a PR if the proposal is accepted.
>
> *Alternatives*
> There is nothing stopping a user from implementing this themselves, since
> next_grapheme_size is public, but I'd guess it's such a hassle that few
> would do it.
>
> There is no way to use this to check if a string's length is equal to a
> certain limit, or within a given range. We could use a more verbose API or
> more functions if that is desired.
>
> One alternative is to add an optional limit parameter to String.length,
> which always returns the string's length if below the limit, but returns
> the limit or some atom if it's longer. It would be slightly more verbose to
> check the length, but enables checks for a given value or range (while
> still preventing unnecessary calculations).
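
For readers who want to try the idea without the linked branch, here is a minimal user-land sketch of a bounded length check built on the public `String.next_grapheme_size/1`. It is not the code from the proposal; the module name `LongerSketch` and the private helper are assumptions made for illustration.

```elixir
defmodule LongerSketch do
  # Sketch of the proposed behaviour: true when `string` has more than
  # `limit` graphemes. Traversal stops as soon as the answer is known.
  def longer?(string, limit)
      when is_binary(string) and is_integer(limit) and limit >= 0 do
    do_longer?(String.next_grapheme_size(string), limit)
  end

  # The string ended before the limit was exceeded.
  defp do_longer?(nil, _remaining), do: false

  # `limit` graphemes were already consumed and one more exists.
  defp do_longer?({_size, _rest}, 0), do: true

  # Consume one grapheme and keep going with a smaller budget.
  defp do_longer?({_size, rest}, remaining) do
    do_longer?(String.next_grapheme_size(rest), remaining - 1)
  end
end

IO.inspect(LongerSketch.longer?("abc", 2)) # 3 graphemes, limit 2
IO.inspect(LongerSketch.longer?("ab", 2))  # 2 graphemes, limit 2
```

With this shape, `String.length(s) >= n` for `n >= 1` corresponds to `LongerSketch.longer?(s, n - 1)`, matching the equivalences listed in the proposal.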
>
> *Benchmarks*
> With my implementation, I get the following output from a simple
> benchmark. String.longer? seems to be a few percent slower than
> String.length for short strings when the length is not above the limit,
> but its cost only grows linearly up to the given limit. The tests are
> named by function, string length and limit.
>
> Warning: The function you are trying to benchmark is super fast, making
> time measures unreliable!
> Benchee won't measure individual runs but rather run it a couple of times
> and report the average back. Measures will still be correct, but the
> overhead of running it n times goes into the measurement. Also statistical
> results aren't as good, as they are based on averages now. If possible,
> increase the input size so that an individual run takes more than 10μs
>
> Name                                 ips     average    deviation      median
> string.length, 10             1012798.94      0.99μs   (±305.69%)      0.90μs
> string.longer?, 10, 10        1005594.61      0.99μs   (±184.19%)      0.90μs
> string.longer?, 10000, 10      896158.77      1.12μs   (±364.52%)      1.00μs
> string.longer?, 10000, 5000      2298.79    435.01μs    (±44.71%)    390.00μs
> string.length, 10000             1241.32    805.59μs    (±25.77%)    761.00μs
>
> Comparison:
> string.length, 10             1012798.94
> string.longer?, 10, 10        1005594.61 - 1.01x slower
> string.longer?, 10000, 10      896158.77 - 1.13x slower
> string.longer?, 10000, 5000      2298.79 - 440.58x slower
> string.length, 10000             1241.32 - 815.90x slower
>
> *Further optimizations*
> I considered checking the byte_size of the string, hoping to find
> conditions when we could say for sure what the results would be.
>
> I planned to return false if the byte_size was below the limit, since that
> would mean that there are fewer codepoints than the limit. But I'm not sure
> there are no situations where a codepoint could produce more than one
> grapheme.
>
> I also planned to return true if byte_size was more than 4 times the
> limit, since each codepoint uses at most four bytes.
> But since a grapheme could use multiple codepoints, it could also use more
> than four bytes, and I'm not sure what the upper limit is (if there is any).
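
On the lower-bound shortcut: as far as I understand Unicode segmentation, grapheme boundaries only fall between codepoints, so a single codepoint never yields more than one grapheme, and every codepoint occupies at least one byte. That would make the grapheme count at most `byte_size/1`, so the "return false early" case looks safe. The upper-bound case seems harder, since combining marks can stack on a base character and I am not aware of a fixed maximum. A sketch of the safe half, with a hypothetical module name and `String.length/1` standing in for a bounded traversal:

```elixir
defmodule ByteSizeShortcutSketch do
  # Hypothetical wrapper: skips the traversal when the byte count alone
  # already rules out "longer than limit".
  def longer?(string, limit)
      when is_binary(string) and is_integer(limit) and limit >= 0 do
    if byte_size(string) <= limit do
      # At most `limit` bytes means at most `limit` graphemes.
      false
    else
      # Fallback: a full count here; a bounded walk would be the real goal.
      String.length(string) > limit
    end
  end
end

IO.inspect(ByteSizeShortcutSketch.longer?("abc", 3))    # shortcut applies
IO.inspect(ByteSizeShortcutSketch.longer?("\u00E9", 1)) # 2 bytes, 1 grapheme
```

`byte_size/1` is constant time on binaries, so the extra check should cost very little even when the shortcut does not apply.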
