Checking the length of a String requires traversing the whole string. A 
fairly common use case is checking whether the String is longer or shorter 
than a given value. I might not care whether the string is 123 456 graphemes 
long or 654 321 if all I want to know is whether it is longer than 50 000 
graphemes, but I still have to calculate the full length to find out.

I'd propose adding String.longer?, which calculates the length up to a given 
limit and returns a boolean indicating whether the string is longer than the 
limit. It could of course also be used to see if the string is shorter, 
longer-or-equal-to or shorter-or-equal-to:

String.length(string) > 10 == String.longer?(string, 10)
String.length(string) >= 10 == String.longer?(string, 10 - 1)
String.length(string) <= 10 == !String.longer?(string, 10)
String.length(string) < 10 == !String.longer?(string, 10 - 1)


I'm not sure about the naming of the function; please suggest a different 
one if you'd like.
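
To make the semantics above concrete, here is a minimal sketch of how such a 
function could behave. It is not the code from the branch linked below: the 
module name LongerSketch and the private helper are made up for illustration, 
and it is built on String.next_grapheme/1 simply because that keeps the 
example short.

```elixir
defmodule LongerSketch do
  # Hypothetical module for illustration only; not the proposed implementation.

  # Returns true if `string` has more than `limit` graphemes.
  # Stops walking the string as soon as the limit is exceeded.
  def longer?(string, limit)
      when is_binary(string) and is_integer(limit) and limit >= 0 do
    do_longer?(String.next_grapheme(string), limit)
  end

  # The string ended before the limit was exceeded.
  defp do_longer?(nil, _remaining), do: false

  # Another grapheme exists although the budget is used up.
  defp do_longer?({_grapheme, _rest}, 0), do: true

  # Consume one grapheme and keep going with a smaller budget.
  defp do_longer?({_grapheme, rest}, remaining) do
    do_longer?(String.next_grapheme(rest), remaining - 1)
  end
end
```

With this sketch, LongerSketch.longer?("abcd", 3) is true and 
LongerSketch.longer?("abc", 3) is false, and at most limit + 1 graphemes are 
ever inspected.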

The proposal is implemented 
here: 
https://github.com/elixir-lang/elixir/compare/master...alvinlindstam:string-longer.
 
I'd be happy to send a PR if the proposal is accepted.

*Alternatives*
There is nothing stopping a user from implementing this themselves, since 
next_grapheme_size is public, but I'd guess it's such a hassle that few 
would do it.

There is no way to use this to check if a string's length is equal to a 
certain limit, or within a given range. We could use a more verbose API or 
more functions if that is desired.

One alternative is to add an optional limit parameter to String.length, 
which always returns the string's length if below the limit, but returns 
the limit or some atom if it's longer. It would be slightly more verbose to 
check the length, but enables checks for a given value or range (while 
still preventing unnecessary calculations).
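
A rough sketch of that alternative, written as a standalone helper rather 
than as a change to String.length itself. The module name, the function name 
length_up_to and the :over_limit atom are all assumptions picked for the 
example; returning the limit instead of an atom would work the same way.

```elixir
defmodule BoundedLengthSketch do
  # Hypothetical helper; names and the :over_limit atom are illustrative only.

  # Returns the grapheme count when it is at most `limit`,
  # and :over_limit as soon as one grapheme beyond the limit is found.
  def length_up_to(string, limit)
      when is_binary(string) and is_integer(limit) and limit >= 0 do
    count(String.next_grapheme(string), 0, limit)
  end

  # End of string: everything seen so far fits within the limit.
  defp count(nil, seen, _limit), do: seen

  # A further grapheme exists although `limit` graphemes were already counted.
  defp count({_grapheme, _rest}, seen, limit) when seen >= limit, do: :over_limit

  # Count this grapheme and continue with the rest of the string.
  defp count({_grapheme, rest}, seen, limit) do
    count(String.next_grapheme(rest), seen + 1, limit)
  end
end
```

An equality check then becomes BoundedLengthSketch.length_up_to(string, 10) == 10, 
and a range check becomes BoundedLengthSketch.length_up_to(string, 20) in 10..20, 
both without walking past the limit.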

*Benchmarks*
With my implementation, I get the following output from a simple benchmark. 
String.longer? seems to be a few percent slower than String.length for short 
strings when the length is not above the limit, but its cost only grows 
linearly up to the given limit. The tests are named by function, string 
length and limit.

Warning: The function you are trying to benchmark is super fast, making 
time measures unreliable!

Benchee won't measure individual runs but rather run it a couple of times 
and report the average back. Measures will still be correct, but the 
overhead of running it n times goes into the measurement. Also statistical 
results aren't as good, as they are based on averages now. If possible, 
increase the input size so that an individual run takes more than 10μs

Name                                    ips        average    deviation         median
string.length, 10                1012798.94         0.99μs   (±305.69%)         0.90μs
string.longer?, 10, 10           1005594.61         0.99μs   (±184.19%)         0.90μs
string.longer?, 10000, 10         896158.77         1.12μs   (±364.52%)         1.00μs
string.longer?, 10000, 5000         2298.79       435.01μs    (±44.71%)       390.00μs
string.length, 10000                1241.32       805.59μs    (±25.77%)       761.00μs


Comparison:
string.length, 10                1012798.94
string.longer?, 10, 10           1005594.61 - 1.01x slower
string.longer?, 10000, 10         896158.77 - 1.13x slower
string.longer?, 10000, 5000         2298.79 - 440.58x slower
string.length, 10000                1241.32 - 815.90x slower

*Further optimizations*

I considered checking the byte_size of the string, hoping to find 
conditions when we could say for sure what the results would be.


I planned to return false if the byte_size was below the limit, since that 
would mean that there are fewer codepoints than the limit. But I'm not sure 
that there are no situations where a codepoint could produce more than one 
grapheme.

I also planned to return true if byte_size was more than 4 times the limit, 
since each codepoint uses at most four bytes. But since a grapheme could 
use multiple codepoints it could also use more than four bytes, and I'm not 
sure what the upper limit is (if there is any).
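
To illustrate the two ideas, here is a small sketch. It covers only the 
lower bound, and it assumes that a grapheme always consists of at least one 
codepoint and therefore at least one byte, which is my understanding but 
exactly the point I'm unsure about above. The module and function names are 
made up. The second part shows why "four bytes per grapheme" does not hold as 
an upper bound: a base letter followed by two combining marks is one grapheme 
but five bytes.

```elixir
defmodule ByteSizeShortcutSketch do
  # Hypothetical helper for illustration only.

  # Returns {:ok, false} when byte_size alone shows the string cannot have
  # more than `limit` graphemes (assuming every grapheme takes >= 1 byte),
  # and :unknown when the graphemes would still have to be counted.
  def quick_check(string, limit)
      when is_binary(string) and is_integer(limit) and limit >= 0 do
    if byte_size(string) <= limit, do: {:ok, false}, else: :unknown
  end
end

# "e" (1 byte) + U+0301 (2 bytes) + U+0302 (2 bytes):
# three codepoints and five bytes, but a single grapheme.
stacked = "e\u0301\u0302"
IO.inspect({String.length(stacked), byte_size(stacked)})
```

So a byte_size of more than 4 times the limit does not by itself prove that 
the string is longer than the limit.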
