Could this be done automatically by the compiler, replacing
`String.length(str) >= 13` with `String.at(str, 12) != nil`? (`String.at/2` is
zero-based, so index 12 is the 13th grapheme.)

On Saturday, June 25, 2016 at 4:54:26 PM UTC+2, Alvin Lindstam wrote:
>
> Checking the length of a String requires traversing the whole string. A
> fairly common use case is checking whether a String is longer or shorter than
> a given value. I might not care whether the string is 123 456 or 654 321
> graphemes long if all I want to know is whether it is longer than 50 000
> characters, but the full length still has to be calculated to find out.
>
> I'd propose adding String.longer? which calculates the length up to a 
> given limit, and returns a boolean indicating if the string is longer than 
> the limit. It could of course also be used to see if the string is shorter,
> longer-or-equal-to or shorter-or-equal-to:
>
> String.length(string) > 10 == String.longer?(string, 10)
> String.length(string) >= 10 == String.longer?(string, 10 - 1)
> String.length(string) <= 10 == !String.longer?(string, 10)
> String.length(string) < 10 == !String.longer?(string, 10 - 1)
>
>
> I'm not sure about the naming of the function, please suggest a different 
> one if you'd like.
>
> The proposal is implemented here: 
> https://github.com/elixir-lang/elixir/compare/master...alvinlindstam:string-longer.
>  
> I'd be happy to send a PR if the proposal is accepted.
>
> *Alternatives*
> There is nothing stopping a user from implementing this themselves, since 
> next_grapheme_size is public, but I'd guess it's such a hassle that few 
> would do it.
>
> There is no way to use this to check if a string's length is equal to a 
> certain limit, or within a given range. We could use a more verbose api or 
> more functions if that is desired.
>
> One alternative is to add an optional limit parameter to String.length,
> which always returns the string's length if below the limit, but returns 
> the limit or some atom if it's longer. It would be slightly more verbose to 
> check the length, but enables checks for a given value or range (while 
> still preventing unnecessary calculations).
>
> *Benchmarks*
> With my implementation, I get the following output from a simple
> benchmark. String.longer? seems to be a few percent slower than String.length
> for short strings when the length is not above the limit, but its cost only
> grows linearly up to the given limit. The tests are named by function, string
> length and limit.
>
> Warning: The function you are trying to benchmark is super fast, making 
> time measures unreliable!
>
> Benchee won't measure individual runs but rather run it a couple of times 
> and report the average back. Measures will still be correct, but the 
> overhead of running it n times goes into the measurement. Also statistical 
> results aren't as good, as they are based on averages now. If possible, 
> increase the input size so that an individual run takes more than 10μs
>
> Name                                 ips        average     deviation       median
> string.length, 10             1012798.94         0.99μs    (±305.69%)       0.90μs
> string.longer?, 10, 10        1005594.61         0.99μs    (±184.19%)       0.90μs
> string.longer?, 10000, 10      896158.77         1.12μs    (±364.52%)       1.00μs
> string.longer?, 10000, 5000      2298.79       435.01μs     (±44.71%)     390.00μs
> string.length, 10000             1241.32       805.59μs     (±25.77%)     761.00μs
>
> Comparison:
> string.length, 10             1012798.94
> string.longer?, 10, 10        1005594.61 - 1.01x slower
> string.longer?, 10000, 10      896158.77 - 1.13x slower
> string.longer?, 10000, 5000      2298.79 - 440.58x slower
> string.length, 10000             1241.32 - 815.90x slower
>
> *Further optimizations*
>
> I considered checking the byte_size of the string, hoping to find 
> conditions when we could say for sure what the results would be.
>
>
> I planned to return false if the byte_size was below the limit, since that
> would mean that there are fewer codepoints than the limit. But I'm not sure
> whether there are situations where a codepoint could produce more than one
> grapheme.
>
> I also planned to return true if byte_size was more than 4 times the 
> limit, since each codepoint uses at most four bytes. But since a grapheme 
> could use multiple codepoints it could also use more than four bytes, and 
> I'm not sure what the upper limit is (if there is any).
>
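
For anyone following along, here is a minimal sketch of how a bounded check like the proposed String.longer? could work. It is not the implementation from the linked branch: the module name `LongerSketch` is made up for illustration, and it is built on the public `String.next_grapheme/1` rather than `next_grapheme_size`. It also includes the first byte_size shortcut from the quoted post, on the assumption that every grapheme occupies at least one byte, so a binary of at most `limit` bytes cannot hold more than `limit` graphemes. No matching upper-bound shortcut is attempted, since a grapheme can consist of several codepoints.

```elixir
defmodule LongerSketch do
  # Hypothetical module; longer?/2 mirrors the proposed String.longer?/2.
  # Returns true if `string` has more than `limit` graphemes.
  def longer?(string, limit)
      when is_binary(string) and is_integer(limit) and limit >= 0 do
    # Assumption: each grapheme takes at least one byte, so a binary of
    # at most `limit` bytes cannot contain more than `limit` graphemes.
    if byte_size(string) <= limit do
      false
    else
      do_longer?(string, limit)
    end
  end

  # Limit used up: the string is longer only if another grapheme remains.
  defp do_longer?(string, 0), do: String.next_grapheme(string) != nil

  # Consume one grapheme per step and stop as soon as the string runs out,
  # so at most `limit + 1` graphemes are ever inspected.
  defp do_longer?(string, remaining) do
    case String.next_grapheme(string) do
      {_grapheme, rest} -> do_longer?(rest, remaining - 1)
      nil -> false
    end
  end
end
```

With this sketch, `LongerSketch.longer?(string, 10 - 1)` plays the role of `String.length(string) >= 10`, and negating the result gives the two "shorter" comparisons from the proposal.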

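The alternative from the quoted post, a length function with an optional limit, could look roughly like the sketch below. The names `BoundedLength`, `bounded_length/2` and the `:over_limit` atom are hypothetical; the post leaves open whether the limit itself or some atom should be returned when the string is longer.

```elixir
defmodule BoundedLength do
  # Hypothetical helper: returns the grapheme count if it is at most
  # `limit`, otherwise the atom :over_limit (an assumed return value).
  def bounded_length(string, limit)
      when is_binary(string) and is_integer(limit) and limit >= 0 do
    count(string, 0, limit)
  end

  defp count(string, counted, limit) do
    case String.next_grapheme(string) do
      # End of string reached within the limit: report the exact length.
      nil -> counted
      # One more grapheme exists beyond the limit: stop counting early.
      {_grapheme, _rest} when counted == limit -> :over_limit
      {_grapheme, rest} -> count(rest, counted + 1, limit)
    end
  end
end
```

Checking for an exact length or a range then becomes a comparison on the return value, for example `BoundedLength.bounded_length(string, 10) == 10`, while still never walking more than `limit + 1` graphemes.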
