Re: [elixir-core:5668] Add boolean methods for different unicode character groups (String.alphanumeric?, etc)

José Valim Thu, 05 May 2016 00:44:08 -0700

I think there are a couple of things that could be done to reduce the file
size:


* Break the properties into multiple modules. One module for alphanumeric,
another module for a couple others, etc, etc

* Work on integer codepoints instead of binaries, i.e. support
UnicodeData.alphanumeric?(?é) instead of UnicodeData.alphanumeric?("é"). If
the latter is given, you can convert it to the former. The reasoning is that,
if you work with integers, you can support ranges. So instead of:

    alphanumeric?("a")
    alphanumeric?("b")
    ...
    alphanumeric?("z")

You can write:

    alphanumeric?(x) when x in ?a..?z

Ranges are slightly slower but I have pushed patches to optimize them on
Erlang 19.
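
To make the integer-plus-ranges idea concrete, here is a minimal sketch. The
module name UnicodeSketch and the three ASCII ranges plus one Latin-1 range
are assumptions chosen for illustration only; a real implementation would
cover every range in the Unicode data and would not be hand-written like this.

```elixir
defmodule UnicodeSketch do
  # Hypothetical module, not part of Elixir. Only a handful of ranges are
  # listed, so this is NOT a complete alphanumeric check.

  # A binary holding exactly one codepoint is converted to its integer form.
  def alphanumeric?(<<cp::utf8>>), do: alphanumeric?(cp)

  # One clause per range instead of one clause per character.
  def alphanumeric?(cp) when cp in ?0..?9, do: true
  def alphanumeric?(cp) when cp in ?A..?Z, do: true
  def alphanumeric?(cp) when cp in ?a..?z, do: true
  # U+00E0..U+00F6, a run of Latin-1 lowercase letters that includes é.
  def alphanumeric?(cp) when cp in 0xE0..0xF6, do: true

  # Any other integer codepoint is treated as not alphanumeric.
  def alphanumeric?(cp) when is_integer(cp), do: false
end
```

With this shape, both UnicodeSketch.alphanumeric?(?a) and
UnicodeSketch.alphanumeric?("a") go through the same integer clauses. A
binary with more than one codepoint matches no clause and raises, which a
real API would need to decide how to handle.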

Both techniques are used in Elixir's Unicode files IIRC. The ranges would
have to be extracted from the Unicode file.
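
As a rough sketch of what extracting the ranges could look like, the module
below parses a few lines in the format of DerivedCoreProperties.txt and
generates one function clause per range at compile time. The module name
RangeSketch and the embedded three-line sample are assumptions made to keep
the example self-contained; a real build would read the full data file, and
this is not necessarily how Elixir's own Unicode files do it.

```elixir
defmodule RangeSketch do
  # Hypothetical module. The sample mimics the "XXXX..YYYY ; Property # ..."
  # line format; a real build would read the whole data file instead.
  sample = """
  0041..005A    ; Alphabetic # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
  0061..007A    ; Alphabetic # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
  00AA          ; Alphabetic # Lo       FEMININE ORDINAL INDICATOR
  """

  # Turn each line into a {first, last} pair of integer codepoints.
  ranges =
    for line <- String.split(sample, "\n", trim: true) do
      [codes, _rest] = String.split(line, ";", parts: 2)

      case String.split(String.trim(codes), "..") do
        [first, last] -> {String.to_integer(first, 16), String.to_integer(last, 16)}
        [single] -> {String.to_integer(single, 16), String.to_integer(single, 16)}
      end
    end

  # Generate one guarded clause per range at compile time.
  for {first, last} <- ranges do
    def alphabetic?(cp) when cp in unquote(first)..unquote(last), do: true
  end

  # Fallback for every codepoint outside the extracted ranges.
  def alphabetic?(cp) when is_integer(cp), do: false
end
```

The number of generated clauses then grows with the number of ranges in the
file rather than with the number of codepoints.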

*José Valim*
www.plataformatec.com.br
Skype: jv.ptec
Founder and Director of R&D

On Wed, May 4, 2016 at 9:36 PM, <[email protected]> wrote:

> I have done some exploratory work on this sort of thing (implementing
> Python's UnicodeData) and a _correct_ implementation is both large and
> difficult.
>
> A full unicode properties file I created was 615k, which I think is about
> as large as the entire elixir standard library.
>
> And that is without the east asian character set.
>
> I think it would make a good 3rd party lib though.
>
> I never finished the work, but I can throw it up on GitHub if someone else
> wants to finish it.
>
> On Wednesday, May 4, 2016 at 9:05:21 AM UTC-7, Peter Marreck wrote:
>>
>> As an example of what would need to be done by necessity for proper
>> compliance with Unicode spec, check out the "Derived Property: Alphabetic"
>> codepoint list section of this doc:
>>
>> ftp://ftp.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt
>>
>> "Total code points: 110943"
>>
>> And that's just for the "is_alphabetic?" function! (Sure, this would be
>> macroed out, but as Eric said, it would definitely increase the binary size
>> further...)
>>
>> I still think this is useful functionality (and would likely be many
>> orders of magnitude faster than relying on Regex to determine these things
>> due to Elixir/Erlang's fast function-head pattern matching)
>>
>> --
>> Peter Marreck
>>
>> On Tuesday, May 3, 2016 at 6:24:33 PM UTC-4, Eric Meadows-Jönsson wrote:
>>>
>>> The problem is that the Unicode module is already big; its .beam file
>>> is one of the largest in Elixir. There are also issues
>>> compiling this file on systems with 512 MB of memory. idna, an Erlang library
>>> for Unicode, has similar issues on systems with low memory. Adding more
>>> functions that need a large number of function clauses will make the
>>> issue worse and the compiled Elixir we distribute larger.
>>>
>>> I think it's better to have this functionality in a library until we can
>>> solve the memory issue and only have the bare necessities for unicode
>>> support in stdlib. If we later can move it into stdlib it would be good to
>>> have the API figured out and bugs fixed in another library that can iterate
>>> faster.
>>>
>>> On Tue, May 3, 2016 at 11:29 PM, eksperimental <[email protected]>
>>> wrote:
>>>
>>>> I'm not too sure all those functions should be added. There could be
>>>> too many of them, and they would not be easy to extend.
>>>> But how about a Unicode.info/1 function that returns a tuple with
>>>> information about that character, such as:
>>>> iex> Unicode.info("A")
>>>> {:alphanumeric, :uppercase, :ascii}
>>>>
>>>> It will be easy to improve as we find more information that can be
>>>> added, such as ISO types and other groups (especially for encodings we
>>>> are not familiar with).
>>>>
>>>> Additionally, we could have check?/2 (or some better name, probably!):
>>>> iex> Unicode.check?("A", :uppercase)
>>>> true
>>>> iex> Unicode.check?("A", :numeric)
>>>> false
>>>>
>>>>
>>>> On Tue, 3 May 2016 12:31:44 -0700 (PDT)
>>>> [email protected] wrote:
>>>>
>>>> > I have seen multiple people (In the Elixir Slack group
>>>> > <https://elixir-lang.slack.com/archives/general/p1462294660007855>,
>>>> > on Reddit
>>>> > <https://www.reddit.com/r/elixir/comments/4h4y4e/whats_missing_from_the_elixir_ecosystem/d2nvbwd>)
>>>> > during the last couple of days requiring something that checks if a
>>>> > (possibly long) string contains e.g. only alphanumeric characters.
>>>> >
>>>> > It is possible to do this using regular expressions right now:
>>>> > ~r/[^[:alnum:]]/u
>>>> >
>>>> > but this is very slow.
>>>> >
>>>> > My proposal is to add the following boolean functions to the String
>>>> > module:
>>>> >
>>>> >
>>>> >    -  alphabetic?
>>>> >    -  numeric?
>>>> >    -  alphanumeric?
>>>> >    -  whitespace?
>>>> >    -  uppercase?
>>>> >    -  lowercase?
>>>> >    -  control_character?
>>>> >
>>>> >
>>>> > Function heads for these functions can probably be best generated by
>>>> > using compile-time macros similar to what other unicode-based
>>>> > functions already use.
>>>> >
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "elixir-lang-core" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/elixir-lang-core/20160504042910.57fd86e0.eksperimental%40autistici.org
>>>> .
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>>
>>>
>>> --
>>> Eric Meadows-Jönsson
>>>

