Re: [elixir-core:5665] Add boolean methods for different unicode character groups (String.alphanumeric?, etc)

eksperimental Wed, 04 May 2016 14:29:44 -0700

Hi jisaacstone,
I would be really interested in seeing what you have,
especially that 600 KB file.


Thank you

On Wed, 4 May 2016 12:36:51 -0700 (PDT)
[email protected] wrote:

> I have done some exploratory work on this sort of thing (implementing
> Python's unicodedata) and a _correct_ implementation is both large
> and difficult.
> 
> A full unicode properties file I created was 615k, which I think is
> about as large as the entire elixir standard library.
> 
> And that is without the east asian character set.
> 
> I think it would make a good 3rd party lib though.
> 
> I never finished the work, but I can throw it up on GitHub if someone
> else wants to finish it.
> 
> On Wednesday, May 4, 2016 at 9:05:21 AM UTC-7, Peter Marreck wrote:
> >
> > As an example of what would need to be done by necessity for proper 
> > compliance with Unicode spec, check out the "Derived Property:
> > Alphabetic" codepoint list section of this doc:
> >
> > ftp://ftp.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt
> >
> > "Total code points: 110943"
> >
> > And that's just for the "is_alphabetic?" function! (Sure, this
> > would be macroed out, but as Eric said, it would definitely
> > increase the binary size further...)
> >
> > I still think this is useful functionality (and would likely be
> > many orders of magnitude faster than relying on Regex to determine
> > these things due to Elixir/Erlang's fast function-head pattern
> > matching)
> >
> > --
> > Peter Marreck
> >
> > On Tuesday, May 3, 2016 at 6:24:33 PM UTC-4, Eric Meadows-Jönsson
> > wrote:
> >>
> >> The problem is that the Unicode module is already big: its .beam
> >> file is one of the largest in Elixir. There are also issues
> >> compiling this file on systems with 512 MB of memory. idna, an
> >> Erlang library for Unicode, has similar issues on systems with low
> >> memory. Adding more functions that need a large number of function
> >> clauses will make the issue worse and make the compiled Elixir we
> >> distribute larger.
> >>
> >> I think it's better to have this functionality in a library until
> >> we can solve the memory issue and only have the bare necessities
> >> for unicode support in stdlib. If we later can move it into stdlib
> >> it would be good to have the API figured out and bugs fixed in
> >> another library that can iterate faster.
> >>
> >> On Tue, May 3, 2016 at 11:29 PM, eksperimental
> >> <[email protected]> wrote:
> >>
> >>> I'm not too sure that all of those functions should be added.
> >>> There could be too many of them, and the set would not be easy
> >>> to extend. But how about a Unicode.info/1 function that returns a
> >>> tuple with information about that character, such as:
> >>> iex> Unicode.info("A")
> >>> {:alphanumeric, :uppercase, :ascii}
> >>>
> >>> It will be easy to improve as we find more information that can
> >>> be added, such as ISO types and other groups (especially for
> >>> encodings we are not familiar with).
> >>>
> >>> Additionally we could have check?/2 (or some better name
> >>> probably!)
> >>> iex> Unicode.check?("A", :uppercase)
> >>> true
> >>> iex> Unicode.check?("A", :numeric)
> >>> false
> >>>
> >>>
> >>> On Tue, 3 May 2016 12:31:44 -0700 (PDT)
> >>> [email protected] wrote:
> >>>
> >>> > I have seen multiple people (In the Elixir Slack group
> >>> > <https://elixir-lang.slack.com/archives/general/p1462294660007855>,
> >>> > on Reddit
> >>> > <
> >>> https://www.reddit.com/r/elixir/comments/4h4y4e/whats_missing_from_the_elixir_ecosystem/d2nvbwd
> >>> >)
> >>> > during the last couple of days requiring something that checks
> >>> > if a (possibly long) string contains e.g. only alphanumeric
> >>> > characters.
> >>> >
> >>> > It is possible to do this using regular expressions right now:
> >>> > ~r/[^[:alnum:]]/u
> >>> >
> >>> > but this is very slow.
> >>> >
> >>> > My proposal is to add the following boolean functions to the
> >>> > String module:
> >>> >
> >>> >
> >>> >    -  alphabetic?
> >>> >    -  numeric?
> >>> >    -  alphanumeric?
> >>> >    -  whitespace?
> >>> >    -  uppercase?
> >>> >    -  lowercase?
> >>> >    -  control_character?
> >>> >
> >>> >
> >>> > Function heads for these functions can probably be best
> >>> > generated by using compile-time macros similar to what other
> >>> > unicode-based functions already use.
> >>> >
> >>>
> >>> --
> >>> You received this message because you are subscribed to the
> >>> Google Groups "elixir-lang-core" group.
> >>> To unsubscribe from this group and stop receiving emails from it,
> >>> send an email to [email protected].
> >>> To view this discussion on the web visit 
> >>> https://groups.google.com/d/msgid/elixir-lang-core/20160504042910.57fd86e0.eksperimental%40autistici.org
> >>> .
> >>> For more options, visit https://groups.google.com/d/optout.
> >>>
> >>
> >>
> >>
> >> -- 
> >> Eric Meadows-Jönsson
> >>
> >
> 
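
The compile-time function-head approach discussed in the quoted messages, combined with the proposed check?/2, could look roughly like the sketch below. Everything in it is an assumption made for illustration: the module name UnicodeSketch, the helpers member?/2 and all?/2, and the hand-picked Latin-1 ranges are hypothetical. A real implementation would derive its ranges from the Unicode data files (for example DerivedCoreProperties.txt), which is where the size concerns raised in the thread come from.

```elixir
defmodule UnicodeSketch do
  # Hypothetical module for illustration; not part of Elixir's standard library.
  # A tiny, hand-picked subset of code point ranges. A complete table would be
  # generated from the Unicode data files and would be far larger.
  @ranges %{
    alphabetic: [{?A, ?Z}, {?a, ?z}, {0x00C0, 0x00D6}, {0x00D8, 0x00F6}, {0x00F8, 0x00FF}],
    numeric: [{?0, ?9}],
    uppercase: [{?A, ?Z}, {0x00C0, 0x00D6}, {0x00D8, 0x00DE}]
  }

  # Returns true if every code point in `string` belongs to `class`.
  # An empty string is vacuously true; invalid UTF-8 yields false.
  def check?(string, class) when is_binary(string) and is_atom(class) do
    all?(string, class)
  end

  # Walk the binary one UTF-8 code point at a time.
  defp all?(<<cp::utf8, rest::binary>>, class), do: member?(class, cp) and all?(rest, class)
  defp all?(<<>>, _class), do: true
  defp all?(_invalid_utf8, _class), do: false

  # One function head per {class, range} pair, generated at compile time.
  for {class, ranges} <- @ranges, {first, last} <- ranges do
    defp member?(unquote(class), cp) when cp >= unquote(first) and cp <= unquote(last), do: true
  end

  # A derived class built from two generated ones.
  defp member?(:alphanumeric, cp), do: member?(:alphabetic, cp) or member?(:numeric, cp)
  defp member?(_class, _cp), do: false
end

UnicodeSketch.check?("A", :uppercase)            # true
UnicodeSketch.check?("A", :numeric)              # false
UnicodeSketch.check?("abc 123", :alphanumeric)   # false, the space is in neither class
```

Each range becomes one clause here, so the number of generated clauses grows with the size of the property table. That growth is the compile-time memory and .beam size cost mentioned earlier in the thread.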

