Re: [elixir-core:5688] Add boolean methods for different unicode character groups (String.alphanumeric?, etc)

jisaacstone Fri, 06 May 2016 12:28:57 -0700

Hey, sorry for the late reply.

Just pushed the code to GitHub.


The code to create the file is here:

https://github.com/jisaacstone/ex_unicodedata/blob/master/lib/data/codepoint.ex

Could probably reduce the size somewhat by using tuples instead of structs.

 -isaac

On Wednesday, May 4, 2016 at 2:29:19 PM UTC-7, eksperimental wrote:
>
> Hi jisaacstone, 
> I would be really interested in seeing what you have, 
> especially that 600 KB file.
>
> thank you 
>
> On Wed, 4 May 2016 12:36:51 -0700 (PDT) 
> [email protected] wrote:
>
> > I have done some exploratory work on this sort of thing (implementing
> > Python's unicodedata) and a _correct_ implementation is both large
> > and difficult.
> > 
> > A full unicode properties file I created was 615k, which I think is 
> > about as large as the entire elixir standard library. 
> > 
> > And that is without the east asian character set. 
> > 
> > I think it would make a good 3rd party lib though. 
> > 
> > I never finished the work, but I can throw it up on GitHub if someone
> > else wants to finish it.
> > 
> > On Wednesday, May 4, 2016 at 9:05:21 AM UTC-7, Peter Marreck wrote: 
> > > 
> > > As an example of what would need to be done by necessity for proper 
> > > compliance with Unicode spec, check out the "Derived Property: 
> > > Alphabetic" codepoint list section of this doc: 
> > > 
> > > ftp://ftp.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt 
> > > 
> > > "Total code points: 110943" 
> > > 
> > > And that's just for the "is_alphabetic?" function! (Sure, this 
> > > would be macroed out, but as Eric said, it would definitely 
> > > increase the binary size further...) 
> > > 
> > > I still think this is useful functionality (and would likely be 
> > > many orders of magnitude faster than relying on Regex to determine 
> > > these things due to Elixir/Erlang's fast function-head pattern 
> > > matching) 
> > > 
> > > -- 
> > > Peter Marreck 
> > > 
> > > On Tuesday, May 3, 2016 at 6:24:33 PM UTC-4, Eric Meadows-Jönsson 
> > > wrote: 
> > >> 
> > > >> The problem is that the Unicode module is already big; its
> > > >> .beam file is one of the largest in Elixir. There are
> > > >> also issues compiling this file on systems with 512 MB of memory.
> > > >> idna, an Erlang library for Unicode, has similar issues on
> > > >> systems with low memory. Adding more functions that need a
> > > >> large number of function clauses will make the issue worse and
> > > >> make the compiled Elixir we distribute larger.
> > >> 
> > >> I think it's better to have this functionality in a library until 
> > >> we can solve the memory issue and only have the bare necessities 
> > >> for unicode support in stdlib. If we later can move it into stdlib 
> > >> it would be good to have the API figured out and bugs fixed in 
> > >> another library that can iterate faster. 
> > >> 
> > >> On Tue, May 3, 2016 at 11:29 PM, eksperimental 
> > >> <[email protected]> wrote: 
> > >> 
> > > >>> I'm not sure all of those functions should be added. There
> > > >>> could be too many of them, and they would not be easy to
> > > >>> extend. But how about a Unicode.info/1 function that returns a
> > > >>> tuple with information about the character, such as:
> > > >>> iex> Unicode.info("A")
> > > >>> {:alphanumeric, :uppercase, :ascii}
> > >>> 
> > > >>> It would be easy to extend as we find more information that can
> > > >>> be added, such as ISO types and other groups (especially for
> > > >>> encodings we are not familiar with).
> > >>> 
> > >>> Additionally we could have check?/2 (or some better name 
> > >>> probably!) 
> > > >>> iex> Unicode.check?("A", :uppercase)
> > > >>> true
> > > >>> iex> Unicode.check?("A", :numeric)
> > > >>> false
> > >>> 
> > >>> 
> > > >>> On Tue, 3 May 2016 12:31:44 -0700 (PDT)
> > >>> [email protected] wrote: 
> > >>> 
> > > >>> > I have seen multiple people (in the Elixir Slack group
> > > >>> > <https://elixir-lang.slack.com/archives/general/p1462294660007855>,
> > > >>> > on Reddit
> > > >>> > <https://www.reddit.com/r/elixir/comments/4h4y4e/whats_missing_from_the_elixir_ecosystem/d2nvbwd>)
> > >>> > during the last couple of days requiring something that checks 
> > >>> > if a (possibly long) string contains e.g. only alphanumeric 
> > >>> > characters. 
> > >>> > 
> > >>> > It is possible to do this using regular expressions right now: 
> > >>> > ~r/[^[:alnum:]]/u 
> > >>> > 
> > >>> > but this is very slow. 
> > >>> > 
> > >>> > My proposal is to add the following boolean functions to the 
> > >>> > String module: 
> > >>> > 
> > >>> > 
> > >>> >    -  alphabetic? 
> > >>> >    -  numeric? 
> > >>> >    -  alphanumeric? 
> > >>> >    -  whitespace? 
> > >>> >    -  uppercase? 
> > >>> >    -  lowercase? 
> > >>> >    -  control_character? 
> > >>> > 
> > >>> > 
> > >>> > Function heads for these functions can probably be best 
> > >>> > generated by using compile-time macros similar to what other 
> > >>> > unicode-based functions already use. 
> > >>> > 
> > >>> 
> > >>> -- 
> > >>> You received this message because you are subscribed to the 
> > >>> Google Groups "elixir-lang-core" group. 
> > >>> To unsubscribe from this group and stop receiving emails from it, 
> > >>> send an email to [email protected]. 
> > > >>> To view this discussion on the web visit
> > > >>> https://groups.google.com/d/msgid/elixir-lang-core/20160504042910.57fd86e0.eksperimental%40autistici.org.
> > > >>> For more options, visit https://groups.google.com/d/optout.
> > >>> 
> > >> 
> > >> 
> > >> 
> > >> -- 
> > >> Eric Meadows-Jönsson 
> > >> 
> > > 
> > 
>
>

