Hey, sorry for the late reply. I just pushed the code to GitHub.
The code to create the file is here:
https://github.com/jisaacstone/ex_unicodedata/blob/master/lib/data/codepoint.ex

Could probably reduce the size somewhat by using tuples instead of structs.

-isaac

On Wednesday, May 4, 2016 at 2:29:19 PM UTC-7, eksperimental wrote:
>
> Hi jisaacstone,
> I would be really interested in seeing what you have,
> especially that 600kb file
>
> thank you
>
> On Wed, 4 May 2016 12:36:51 -0700 (PDT)
> [email protected] wrote:
>
> > I have done some exploratory work on this sort of thing (implementing
> > python's UnicodeData) and a _correct_ implementation is both large
> > and difficult.
> >
> > A full unicode properties file I created was 615k, which I think is
> > about as large as the entire elixir standard library.
> >
> > And that is without the east asian character set.
> >
> > I think it would make a good 3rd party lib though.
> >
> > I never finished the work, but I can throw it up on github if someone
> > else wants to finish it.
> >
> > On Wednesday, May 4, 2016 at 9:05:21 AM UTC-7, Peter Marreck wrote:
> > >
> > > As an example of what would need to be done by necessity for proper
> > > compliance with Unicode spec, check out the "Derived Property:
> > > Alphabetic" codepoint list section of this doc:
> > >
> > > ftp://ftp.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt
> > >
> > > "Total code points: 110943"
> > >
> > > And that's just for the "is_alphabetic?" function! (Sure, this
> > > would be macroed out, but as Eric said, it would definitely
> > > increase the binary size further...)
> > >
> > > I still think this is useful functionality (and would likely be
> > > many orders of magnitude faster than relying on Regex to determine
> > > these things due to Elixir/Erlang's fast function-head pattern
> > > matching)
> > >
> > > --
> > > Peter Marreck
> > >
> > > On Tuesday, May 3, 2016 at 6:24:33 PM UTC-4, Eric Meadows-Jönsson
> > > wrote:
> > >>
> > >> The problem is that the Unicode module is already big; the
> > >> .beam file is one of the largest in elixir. There are
> > >> also issues compiling this file on systems with 512 MB of memory.
> > >> idna, an erlang library for unicode, has similar issues on
> > >> systems with low memory. Adding more functions that will need a
> > >> large number of function clauses will make the issue worse and the
> > >> size of the compiled elixir we distribute larger.
> > >>
> > >> I think it's better to have this functionality in a library until
> > >> we can solve the memory issue and only have the bare necessities
> > >> for unicode support in stdlib. If we later can move it into stdlib
> > >> it would be good to have the API figured out and bugs fixed in
> > >> another library that can iterate faster.
> > >>
> > >> On Tue, May 3, 2016 at 11:29 PM, eksperimental
> > >> <[email protected]> wrote:
> > >>
> > >>> I'm not too sure all of those functions should be added.
> > >>> It could be too many of them, and not easy to extend.
> > >>> But how about a Unicode.info/1 function that returns a
> > >>> tuple with information about that character, such as
> > >>> iex> Unicode.info("A")
> > >>> ...> {:alphanumeric, :uppercase, :ascii}
> > >>>
> > >>> It will be easy to improve as we find more information that can be
> > >>> added, such as ISO types and other groups (especially for encodings
> > >>> we are not familiar with)
> > >>>
> > >>> Additionally we could have check?/2 (or some better name
> > >>> probably!)
> > >>> iex> Unicode.check?("A", :uppercase)
> > >>> ...> true
> > >>> iex> Unicode.check?("A", :numeric)
> > >>> ...> false
> > >>>
> > >>>
> > >>> created, but On Tue, 3 May 2016 12:31:44 -0700 (PDT)
> > >>> [email protected] wrote:
> > >>>
> > >>> > I have seen multiple people (In the Elixir Slack group
> > >>> > <https://elixir-lang.slack.com/archives/general/p1462294660007855>,
> > >>> > on Reddit
> > >>> > <https://www.reddit.com/r/elixir/comments/4h4y4e/whats_missing_from_the_elixir_ecosystem/d2nvbwd>)
> > >>> > during the last couple of days requiring something that checks
> > >>> > if a (possibly long) string contains e.g. only alphanumeric
> > >>> > characters.
> > >>> >
> > >>> > It is possible to do this using regular expressions right now:
> > >>> > ~r/[^[:alnum:]]/u
> > >>> >
> > >>> > but this is very slow.
> > >>> >
> > >>> > My proposal is to add the following boolean functions to the
> > >>> > String module:
> > >>> >
> > >>> > - alphabetic?
> > >>> > - numeric?
> > >>> > - alphanumeric?
> > >>> > - whitespace?
> > >>> > - uppercase?
> > >>> > - lowercase?
> > >>> > - control_character?
> > >>> >
> > >>> > Function heads for these functions can probably be best
> > >>> > generated by using compile-time macros similar to what other
> > >>> > unicode-based functions already use.
> > >>> >
> > >>>
> > >>> --
> > >>> You received this message because you are subscribed to the
> > >>> Google Groups "elixir-lang-core" group.
> > >>> To unsubscribe from this group and stop receiving emails from it,
> > >>> send an email to [email protected].
> > >>> To view this discussion on the web visit
> > >>> https://groups.google.com/d/msgid/elixir-lang-core/20160504042910.57fd86e0.eksperimental%40autistici.org.
> > >>> For more options, visit https://groups.google.com/d/optout.
> > >>>
> > >>
> > >> --
> > >> Eric Meadows-Jönsson
> > >>
> > >
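
For anyone skimming the thread, here is a rough sketch of the "function heads generated at compile time" idea from the original proposal quoted above. It is only an illustration, not how Elixir's own Unicode module or ex_unicodedata does it: the module name CharClassSketch, both function names, and the three hard-coded ranges are hypothetical. A real version would parse the ranges out of DerivedCoreProperties.txt at compile time, and that full table is exactly what makes the compiled module so large.

```elixir
defmodule CharClassSketch do
  # Hypothetical, hand-picked subset of uppercase ranges. A real library would
  # read the complete list from DerivedCoreProperties.txt at compile time.
  @uppercase_ranges [{?A, ?Z}, {0x00C0, 0x00D6}, {0x00D8, 0x00DE}]

  # Emit one function clause per range while the module is being compiled.
  # unquote/1 injects the range bounds as integer literals into each clause.
  for {first, last} <- @uppercase_ranges do
    def uppercase_codepoint?(cp)
        when is_integer(cp) and cp >= unquote(first) and cp <= unquote(last),
        do: true
  end

  # Fallback clause for anything not covered by the generated clauses.
  def uppercase_codepoint?(_cp), do: false

  # True only when the string is non-empty and every codepoint is uppercase
  # according to the (incomplete) table above.
  def uppercase?(""), do: false

  def uppercase?(string) when is_binary(string) do
    string
    |> String.to_charlist()
    |> Enum.all?(&uppercase_codepoint?/1)
  end
end
```

Generating one clause per range rather than one per codepoint keeps the clause count down, which seems relevant to the .beam size and compile-memory concerns raised earlier in the thread. Whether that is enough for the full Unicode data is something I have not measured.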
