[GitHub] [arrow] maartenbreddels commented on pull request #7656: ARROW-9268: [C++] add string_is{alpnum,alpha...,upper} kernels

GitBox Wed, 08 Jul 2020 03:09:53 -0700


maartenbreddels commented on pull request #7656:
URL: https://github.com/apache/arrow/pull/7656#issuecomment-655425260



   > [U+08BE](https://www.fileformat.info/info/unicode/char/08be/index.htm) was 
defined in Unicode 13, and category Lo is correct for that character. It sounds 
like you may be looking at obsolete Unicode tables?
   
   Thanks for that. As I replied in the issue on utf8proc, I didn't expect the 
Unicode data to change that fast (I guess Python 3.7 doesn't support Unicode 
13, which is actually difficult to find out).
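
For anyone wanting to check which Unicode tables a given Python build ships with, a minimal sketch using only the standard library could look like this. The helper names `unicode_data_version` and `category_of` are made up for illustration; `unicodedata.unidata_version` and `unicodedata.category` are the real standard-library entry points.

```python
import sys
import unicodedata


def unicode_data_version():
    """Return the Unicode Character Database version bundled with this interpreter."""
    return unicodedata.unidata_version


def category_of(codepoint):
    """Return the General Category that this interpreter reports for a code point."""
    return unicodedata.category(chr(codepoint))


if __name__ == "__main__":
    print("Python", sys.version_info[:2], "ships Unicode", unicode_data_version())
    # U+08BE is reported as 'Lo' by tables that include Unicode 13,
    # and as 'Cn' (unassigned) by older tables.
    print("U+08BE category:", category_of(0x08BE))
```

If the reported version is older than 13.0.0, a mismatch with utf8proc on recently added characters such as U+08BE is to be expected.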
   
   
   
   > Can't you use the Unicode category (N*) for this? That's [what Julia 
does](https://github.com/JuliaLang/julia/blob/master/base/strings/unicode.jl#L405).
   
   That's how I implemented it now. However, for instance, 
https://graphemica.com/%E6%9F%92 has a numeric value of 7 (it's an example from 
the Unicode spec v13, section 4.6 
http://www.unicode.org/versions/Unicode13.0.0/UnicodeStandard-13.0.pdf ). 
Python lists this as numeric (`assert '柒'.isnumeric() == True`), but its General 
Category is 'Other Letter' (Lo), so a check on the N* categories alone misses it.
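
The mismatch can be reproduced from Python's own `unicodedata` module. This is only a sketch of what Python reports, not of what the kernels in this PR do; the function name `describe` is made up for illustration.

```python
import unicodedata


def describe(ch):
    """Collect the properties Python reports for a single character."""
    return {
        "codepoint": "U+%04X" % ord(ch),
        "category": unicodedata.category(ch),       # General Category
        "numeric": unicodedata.numeric(ch, None),   # numeric value, or None if absent
        "isnumeric": ch.isnumeric(),
        "isdigit": ch.isdigit(),
        "isdecimal": ch.isdecimal(),
    }


if __name__ == "__main__":
    for key, value in describe("柒").items():
        print(key, value)
```

So a numeric value is attached to the character even though its category does not start with N.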
   
   I didn't open an issue because I'm not sure where this information lives; I 
have difficulty mapping between the spec, what Python does, and the Unicode data 
files. 
   
   And to be honest, I don't fully understand it, and it's a small list:
   ``` 㐅, 㒃, 㠪, 㭍, 一, 七, 万, 三, 九, 二, 五, 亖, 亿, 什, 仟, 仨, 伍, 佰, 億, 兆, 兩, 八, 六, 十, 
千, 卄, 卅, 卌, 叁, 参, 參, 叄, 四, 壱, 壹, 幺, 廾, 廿, 弌, 弍, 弎, 弐, 拾, 捌, 柒, 漆, 玖, 百, 肆, 萬, 
貮, 貳, 贰, 阡, 陆, 陌, 陸, 零, 參, 拾, 兩, 零, 六, 陸, 什, ```
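
A list like the one above can be regenerated with a short scan over all code points. This is a sketch under the assumption that "numeric" means Python's `str.isnumeric()`; the exact output depends on the Unicode version of the interpreter, and the function name `numeric_outside_n_categories` is made up for illustration.

```python
import sys
import unicodedata


def numeric_outside_n_categories():
    """Return characters that str.isnumeric() accepts but whose
    General Category is not one of Nd, Nl or No."""
    found = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if ch.isnumeric() and not unicodedata.category(ch).startswith("N"):
            found.append(ch)
    return found


if __name__ == "__main__":
    chars = numeric_outside_n_categories()
    print(len(chars), "characters")
    print(", ".join(chars))
```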
   
   


