maartenbreddels commented on pull request #7656:
URL: https://github.com/apache/arrow/pull/7656#issuecomment-655425260
> [U+08BE](https://www.fileformat.info/info/unicode/char/08be/index.htm) was defined in Unicode 13, and category Lo is correct for that character. It sounds like you may be looking at obsolete Unicode tables?

Thanks for that. As I replied in the issue on utf8proc, I didn't expect the Unicode data to change that fast. (I guess Python 3.7 doesn't support Unicode 13, which is actually difficult to find out.)

> Can't you use the Unicode category (N*) for this? That's [what Julia does](https://github.com/JuliaLang/julia/blob/master/base/strings/unicode.jl#L405).

That's how I implemented it now. It does not match Python for every character, though. For instance, 柒 (https://graphemica.com/%E6%9F%92) has a numeric value of 7 (it's an example from the Unicode spec v13, section 4.6: http://www.unicode.org/versions/Unicode13.0.0/UnicodeStandard-13.0.pdf). Python lists it as numeric (`assert '柒'.isnumeric() == True`), but its General Category is 'Other letter'.

I didn't open an issue, because I'm not sure where this information is: I have difficulty mapping between the spec, what Python does, and the Unicode data files. To be honest, I don't fully understand it, and it's a small list:

```
㐅, 㒃, 㠪, 㭍, 一, 七, 万, 三, 九, 二, 五, 亖, 亿, 什, 仟, 仨, 伍, 佰, 億, 兆, 兩, 八, 六, 十, 千, 卄, 卅, 卌, 叁, 参, 參, 叄, 四, 壱, 壹, 幺, 廾, 廿, 弌, 弍, 弎, 弐, 拾, 捌, 柒, 漆, 玖, 百, 肆, 萬, 貮, 貳, 贰, 阡, 陆, 陌, 陸, 零, 參, 拾, 兩, 零, 六, 陸, 什,
```
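For anyone who wants to reproduce the mismatch, here is a minimal Python sketch using only the standard `unicodedata` module. It prints the Unicode version bundled with the running interpreter, then collects every code point that `str.isnumeric()` accepts but whose General Category is not one of the N* categories. The helper name `numeric_outside_n` is hypothetical, and the resulting list will probably vary with the Unicode version the interpreter ships.

```python
import sys
import unicodedata

# Version of the Unicode Character Database bundled with this interpreter.
print("Unicode data version:", unicodedata.unidata_version)


def numeric_outside_n():
    """Return characters that str.isnumeric() accepts although their
    General Category is not Nd, Nl, or No (hypothetical helper)."""
    found = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if ch.isnumeric() and not unicodedata.category(ch).startswith("N"):
            found.append(ch)
    return found


ch = "柒"
# The character discussed in the comment: numeric for Python,
# yet classified as a letter by its General Category.
print(ch, unicodedata.category(ch), ch.isnumeric(), unicodedata.numeric(ch))

# Every character where a category-based check and Python disagree.
print(", ".join(numeric_outside_n()))
```

An implementation that tests only for the N* categories would treat every character in that output as non-numeric, which is the difference from Python described above.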
