maartenbreddels edited a comment on pull request #7593: URL: https://github.com/apache/arrow/pull/7593#issuecomment-652254096
I like the prefixing by `string`. I'm a big fan of ordering the 'words' in snake- or camel-cased names for good tab completion and alphabetic ordering, so I agree with @wesm's proposal. As a user, I first think of the type I work with ('string'), and next of what I want to do with it (upper/lower casing). Tab completion/ordering would then reveal the 2 variants, which is very intuitive.

To make this concrete:

* string_lower_utf8
* string_lower_ascii
* binary_contains_exact
* string_contains_regex_ascii
* string_contains_regex_utf8

(or s/contains/match 😄)

There might be kernels that work on binary data but do not work well with utf8, e.g. a `binary_slice`, which would slice each binary string on a byte basis. It could cut a multibyte-encoded codepoint in two, resulting in an invalid utf8 string. I'd say that's acceptable; we could check for that depending on the type we get.

Regarding case-insensitive variants, I think we should expose functions to do normalization (e.g. replace the single codepoint ë by the letter e and the combining diaeresis character, i.e. the dots above the ë), case folding, and removal of combining characters. That allows users to remove e.g. the diaeresis from ë to do pattern matching without diacritics.
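
To illustrate the two points about byte-based slicing and normalization, here is a minimal sketch using only Python's standard library, not any Arrow kernel. The helper names (`strip_diacritics`, `fold_for_matching`, `is_valid_utf8`) are hypothetical and only serve the example; how Arrow would expose such functionality is left open.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose to NFD, then drop all combining characters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_for_matching(text: str) -> str:
    """Case-fold and strip diacritics (hypothetical helper for matching)."""
    return strip_diacritics(text.casefold())


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# NFD splits the single codepoint U+00EB (e with diaeresis) into
# 'e' (U+0065) followed by the combining diaeresis (U+0308).
decomposed = unicodedata.normalize("NFD", "\u00eb")
print([hex(ord(ch)) for ch in decomposed])  # ['0x65', '0x308']

# After case folding and removal of combining characters, matching
# without diacritics becomes a plain string comparison.
print(fold_for_matching("Citro\u00ebn") == "citroen")  # True

# U+00EB is encoded as two bytes (0xC3 0xAB) in UTF-8, so a byte-based
# slice can cut it in half and leave invalid UTF-8 behind.
raw = "Citro\u00ebn".encode("utf-8")
print(is_valid_utf8(raw[:6]))  # False: the slice ends after 0xC3
print(is_valid_utf8(raw[:7]))  # True: the slice ends on a codepoint boundary
```

A validity check like `is_valid_utf8` is the kind of check a byte-based slice could run only when the input type is a utf8 string, and skip for plain binary.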