maartenbreddels edited a comment on pull request #7593:
URL: https://github.com/apache/arrow/pull/7593#issuecomment-652254096
I like the prefixing by `string`. I'm a big fan of ordering the 'words' in
snake- or camel-cased names so that tab completion and alphabetical ordering
work well, so I agree with @wesm's proposal. As a user, I first think of the
type I work with ('string'), then of what I want to do with it (upper/lower
casing). Tab completion or alphabetical ordering would then reveal the two
variants, which is very intuitive.
To make this concrete:
* string_lower_utf8
* string_lower_ascii
* binary_contains_exact
* string_contains_regex_ascii
* string_contains_regex_utf8
(or s/contains/match 😄 )
There might be kernels that work on binary data but do not work well with
utf8, e.g. a `binary_slice`, which would slice each binary string on a byte
basis. It could cut through a multibyte-encoded codepoint and produce an
invalid utf8 string. I'd say that's acceptable; we could check for that
depending on the type we get.
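To illustrate the slicing concern, here is a minimal sketch in plain Python (no Arrow involved). The `binary_slice` and `is_valid_utf8` helpers are hypothetical names used only for this example; they are not existing kernels.

```python
def binary_slice(value: bytes, start: int, stop: int) -> bytes:
    """Hypothetical byte-wise slice; a stand-in, not an Arrow kernel."""
    return value[start:stop]


def is_valid_utf8(value: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        value.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# "q" is 1 byte in UTF-8, U+00EB (e with diaeresis) is 2 bytes: 0xC3 0xAB.
encoded = "q\u00eb".encode("utf-8")

whole = binary_slice(encoded, 0, 3)  # keeps the full 2-byte sequence
cut = binary_slice(encoded, 0, 2)    # stops in the middle of U+00EB

print(is_valid_utf8(whole))
print(is_valid_utf8(cut))
```

A validity check like this could be run only when the input type is a utf8 string type and skipped for plain binary input.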
Regarding case-insensitive variants, I think we should expose functions to
do normalization (e.g. replacing the single codepoint ë with the letter e
followed by the combining diaeresis, the dots above the ë), case folding, and
removal of combining characters. That lets users strip e.g. the diaeresis from
ë so they can do pattern matching that ignores diacritics.
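As a rough sketch of what those three building blocks do, here is the same idea using Python's standard `unicodedata` module. The function names `strip_diacritics` and `fold_for_matching` are hypothetical and only meant to show the behavior, not to propose kernel names.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose to NFD, then drop combining characters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_for_matching(text: str) -> str:
    """Case-fold first, then remove diacritics."""
    return strip_diacritics(text.casefold())


# U+00EB decomposes into "e" plus U+0308 (combining diaeresis).
decomposed = unicodedata.normalize("NFD", "\u00eb")
print(len(decomposed))

# "ZO" + U+00CB and "zoe" compare equal after folding and stripping.
print(fold_for_matching("ZO\u00cb") == fold_for_matching("zoe"))
```

Exposing the steps separately would let users pick the combination they need, e.g. case folding without touching diacritics.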