[GitHub] [arrow] maartenbreddels edited a comment on pull request #7593: ARROW-9160: [C++] Implement contains for exact matches

GitBox Wed, 01 Jul 2020 00:52:54 -0700


maartenbreddels edited a comment on pull request #7593:
URL: https://github.com/apache/arrow/pull/7593#issuecomment-652254096



   I like the prefixing by `string`. I'm a big fan of ordering 'words' in snake 
or camel casing for good tab completion and alphabetic ordering, so I agree 
with @wesm's proposal. As a user, I first think of the type I work with 
('string'), and next of what I want to do with it (upper/lower casing). 
Tab-completion/ordering would then reveal the two variants, which is very 
intuitive.
   To make this concrete:
    * string_lower_utf8
    * string_lower_ascii
    * binary_contains_exact
    * string_contains_regex_ascii
    * string_contains_regex_utf8
   
   (or s/contains/match 😄 )
   
   There might be kernels that work on binary data but do not work well with 
utf8, e.g. a `binary_slice`, which would slice each binary string on a byte 
basis. It could cut through a multibyte-encoded codepoint and produce an 
invalid utf8 string. I'd say that's acceptable; we could check for that 
depending on the type we get.
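   
   To illustrate the byte-slicing concern, here is a minimal Python sketch. 
The helpers `binary_slice` and `is_valid_utf8` are hypothetical and written 
only for this example; they are not Arrow's API or its kernel implementation.

```python
def binary_slice(values, start, stop):
    """Slice each byte string on a byte basis (hypothetical helper)."""
    return [v[start:stop] for v in values]


def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# "\u00eb" is the precomposed ë, which takes two bytes in UTF-8 (0xC3 0xAB).
values = ["\u00eba".encode("utf-8"), b"abc"]

# Taking only the first byte cuts the two-byte codepoint in half.
sliced = binary_slice(values, 0, 1)
validity = [is_valid_utf8(s) for s in sliced]
```

   Here the first sliced value is the lone byte 0xC3, which is not valid utf8 
on its own, while the pure-ASCII value stays valid. A validity check like this 
is the kind of thing that could be applied only when the input type is a 
string type.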
   
   Regarding case-insensitive variants, I think we should expose functions to 
do normalization (e.g. replace the single codepoint ë by the letter e and the 
combining diaeresis character, i.e. the dots above the ë), case folding, and 
removal of combining characters. That allows users to remove e.g. the 
diaeresis from ë to do pattern matching without diacritics.
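   
   As a rough sketch of these three building blocks, using Python's standard 
`unicodedata` module rather than anything in Arrow. The function names 
`strip_diacritics` and `fold_for_matching` are made up for this example.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose to NFD, then drop combining characters (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_for_matching(text):
    """Remove diacritics, then case fold (hypothetical helper)."""
    return strip_diacritics(text).casefold()


# Normalization: the single codepoint ë (U+00EB) becomes "e" followed by
# the combining diaeresis (U+0308).
composed = "\u00eb"
decomposed = unicodedata.normalize("NFD", composed)

# Pattern matching without diacritics or case sensitivity.
found = "roen" in fold_for_matching("CITRO\u00cbN")
```

   Keeping normalization, case folding, and removal of combining characters 
as separate steps lets a user compose only the ones they need before calling 
an exact-match or regex kernel.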
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
