ChenTsing created ARROW-17301:
---------------------------------
Summary: implement compute function "utf8_slice_charunits"
Key: ARROW-17301
URL: https://issues.apache.org/jira/browse/ARROW-17301
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Affects Versions: 8.0.1
Reporter: ChenTsing
Fix For: 10.0.0
In some situations we need a way to extract one byte, or a run of consecutive
bytes, from each value of a binary or string-like array. For example, start 1,
end 3, step 1 would select two bytes. This is analogous to the
{{binary_replace_slice}} function: slicing would then be available in two
units of measurement, bytes and codeunits.
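
As a rough illustration of the requested semantics, here is a minimal pure-Python sketch. It is not Arrow code: the helper name {{slice_bytes}} and the use of a plain list to stand in for an array are assumptions made for this example, and start/stop are taken to be 0-based with an exclusive stop.

```python
from typing import List, Optional


def slice_bytes(
    values: List[Optional[bytes]], start: int, stop: int, step: int = 1
) -> List[Optional[bytes]]:
    """Slice every non-null value by byte offsets (hypothetical helper)."""
    # Nulls are passed through unchanged; everything else is sliced in bytes.
    return [None if v is None else v[start:stop:step] for v in values]


# A toy "column": an ASCII value, a Chinese UTF-8 value, and a null.
column = [b"abcdef", "中文".encode("utf-8"), None]

# start 1, end 3, step 1 selects exactly two bytes from each non-null value,
# even when those bytes fall in the middle of a multi-byte character.
sliced = slice_bytes(column, 1, 3)
print(sliced)
```

The point of the sketch is only that the result is measured in bytes, regardless of how many characters those bytes happen to encode.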
h1. *Application case:*
Here is one example of why a function that extracts binary data in byte units
is needed.
In a distributed database, data is placed according to a distribution policy,
with a corresponding hash algorithm for each data type. Here we only discuss
string-like and binary types. The hash algorithm needs to split a string-like
or binary value into bytes for its calculation: for example, take bytes 1-4,
cast them to an integer and shift left by 16 bits, then take bytes 5-6, cast
them to an integer and combine it with the result of the previous step, and so
on.
The 'utf8_slice_codeunits' function partly meets this requirement when all
characters are ASCII. But if the string contains Chinese characters, each of
which may occupy three bytes, then slicing from start 1 to end 3 yields three
UTF-8 characters, which may take nine bytes. That does not suit the hash
algorithm, which needs only 3 bytes. So it would help to have a function that
slices in bytes without requiring a cast, taking the same arguments as
'utf8_slice_codeunits'. It might be called 'binary_slice_byteunit'.
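
The mismatch described above can be reproduced with a short pure-Python sketch. Nothing here is Arrow API: {{slice_codepoints}}, {{slice_utf8_bytes}} and {{toy_hash_prefix}} are hypothetical names, Python's own string slicing stands in for character-based slicing, and the choice of big-endian byte order and a bitwise OR to combine the two steps is an assumption, since the description does not specify either.

```python
def slice_codepoints(text: str, start: int, stop: int) -> str:
    """Slice by characters (code points), as Python str slicing does."""
    return text[start:stop]


def slice_utf8_bytes(text: str, start: int, stop: int) -> bytes:
    """Slice the UTF-8 encoding of the text by byte offsets."""
    return text.encode("utf-8")[start:stop]


def toy_hash_prefix(data: bytes) -> int:
    """First two steps of a made-up hash over raw bytes.

    Assumptions: big-endian integers, and OR as the combining operation.
    """
    high = int.from_bytes(data[0:4], "big") << 16  # bytes 1-4, shifted left 16
    low = int.from_bytes(data[4:6], "big")         # bytes 5-6
    return high | low


text = "中文字符"  # each of these characters occupies three bytes in UTF-8

by_chars = slice_codepoints(text, 0, 3)
by_bytes = slice_utf8_bytes(text, 0, 3)

print(len(by_chars.encode("utf-8")))  # 9: three characters, nine bytes
print(len(by_bytes))                  # 3: exactly the bytes the hash wants
print(hex(toy_hash_prefix(text.encode("utf-8"))))  # 0xe4b8ade69687
```

For pure ASCII input both slices return the same bytes, which is why a character-based slice only partly covers the use case.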
--
This message was sent by Atlassian Jira
(v8.20.10#820010)