shenker opened a new issue, #39183:
URL: https://github.com/apache/arrow/issues/39183

   ### Describe the enhancement requested
   
   Continuing discussion from https://github.com/apache/arrow/issues/15058. 
Relevant after @rok's PR https://github.com/apache/arrow/pull/37298 lands.
   
   Two ways of converting between strings and the UUID type come to mind. Thoughts?
   
   **Option 1: Allow string-UUID casting** 
   ```python
   import pyarrow as pa
   str_ary = pa.array(
       ["bf1e9e1f-feab-4230-8a37-0240ccbefe8a", "189a6934-99ef-4a70-b10f-cbb5ef178373"],
       pa.string(),
   )
   uuid_ary = str_ary.cast(pa.uuid())
   str_ary_roundtrip = uuid_ary.cast(pa.string())
   ```
   
   The canonical string representation of a UUID contains `-`'s, but it's not
   unusual to see them omitted. My proposal is to accept strings of length 36
   (`-`'s included) and strings of length 32 (no `-`'s), and to raise an error
   for any other format. In the rare cases where strings contain whitespace or
   other delimiters, it should be left to the user to convert them into one of
   the two accepted formats with string operations.
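
To make the proposed acceptance rules concrete, here is a minimal plain-Python sketch of the per-value check. The function name `parse_uuid_string` is hypothetical and this is not Arrow code; it also assumes that in the 36-character form the dashes must sit at the canonical 8-4-4-4-12 positions, which the proposal above does not state explicitly.

```python
import string

# Indices of the dashes in the canonical 8-4-4-4-12 layout (assumption).
_DASH_POSITIONS = (8, 13, 18, 23)
_HEX_DIGITS = set(string.hexdigits)


def parse_uuid_string(text: str) -> bytes:
    """Return the 16 raw bytes for a 36-char (dashed) or 32-char (undashed) UUID string."""
    if len(text) == 36:
        # Every canonical dash position must actually hold a dash.
        if any(text[i] != "-" for i in _DASH_POSITIONS):
            raise ValueError(f"misplaced or missing dashes: {text!r}")
        hex_part = "".join(ch for i, ch in enumerate(text) if i not in _DASH_POSITIONS)
    elif len(text) == 32:
        hex_part = text
    else:
        raise ValueError(f"expected 32 or 36 characters, got {len(text)}: {text!r}")
    # bytes.fromhex() skips ASCII whitespace, so validate the digits explicitly first.
    if not all(ch in _HEX_DIGITS for ch in hex_part):
        raise ValueError(f"non-hexadecimal character in {text!r}")
    return bytes.fromhex(hex_part)
```

Both accepted spellings map to the same 16 bytes, and anything else (wrong length, other delimiters, embedded whitespace) raises `ValueError`, so cleaning up unusual inputs stays the caller's job.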
   
   For casting UUIDs back to strings, I'm not sure whether there's a way to let
   the user specify which of those two formats they prefer (or whether it's
   important enough to bother with), so I'd propose that UUIDs cast to strings
   include the `-`'s. Alternatively, a flag could be added to `CastOptions`.
   
   **Option 2: Implement general hex-encoding and -decoding functions**
   
   Here we implement the general operation of casting hex-encoded strings to 
binary data and vice-versa.
   ```python
   import pyarrow as pa
   import pyarrow.compute as pc
   str_ary = pa.array(
       ["bf1e9e1f-feab-4230-8a37-0240ccbefe8a", "189a6934-99ef-4a70-b10f-cbb5ef178373"],
       pa.string(),
   )
   # ignore_chars is a string of characters to silently skip when parsing
   # hex-encoded strings; raise an error if we see any other unexpected characters
   bin_ary = pc.decode_hex(str_ary, ignore_chars="-")  # will be type pa.binary(), variable-length binary
   bin_fixed_length_ary = bin_ary.cast(pa.binary(16))  # not sure if this should be required or not
   uuid_ary = bin_fixed_length_ary.cast(pa.uuid())
   str_ary_nodashes = pc.encode_hex(uuid_ary.cast(pa.binary(16)))  # -> pa.string()
   ```
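
The intended `ignore_chars` behaviour can be sketched for a single value in plain Python. This `decode_hex` is only a stand-in for the proposed compute function, not an existing PyArrow API, and rejecting an odd number of hex digits is an assumption of the sketch.

```python
def decode_hex(text: str, ignore_chars: str = "") -> bytes:
    """Decode a hex string to bytes, skipping ignore_chars and rejecting anything else."""
    hex_digits = "0123456789abcdefABCDEF"
    kept = []
    for ch in text:
        if ch in ignore_chars:
            continue  # explicitly allowed separator, silently skipped
        if ch not in hex_digits:
            raise ValueError(f"unexpected character {ch!r} in {text!r}")
        kept.append(ch)
    if len(kept) % 2:
        # Assumption: half a byte is treated as an error rather than padded.
        raise ValueError(f"odd number of hex digits in {text!r}")
    return bytes.fromhex("".join(kept))


raw = decode_hex("bf1e9e1f-feab-4230-8a37-0240ccbefe8a", ignore_chars="-")
assert len(raw) == 16  # the length check that the cast to pa.binary(16) would perform
```

Unlike Option 1, this accepts the ignored characters at any position, which is why a separate fixed-length check is still needed before treating the result as a UUID.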
   
   You could get the final UUID string with `-`'s from `str_ary_nodashes` using
   existing string operations, but it might be better to have a convenience
   function `pc.encode_uuid` that does the hex encoding and adds the dashes at
   the same time:
   ```python
   str_ary_roundtrip = pc.encode_uuid(uuid_ary.cast(pa.binary(16)))  # -> pa.string()
   ```
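
For a single value, the work such a convenience function would do can be sketched in plain Python. This `encode_uuid` is a hypothetical stand-in, not an existing PyArrow function, and the `dashes` parameter is an assumption that illustrates the kind of flag mentioned under Option 1.

```python
def encode_uuid(raw: bytes, dashes: bool = True) -> str:
    """Hex-encode 16 bytes, optionally grouping the digits as 8-4-4-4-12."""
    if len(raw) != 16:
        raise ValueError(f"expected 16 bytes, got {len(raw)}")
    hex_str = raw.hex()  # 32 lowercase hex digits
    if not dashes:
        return hex_str
    # Slice boundaries of the five canonical groups.
    bounds = (0, 8, 12, 16, 20, 32)
    return "-".join(hex_str[a:b] for a, b in zip(bounds, bounds[1:]))


raw = bytes.fromhex("bf1e9e1ffeab42308a370240ccbefe8a")
assert encode_uuid(raw) == "bf1e9e1f-feab-4230-8a37-0240ccbefe8a"
```

The sketch always emits lowercase digits; whether the real function should preserve or normalise case is left open here.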
   
   
   
   ### Component(s)
   
   C++, Python

