shenker opened a new issue, #39183: URL: https://github.com/apache/arrow/issues/39183
### Describe the enhancement requested

Continuing discussion from https://github.com/apache/arrow/issues/15058. Relevant after @rok's PR https://github.com/apache/arrow/pull/37298 lands.

Two ways of doing this come to mind. Thoughts?

**Option 1: Allow string-UUID casting**

```python
import pyarrow as pa

str_ary = pa.array(["bf1e9e1f-feab-4230-8a37-0240ccbefe8a", "189a6934-99ef-4a70-b10f-cbb5ef178373"], pa.string())
uuid_ary = str_ary.cast(pa.uuid())
str_ary_roundtrip = uuid_ary.cast(pa.string())
```

The canonical string representation of UUIDs contains `-`'s but it's not unusual to see them omitted, so my proposal would be to handle the cases where string is length 36 (`-`'s included), string is length 32 (no `-`'s), and error if string is in any other format. For the rare cases where strings have whitespace/other delimiters, it should be left up to the user to use string operations to convert them into one of the two accepted formats.

For casting UUIDs back to strings, I'm not sure if there's a way (or if it's important enough to bother with) letting the user specify which of those two formats they prefer, so I'd propose UUIDs cast to strings should include the `-`'s. Or a flag could be added to `CastOptions`

**Option 2: Implement general hex-encoding and -decoding functions**

Here we implement the general operation of casting hex-encoded strings to binary data and vice-versa.
```python
import pyarrow as pa
import pyarrow.compute as pc

str_ary = pa.array(["bf1e9e1f-feab-4230-8a37-0240ccbefe8a", "189a6934-99ef-4a70-b10f-cbb5ef178373"], pa.string())

# ignore_chars is a string of characters to silently skip when parsing hex-encoded strings,
# raise error if we see any unexpected characters
bin_ary = pc.decode_hex(str_ary, ignore_chars="-")  # will be type pa.binary(), variable-length binary
bin_fixed_length_ary = bin_ary.cast(pa.binary(16))  # not sure if this should be required or not
uuid_ary = bin_fixed_length_ary.cast(pa.uuid())
str_ary_nodashes = pc.encode_hex(uuid_ary.cast(pa.binary(16)))  # -> pa.string()
```

To get the final UUID string with `-`'s from `str_ary_nodashes`, you could do that with existing string operations, but it might be better to just have a convenience function `pc.encode_uuid` that does the hex encoding and adds dashes at the same time:

```python
str_ary_roundtrip = pc.encode_uuid(uuid_ary.cast(pa.binary(16)))  # -> pa.string()
```

### Component(s)

C++, Python

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
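
As a reference for the parsing rule proposed in Option 1 (accept 36-character strings with `-`'s, accept 32-character strings without, error on anything else), here is a rough pure-Python sketch of the per-value semantics. It uses only the standard library and is not how Arrow would implement the kernel. The helper names `parse_uuid_string` and `format_uuid_bytes` and the `dashes` flag are made up for illustration and are not existing pyarrow APIs.

```python
import string

# Positions of the '-' separators in the canonical 8-4-4-4-12 layout.
_DASH_POSITIONS = (8, 13, 18, 23)
_HEX_DIGITS = set(string.hexdigits)


def parse_uuid_string(text: str) -> bytes:
    """Hypothetical helper: turn a UUID string into its 16 raw bytes."""
    if len(text) == 36:
        # Dashed form: separators must sit exactly at the canonical positions.
        if any(text[i] != "-" for i in _DASH_POSITIONS):
            raise ValueError(f"misplaced dashes in UUID string: {text!r}")
        hex_digits = text.replace("-", "")
    elif len(text) == 32:
        # Undashed form: the whole string must be hex digits.
        hex_digits = text
    else:
        raise ValueError(f"UUID string must have length 32 or 36: {text!r}")

    # Check explicitly: bytes.fromhex() would silently skip whitespace.
    if len(hex_digits) != 32 or any(c not in _HEX_DIGITS for c in hex_digits):
        raise ValueError(f"invalid characters in UUID string: {text!r}")
    return bytes.fromhex(hex_digits)


def format_uuid_bytes(raw: bytes, dashes: bool = True) -> str:
    """Hypothetical helper: render 16 raw bytes as a lowercase UUID string."""
    if len(raw) != 16:
        raise ValueError("a UUID is exactly 16 bytes")
    h = raw.hex()
    if not dashes:
        return h
    return "-".join((h[:8], h[8:12], h[12:16], h[16:20], h[20:]))
```

Both accepted input forms map to the same 16 bytes, and the `dashes` argument stands in for whatever `CastOptions` flag (or separate `encode_hex` / `encode_uuid` split) ends up selecting the output format.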
