Hey folks,

Hopefully this is the right place to ask. For background, I'm Yevgeny
Pats <https://www.linkedin.com/in/yevgeny-pats-5973328b/>, Founder @
CloudQuery <https://github.com/cloudquery/cloudquery>. We are very
interested in migrating our protocol and Go type system to Apache Arrow.
Extension types are a critical part of that for us, so I have some
questions about whether the following is a usage problem on my end or
something that is not yet available. My example is in Go, but I believe
the same issue exists in all libraries/languages.

Here is a public github gist
<https://gist.github.com/yevgenypats/6969e8e598161fc2021612c780bba3eb>.

The problems are:

- The problems are around the abstraction for extension types. While I
understand that the underlying storage needs to be supported in the library,
we don't have a way for an extension to provide its own builder, which means
the user needs to know how the extension type stores its values in the
underlying binary storage. This creates a leaky abstraction and the need
for various helper functions like `UUIDToBinary`.
- The other direction is fine, as you can have methods like ToUUID on top of
the extension array. But this creates asymmetry in the abstraction.
- Because we don't control the builder for extensions, this creeps into
other places like json
<https://github.com/apache/arrow/issues/34292#issuecomment-1446653210> and
csv, where we can't control marshalling (in the same way we control it for
all the built-in types). So for extensions that use a binary type as
underlying storage, json and csv values will always be encoded as base64,
which is not very useful (think of a uuid, ip address, or mac address).
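
To make the base64 point concrete, here is a minimal sketch that uses only
Go's standard library and not the Arrow Go API. The `UUID` type and the two
helper functions are hypothetical names made up for this illustration. It
relies on the fact that `encoding/json` encodes a `[]byte` as a base64
string unless the type supplies its own `MarshalJSON`.

```go
package main

import (
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// UUID is a hypothetical stand-in for an extension type whose storage
// is 16 bytes of binary data.
type UUID [16]byte

// String renders the canonical 8-4-4-4-12 hex form.
func (u UUID) String() string {
	h := hex.EncodeToString(u[:])
	return h[0:8] + "-" + h[8:12] + "-" + h[12:16] + "-" + h[16:20] + "-" + h[20:32]
}

// MarshalJSON is the kind of hook an extension needs in order to
// control its own JSON representation.
func (u UUID) MarshalJSON() ([]byte, error) {
	return json.Marshal(u.String())
}

// encodeStorage mimics an encoder that only knows the storage type:
// the raw bytes are marshalled, so the result is base64.
func encodeStorage(u UUID) (string, error) {
	b, err := json.Marshal(u[:])
	return string(b), err
}

// encodeExtension lets the extension type decide how it is encoded.
func encodeExtension(u UUID) (string, error) {
	b, err := json.Marshal(u)
	return string(b), err
}

func main() {
	u := UUID{0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
		0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f}

	s, err := encodeStorage(u)
	if err != nil {
		panic(err)
	}
	fmt.Println("storage only:", s) // "AAECAwQFBgcICQoLDA0ODw=="

	e, err := encodeExtension(u)
	if err != nil {
		panic(err)
	}
	fmt.Println("extension:   ", e) // "00010203-0405-0607-0809-0a0b0c0d0e0f"
}
```

The first form is what a reader of the json or csv output gets today for a
binary-backed extension; the second is what they would expect to see.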

The main point is that I think the right abstraction for extensions should
provide all the APIs (type, array, builder) just like built-in types;
otherwise the abstraction is incomplete or "leaky". Of course we can still
have limitations, e.g. the custom builder must use a known underlying
storage (for it to work over IPC), but it can still control various other
things like marshaling, unmarshaling, building, and so on.
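
As a rough sketch of the shape I have in mind (again standard library only,
and every name here is hypothetical rather than an existing Arrow API):
`storageBuilder` stands in for a library-provided builder of the known
storage type, and `UUIDBuilder` is the extension-provided builder that
accepts the logical value and hides the binary layout from the user.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"
)

// storageBuilder is a hypothetical stand-in for a library-provided
// builder of the underlying storage type (16-byte binary values).
type storageBuilder struct {
	values [][]byte
}

// Append adds one raw storage value.
func (b *storageBuilder) Append(v []byte) {
	b.values = append(b.values, v)
}

// UUIDBuilder is a hypothetical extension-provided builder. It still
// writes to known storage, but callers only deal with UUID text.
type UUIDBuilder struct {
	storage storageBuilder
}

// AppendString parses UUID text and appends the 16 storage bytes.
// Parsing is deliberately lenient: dashes are stripped wherever they
// appear, and only the hex content and final length are checked.
func (b *UUIDBuilder) AppendString(s string) error {
	raw, err := hex.DecodeString(strings.ReplaceAll(s, "-", ""))
	if err != nil {
		return fmt.Errorf("invalid uuid %q: %w", s, err)
	}
	if len(raw) != 16 {
		return fmt.Errorf("invalid uuid %q: got %d bytes, want 16", s, len(raw))
	}
	b.storage.Append(raw)
	return nil
}

// Len reports how many values have been appended.
func (b *UUIDBuilder) Len() int {
	return len(b.storage.values)
}

func main() {
	b := &UUIDBuilder{}
	if err := b.AppendString("00010203-0405-0607-0809-0a0b0c0d0e0f"); err != nil {
		panic(err)
	}
	// The caller never had to convert the UUID to binary by hand.
	fmt.Println("values appended:", b.Len())
}
```

With something like this the user never needs a helper such as
`UUIDToBinary`, and the same extension could own the json and csv
representation as well.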

Hopefully this gives enough context, but I'd be happy to elaborate.

Thanks,
Yevgeny
