Hi Andrew, Thanks for the reply!

I did consider exactly that first: starting by handling it only at the
application level. But that is a no-go for us if we migrate to Arrow (from
our own type system), because it removes a lot of the benefits, such as
the built-in CSV writer, Parquet support, and a bunch of other things that
we would then need to implement on our own. It would also create a
suboptimal experience (worse than the current one we have, hence we can't
migrate) for us and for anyone building CloudQuery plugins with our SDK.

I already created a PR <https://github.com/apache/arrow/pull/34454> for the
Go implementation, with an example of how we intend to use it
<https://github.com/cloudquery/filetypes/tree/main/internal/cqarrow>.

I already got great feedback from Matt Topol. Any more feedback and ideas
are welcome. If this abstraction works well, I think other languages might
benefit from it (though right now we only use Go).
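
To make the builder problem concrete, here is a minimal sketch using only
the Go standard library. It is not the Arrow API and not the code in the PR:
`UUID`, `BinaryBuilder`, `UUIDBuilder` and `StorageJSON` are hypothetical
stand-ins, and only the helper name `UUIDToBinary` comes from this thread. It
assumes the extension stores a UUID as 16 raw bytes.

```go
package main

import (
	"encoding/hex"
	"encoding/json"
)

// UUID is a hypothetical 16-byte value type used only for this sketch.
type UUID [16]byte

// String renders the UUID in the usual 8-4-4-4-12 hex form.
func (u UUID) String() string {
	h := hex.EncodeToString(u[:])
	return h[0:8] + "-" + h[8:12] + "-" + h[12:16] + "-" + h[16:20] + "-" + h[20:32]
}

// BinaryBuilder is a stand-in for a storage builder that only understands
// raw bytes. It is NOT the Arrow builder type.
type BinaryBuilder struct{ values [][]byte }

// Append stores one raw binary value.
func (b *BinaryBuilder) Append(v []byte) { b.values = append(b.values, v) }

// UUIDToBinary is the kind of helper callers need when they can only reach
// the storage builder: they must know how the extension lays out its bytes.
func UUIDToBinary(u UUID) []byte { return u[:] }

// StorageJSON marshals the raw storage. encoding/json encodes []byte as
// base64, so the UUIDs come out unreadable.
func StorageJSON(b *BinaryBuilder) (string, error) {
	out, err := json.Marshal(b.values)
	return string(out), err
}

// UUIDBuilder is a hypothetical extension-owned builder. It hides the
// storage layout and controls how values are marshalled.
type UUIDBuilder struct{ storage BinaryBuilder }

// Append accepts the logical type; the byte layout stays an internal detail.
func (b *UUIDBuilder) Append(u UUID) { b.storage.Append(UUIDToBinary(u)) }

// MarshalJSON emits each UUID in its canonical text form instead of base64.
func (b *UUIDBuilder) MarshalJSON() ([]byte, error) {
	out := make([]string, len(b.storage.values))
	for i, raw := range b.storage.values {
		var u UUID
		copy(u[:], raw)
		out[i] = u.String()
	}
	return json.Marshal(out)
}
```

With the storage-only path the caller writes
`raw.Append(UUIDToBinary(u))` and `StorageJSON(raw)` yields base64 strings.
With the extension-owned builder the caller writes `ext.Append(u)` and
`json.Marshal(ext)` yields the readable UUID strings.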

On Mon, Mar 6, 2023 at 2:08 PM Andrew Lamb <al...@influxdata.com> wrote:

> Hi Yevgeny,
>
> It is great you are thinking of using Arrow.
>
> > - The problems are around the abstraction for the extension types. While
> > I understand that the underlying storage needs to be supported in the
> > library we don't have a way for extensions to provide its own builder
> > which means the user needs to know how the extension type stores the
> > type inside the binary. This creates a leaky abstraction and the need
> > for various helper functions like `UUIDToBinary`
>
> I don't have anything specific to offer in terms of the Go implementation.
>
> However, in terms of helping define a better abstraction, one way you might
> proceed is to forgo using the library support for extension types and
> implement support for your custom types yourself in your application code.
> Once you have figured out the most useful APIs, then perhaps you could
> propose contributing them to the arrow Go implementation.
>
> Andrew
>
> On Fri, Mar 3, 2023 at 5:54 AM Yevgeny Pats <y...@cloudquery.io> wrote:
>
> > Hey folks,
> >
> > Hopefully this is the right place to ask. As some background I'm Yevgeny
> > Pats <https://www.linkedin.com/in/yevgeny-pats-5973328b/>, Founder @
> > CloudQuery <https://github.com/cloudquery/cloudquery> . We are very
> > interested in migrating our protocol and Go type system to Apache Arrow.
> > Extensions are a critical part for us and thus I've the following
> > questions on whether it's a usage problem on my end or something that is
> > not yet available. I'll give here an example for Go but I believe the
> > same issue exists in all libraries/languages.
> >
> > Here is a public github gist
> > <https://gist.github.com/yevgenypats/6969e8e598161fc2021612c780bba3eb>.
> >
> > What are the problems:
> >
> > - The problems are around the abstraction for the extension types. While
> > I understand that the underlying storage needs to be supported in the
> > library we don't have a way for extensions to provide its own builder
> > which means the user needs to know how the extension type stores the
> > type inside the binary. This creates a leaky abstraction and the need
> > for various helper functions like `UUIDToBinary`
> > - The other way is fine as you can have methods like ToUUID on top of the
> > extension array. But this creates asymmetry in the abstraction.
> > - Because we don't control the builder for extensions this creeps into
> > other places like json
> > <https://github.com/apache/arrow/issues/34292#issuecomment-1446653210>
> > and csv where we can't control marshalling (in the same way we control
> > all other built-in types). So basically for extensions that use binary
> > type as underlying storage in case of json and csv those will always be
> > encoded as base64 which is not very useful (think about uuid, ip
> > address, mac address).
> >
> > The main point is that I think the right abstraction for extensions
> > should provide all the apis (type, array, builder) just like built-in
> > types, otherwise the abstraction is incomplete or "leaky". Of course we
> > can still have limitations like the custom builder must use an
> > underlying known storage (for it to work over ipc) but it can still
> > control various other aspects like marshaling, unmarshaling, building,
> > and so on.
> >
> > Hopefully this gives enough context but I would love to elaborate.
> >
> > Thanks,
> > Yevgeny
> >
>
