[ https://issues.apache.org/jira/browse/ARROW-9820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17184756#comment-17184756 ]

Lawrence Chan commented on ARROW-9820:
--------------------------------------

- Language-agnostic - once a storage driver is written/built, _any_ arrow 
library can load it (assuming we've finished implementing the plugin API). So 
rather than needing to add support to each language, I just need to write the 
wrapper once, and then users can use that filesystem in C++, python, go, rust, 
whatever.
- Application-agnostic - if users want to use my storage driver in a downstream 
application, I can distribute a plugin and arrow can load it at runtime without 
needing a special build of that application with my filesystem code.  This 
makes it much easier for users to add storage functionality without recompiling 
the entire world that uses arrow.  You might argue that this could be achieved 
by linking arrow as a shared library, but there are use cases where static 
linking is desirable, or where I don't control the arrow shared library but 
users can obtain my plugin.
- Maintainer-friendly - if I maintain a storage driver plugin, I can version 
control it entirely independently, distribute it separately from the arrow 
library, and keep a simpler build system that doesn't necessarily need to 
integrate with the arrow cmake machinery.  Otherwise, cmake somehow needs to 
know about the extra filesystem implementation and embed it at compile time.
- There are also some functions in the C++ library that have hardcoded string 
comparisons to e.g. "hdfs".  These are not the hardest ones to solve, because 
we could switch them to a lookup in a global mapping that users can register 
factory functions with, but I figured I would mention them anyway (there's a 
rough sketch of that idea after this list).
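
To make the "global mapping" idea a bit more concrete, here's a minimal sketch 
of what such a registry could look like.  To be clear, none of the registry 
names below exist in arrow today - FileSystemFactory, FileSystemRegistry, etc. 
are hypothetical - only arrow::fs::FileSystem, arrow::Result, and arrow::Status 
are real types.

{code:cpp}
// Hypothetical sketch only: a global scheme -> factory mapping that plugins
// (or application code) can register into, replacing hardcoded scheme checks.
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

#include <arrow/filesystem/filesystem.h>
#include <arrow/result.h>
#include <arrow/status.h>

using FileSystemFactory =
    std::function<arrow::Result<std::shared_ptr<arrow::fs::FileSystem>>(
        const std::string& uri)>;

class FileSystemRegistry {
 public:
  static FileSystemRegistry& Instance() {
    static FileSystemRegistry registry;
    return registry;
  }

  // Called by plugins (or user code) to claim a URI scheme.
  void Register(const std::string& scheme, FileSystemFactory factory) {
    std::lock_guard<std::mutex> lock(mutex_);
    factories_[scheme] = std::move(factory);
  }

  // Called wherever the library currently does string comparisons on schemes.
  arrow::Result<std::shared_ptr<arrow::fs::FileSystem>> Make(
      const std::string& scheme, const std::string& uri) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = factories_.find(scheme);
    if (it == factories_.end()) {
      return arrow::Status::KeyError("No filesystem registered for scheme: ",
                                     scheme);
    }
    return it->second(uri);
  }

 private:
  std::mutex mutex_;
  std::unordered_map<std::string, FileSystemFactory> factories_;
};
{code}

Usage would just be something like 
FileSystemRegistry::Instance().Register("mystore", MakeMyStoreFileSystem); at 
plugin load time, and then FileSystemFromUri-style entry points would consult 
the registry instead of comparing against "hdfs"/"s3" literals.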

If you are wondering about the concrete hurdle that prompted this, it's that 
the pyarrow bits are seemingly half wrappers to the C++ lib and half 
implemented in python, with what I _think_ are manually-written Cython wrappers 
around the pieces that need to be visible in python.  For my storage library, I 
don't really want to mess with forking pyarrow and writing Cython wrappers and 
rebuilding pyarrow, and I'd like to just do it once in C/C++ and have it work 
in pyarrow automatically.

I understand the hesitation here, but I think the scary bits can be done 
safely, and I think this will open the doors to a more organized and 
community-driven collection of storage drivers without cluttering the arrow 
codebase.  For some related prior art, this feels to me like a tiny lower-level 
version of CSI plugins.  If we wanted to support the whole universe of drivers 
from within the arrow codebase, it would get pretty bloated.
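
To illustrate what the "C ABI" flavor of runtime plugins mentioned in the 
ticket could look like on the plugin side, here's a rough sketch.  The struct 
layout, the symbol name, and the callback signatures below are all made up for 
illustration - this isn't a proposed interface, just the general shape of 
keeping the host/plugin boundary to plain C types so we sidestep C++ ABI 
issues (name mangling, std::string layout, exceptions) between 
independently-built binaries.

{code:cpp}
// Hypothetical plugin-side sketch.  The host would dlopen() the shared object,
// dlsym() the well-known symbol below, check abi_version, and then register
// the scheme/factory pair in its filesystem registry.
#include <cstdint>

// Plain-C-compatible description of what the plugin provides.
struct ArrowFsPluginInfo {
  uint32_t abi_version;   // bumped whenever this struct or its callbacks change
  const char* scheme;     // URI scheme this plugin handles, e.g. "mystore"
  // Opaque handle in/out so no C++ types cross the boundary.
  void* (*create_filesystem)(const char* uri, char** error_out);
  void (*destroy_filesystem)(void* fs);
};

// Plugin-private implementations behind the opaque void* handle.  A real
// plugin would construct its actual filesystem object here; an int is just a
// stand-in so the sketch compiles.
static void* CreateMyStoreFileSystem(const char* uri, char** error_out) {
  (void)uri;
  *error_out = nullptr;
  return new int(42);
}

static void DestroyMyStoreFileSystem(void* fs) {
  delete static_cast<int*>(fs);
}

// Single well-known entry point the host resolves with dlsym() after dlopen().
extern "C" const ArrowFsPluginInfo* arrow_fs_plugin_get_info() {
  static const ArrowFsPluginInfo info = {
      1,                          // abi_version
      "mystore",                  // scheme
      &CreateMyStoreFileSystem,
      &DestroyMyStoreFileSystem,
  };
  return &info;
}
{code}

The compile-time flavor would be the same registration call, just made from a 
static initializer that the cmake machinery links in instead of a dlopen()'d 
entry point.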

> [C++] Plugin Architecture for Filesystem and File IO
> ----------------------------------------------------
>
>                 Key: ARROW-9820
>                 URL: https://issues.apache.org/jira/browse/ARROW-9820
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Lawrence Chan
>            Priority: Minor
>
> Adding a new custom filesystem with corresponding file i/o streams is quite a 
> process at the moment.  Looks like HDFS and S3FS are basically hardcoded in 
> many places.  It would be useful to develop a plugin system to allow users to 
> interface with other data stores without maintaining a permanent fork with 
> hardcoded changes.
> We can either do runtime plugins or compile-time plugins.  Runtime is more 
> user-friendly, but with C++, ABI compatibility is fairly delicate.  So we 
> would either want to use a C ABI or accept a you're-on-your-own situation 
> where the user is expected to be very careful with versioning and compiler 
> flags.
> With compile-time plugins, maybe there's a way to have the cmake machinery 
> build third party code and also register those new URI schemes automatically.



