Hi Antoine,

> > Yes, that is our plan. Since this is going to be done on the
> > storage-/server-side, this would be transparent to the client. So our main
> > concern is whether this would be OK from the design perspective, and
> > could this eventually be merged upstream?
>
> Arrow datasets have no notion of client and server, so I'm not sure what
> you mean here.


Sorry for the confusion. This is where we see a mismatch between the
current design and what we are trying to achieve.

Our goal is to push down computations in a cloud storage system. By pushing
we mean actually sending computation tasks to storage nodes (e.g. filters
executing on the storage nodes themselves). Ideally this would be done by
implementing a new plugin for arrow::fs, but as far as we can tell, the
filesystem layer is unaware of expressions, record batches, etc., so this
information cannot be communicated down to storage.

So what we thought would work is to implement this at the Dataset API
level: a scanner (and writer) that defers these operations to storage
nodes. For example, the RadosScanTask class would ask a storage node to
perform the scan and return the result, rather than performing the scan
locally.

We would immensely appreciate it if you could let us know whether the above
is OK, or whether you think there is a better alternative for accomplishing
this, as we would rather implement this functionality in a way that is
compatible with your overall vision.


> Do you simply mean contributing RadosFormat to the Arrow
> codebase?


Yes, so that others wanting to achieve this on a Ceph cluster could
leverage this as well.


> I would say that depends on the required dependencies, and
> ease of testing (and/or CI) for other developers.


OK, yes we will pay attention to these aspects as part of an eventual PR.
We will include tests and ensure that CI covers the changes we introduce.

Thanks!
