As far as I know, there is already an implementation of a tensor type in
C++/Python. Should we just finalize the spec and add an implementation in Java?
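For reference, the existing C++/Python tensor is essentially a shape, optional
strides, and one flat contiguous buffer. Here's a stdlib-only toy sketch of that
layout (illustrative only, not the actual pyarrow API; the class and function
names are made up):

```python
from array import array
from math import prod

def row_major_strides(shape, itemsize):
    """Row-major (C-order) strides in bytes, the default when no
    strides are supplied alongside the shape."""
    strides = []
    acc = itemsize
    for dim in reversed(shape):
        strides.append(acc)
        acc *= dim
    return list(reversed(strides))

class DenseTensor:
    """Toy stand-in for a tensor value: shape + flat buffer of doubles."""
    def __init__(self, shape, values):
        assert len(values) == prod(shape)
        self.shape = tuple(shape)
        self.data = array("d", values)  # one contiguous buffer
        self.strides = row_major_strides(self.shape, self.data.itemsize)

    def __getitem__(self, index):
        # byte offset = sum(i_k * stride_k); divide by itemsize
        # to get the element index into the flat buffer
        offset = sum(i * s for i, s in zip(index, self.strides))
        return self.data[offset // self.data.itemsize]

t = DenseTensor((2, 3), [1, 2, 3, 4, 5, 6])
t[(1, 2)]  # -> 6.0
```

The point is just that a fixed shape plus one flat buffer is enough to
reconstruct an ndarray with zero per-element work, which is what we'd want the
column type to preserve.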
On the Spark side it's probably more complicated, as Vector and Matrix are
not "first class" types in Spark SQL. Spark ML implements them as UDTs
(user-defined types), so it's not clear how to make the Spark/Arrow converter
work with them.
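For context on why the converter is awkward: as I understand it, VectorUDT's
underlying SQL type is a struct of (type, size, indices, values) that covers
both the dense and sparse cases, so a generic converter just sees an opaque
struct column. A rough stdlib sketch of that encoding (field names follow my
reading of Spark's VectorUDT; treat the helpers as illustrative):

```python
def encode_dense(values):
    # dense vector: type=1; size/indices are unused (null in Spark)
    return {"type": 1, "size": None, "indices": None, "values": list(values)}

def encode_sparse(size, indices, values):
    # sparse vector: type=0; parallel index/value arrays
    return {"type": 0, "size": size,
            "indices": list(indices), "values": list(values)}

def to_dense_list(row):
    """Expand either encoding back into a plain dense list of doubles."""
    if row["type"] == 1:
        return list(row["values"])
    out = [0.0] * row["size"]
    for i, v in zip(row["indices"], row["values"]):
        out[i] = v
    return out

to_dense_list(encode_sparse(4, [1, 3], [2.0, 5.0]))  # -> [0.0, 2.0, 0.0, 5.0]
```

So any Spark/Arrow mapping for these types has to special-case this struct
layout rather than getting it from the schema alone.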
I wonder if Bryan and Holden have some more thoughts on that?
On Mon, Apr 9, 2018 at 5:22 PM, Leif Walsh <leif.wa...@gmail.com> wrote:
> Hi all,
> I’ve been doing some work lately with Spark’s ML interfaces, which include
> sparse and dense Vector and Matrix types, backed on the Scala side by
> Breeze. Using these interfaces, you can construct DataFrames whose column
> types are vectors and matrices, and though the API isn’t terribly rich, it
> is possible to run Python UDFs over such a DataFrame and get numpy ndarrays
> out of each row. However, if you’re using Spark’s Arrow serialization
> between the executor and Python workers, you get this
> I think it would be useful for Arrow to support something like a column of
> tensors, and I’d like to see if anyone else here is interested in such a
> thing. If so, I’d like to propose adding it to the spec and getting it
> implemented in at least Java and C++/Python.
> Some initial mildly-scattered thoughts:
> 1. You can certainly represent these today as List<Double> and
> List<List<Double>>, but then you need to do some copying to get them back
> into numpy ndarrays.
> 2. In some cases it might be useful to know that a column contains 3x3x4
> tensors, for example, and not just that there are three dimensions as you’d
> get with List<List<List<Double>>>. This could constrain what operations
> are meaningful (for example, in Spark you could imagine type checking that
> verifies dimension alignment for matrix multiplication).
> 3. You could approximate that with a FixedSizeList and metadata about the
> tensor shape.
> 4. But I kind of feel like this is generally useful enough that it’s worth
> having one implementation of it (well, one for each runtime) in Arrow.
> 5. Or, maybe everyone here thinks Spark should just do this with metadata?
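To make point 3 above concrete, here's one way the FixedSizeList-plus-metadata
workaround could look, sketched with the stdlib rather than real Arrow types
(the class and the "tensor_shape" metadata key are hypothetical): a flat values
buffer chunked into fixed-size rows, with the tensor shape carried as column
metadata and checked on append, which also gets you the dimension checking from
point 2.

```python
from array import array
from math import prod

class FixedShapeTensorColumn:
    """Hypothetical column: FixedSizeList(list_size=prod(shape)) of doubles,
    plus {"tensor_shape": shape} carried as column metadata."""
    def __init__(self, shape):
        self.metadata = {"tensor_shape": tuple(shape)}
        self.list_size = prod(shape)
        self.values = array("d")  # one flat buffer, rows packed back to back

    def append(self, flat_row):
        if len(flat_row) != self.list_size:
            raise ValueError("row does not match tensor shape")  # point 2's check
        self.values.extend(flat_row)

    def row(self, i):
        """Slice out row i and re-nest it per the shape metadata."""
        flat = list(self.values[i * self.list_size:(i + 1) * self.list_size])

        def nest(data, shape):
            if len(shape) == 1:
                return list(data)
            step = prod(shape[1:])
            return [nest(data[k * step:(k + 1) * step], shape[1:])
                    for k in range(len(data) // step)]

        return nest(flat, self.metadata["tensor_shape"])

col = FixedShapeTensorColumn((2, 2))
col.append([1, 2, 3, 4])
col.append([5, 6, 7, 8])
col.row(1)  # -> [[5.0, 6.0], [7.0, 8.0]]
```

This works, but every consumer has to agree on the metadata convention, which
is the argument in point 4 for putting one implementation in Arrow itself.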
> Curious to hear what you all think.