Re: Tensor column types in arrow

Wes McKinney Mon, 09 Apr 2018 14:37:20 -0700

> As far as I know, there is an implementation of tensor type in C++/Python 
> already. Should we just finalize the spec and add implementation to Java?


There is nothing specified yet as far as a *column* of
ndarrays/tensors. We defined Tensor metadata for the purposes of
IPC/serialization but made no effort to incorporate such data into the
columnar format.

There are likely many ways to implement column whose values are
ndarrays, each cell with its own shape. Whether we would want to
permit each cell to have a different ndarray cell type is another
question (i.e. would we want to constrain every cell in a column to
contain ndarrays of a particular type, like float64)

So there's a couple of questions

* How to represent the data using the columnar format
* How to incorporate ndarray metadata into columnar schemas

- Wes

On Mon, Apr 9, 2018 at 5:30 PM, Li Jin <[email protected]> wrote:
> As far as I know, there is an implementation of tensor type in C++/Python
> already. Should we just finalize the spec and add implementation to Java?
>
> On the Spark side, it's probably more complicated as Vector and Matrix are
> not "first class" types in Spark SQL. Spark ML implements them as UDT
> (user-defined types) so it's not clear how to make Spark/Arrow converter
> work with them.
>
> I wonder if Bryan and Holden have some more thoughts on that?
>
> Li
>
> On Mon, Apr 9, 2018 at 5:22 PM, Leif Walsh <[email protected]> wrote:
>
>> Hi all,
>>
>> I’ve been doing some work lately with Spark’s ML interfaces, which include
>> sparse and dense Vector and Matrix types, backed on the Scala side by
>> Breeze. Using these interfaces, you can construct DataFrames whose column
>> types are vectors and matrices, and though the API isn’t terribly rich, it
>> is possible to run Python UDFs over such a DataFrame and get numpy ndarrays
>> out of each row. However, if you’re using Spark’s Arrow serialization
>> between the executor and Python workers, you get this
>> UnsupportedOperationException:
>> https://github.com/apache/spark/blob/252468a744b95082400ba9e8b2e3b3
>> d9d50ab7fa/sql/core/src/main/scala/org/apache/spark/sql/
>> execution/arrow/ArrowWriter.scala#L71
>>
>> I think it would be useful for Arrow to support something like a column of
>> tensors, and I’d like to see if anyone else here is interested in such a
>> thing.  If so, I’d like to propose adding it to the spec and getting it
>> implemented in at least Java and C++/Python.
>>
>> Some initial mildly-scattered thoughts:
>>
>> 1. You can certainly represent these today as List<Double> and
>> List<List<Double>>, but then need to do some copying to get them back into
>> numpy ndarrays.
>>
>> 2. In some cases it might be useful to know that a column contains 3x3x4
>> tensors, for example, and not just that there are three dimensions as you’d
>> get with List<List<List<Double>>>.  This could constrain what operations
>> are meaningful (for example, in Spark you could imagine type checking that
>> verifies dimension alignment for matrix multiplication).
>>
>> 3. You could approximate that with a FixedSizeList and metadata about the
>> tensor shape.
>>
>> 4. But I kind of feel like this is generally useful enough that it’s worth
>> having one implementation of it (well, one for each runtime) in Arrow.
>>
>> 5. Or, maybe everyone here thinks Spark should just do this with metadata?
>>
>> Curious to hear what you all think.
>>
>> Thanks,
>> Leif
>>
>> --
>> --
>> Cheers,
>> Leif
>>

Re: Tensor column types in arrow

Reply via email to