My gut feeling is that such a column type should specify both the shape and primitive type of all values in the column. I can’t think of a common use case that requires differently shaped tensors in a single column.
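To make that concrete, here is a rough stdlib-Python sketch of the kind of column type I have in mind. `TensorColumnType` and `TensorColumn` are made-up names for illustration, not anything in Arrow; the point is just that the type fixes both the primitive type and the shape, so cells can live in one contiguous buffer like a FixedSizeList of primitives:

```python
from array import array
from dataclasses import dataclass

# Illustrative sketch only -- these names are hypothetical, not Arrow APIs.
# The column type fixes both the primitive type and the shape, so every
# cell is e.g. a 2x2 float64 tensor stored contiguously.

@dataclass(frozen=True)
class TensorColumnType:
    primitive: str    # array-module typecode, e.g. 'd' for float64
    shape: tuple      # fixed shape shared by every cell, e.g. (2, 2)

    def cell_size(self) -> int:
        n = 1
        for dim in self.shape:
            n *= dim
        return n

class TensorColumn:
    def __init__(self, ctype: TensorColumnType):
        self.ctype = ctype
        self.buf = array(ctype.primitive)   # one flat, contiguous buffer

    def append(self, flat_values):
        # Every cell must match the column's fixed shape exactly.
        if len(flat_values) != self.ctype.cell_size():
            raise ValueError("cell does not match the column's fixed shape")
        self.buf.extend(flat_values)

    def __len__(self):
        return len(self.buf) // self.ctype.cell_size()

col = TensorColumn(TensorColumnType('d', (2, 2)))
col.append([1.0, 2.0, 3.0, 4.0])   # one 2x2 cell
```

A ragged-tensor column would need per-cell shapes instead of one column-level shape, which is exactly the extra complexity I'm suggesting we avoid.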
Can anyone here come up with such a use case? If not, I can try to draft a
proposed change to the spec that adds these types. The next question is
whether such a change can make it in (with C++ and Java implementations)
before 1.0.

On Mon, Apr 9, 2018 at 17:36 Wes McKinney <[email protected]> wrote:

> > As far as I know, there is an implementation of tensor type in
> > C++/Python already. Should we just finalize the spec and add
> > implementation to Java?
>
> There is nothing specified yet as far as a *column* of ndarrays/tensors.
> We defined Tensor metadata for the purposes of IPC/serialization but made
> no effort to incorporate such data into the columnar format.
>
> There are likely many ways to implement a column whose values are
> ndarrays, each cell with its own shape. Whether we would want to permit
> each cell to have a different ndarray cell type is another question
> (i.e. would we want to constrain every cell in a column to contain
> ndarrays of a particular type, like float64).
>
> So there are a couple of questions:
>
> * How to represent the data using the columnar format
> * How to incorporate ndarray metadata into columnar schemas
>
> - Wes
>
> On Mon, Apr 9, 2018 at 5:30 PM, Li Jin <[email protected]> wrote:
> > As far as I know, there is an implementation of tensor type in
> > C++/Python already. Should we just finalize the spec and add
> > implementation to Java?
> >
> > On the Spark side, it's probably more complicated, as Vector and Matrix
> > are not "first class" types in Spark SQL. Spark ML implements them as
> > UDTs (user-defined types), so it's not clear how to make the Spark/Arrow
> > converter work with them.
> >
> > I wonder if Bryan and Holden have some more thoughts on that?
> >
> > Li
> >
> > On Mon, Apr 9, 2018 at 5:22 PM, Leif Walsh <[email protected]> wrote:
> >
> >> Hi all,
> >>
> >> I’ve been doing some work lately with Spark’s ML interfaces, which
> >> include sparse and dense Vector and Matrix types, backed on the Scala
> >> side by Breeze. Using these interfaces, you can construct DataFrames
> >> whose column types are vectors and matrices, and though the API isn’t
> >> terribly rich, it is possible to run Python UDFs over such a DataFrame
> >> and get numpy ndarrays out of each row. However, if you’re using
> >> Spark’s Arrow serialization between the executor and the Python
> >> workers, you get this UnsupportedOperationException:
> >> https://github.com/apache/spark/blob/252468a744b95082400ba9e8b2e3b3d9d50ab7fa/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala#L71
> >>
> >> I think it would be useful for Arrow to support something like a
> >> column of tensors, and I’d like to see if anyone else here is
> >> interested in such a thing. If so, I’d like to propose adding it to
> >> the spec and getting it implemented in at least Java and C++/Python.
> >>
> >> Some initial mildly-scattered thoughts:
> >>
> >> 1. You can certainly represent these today as List<Double> and
> >> List<List<Double>>, but then you need to do some copying to get them
> >> back into numpy ndarrays.
> >>
> >> 2. In some cases it might be useful to know that a column contains
> >> 3x3x4 tensors, for example, and not just that there are three
> >> dimensions, as you’d get with List<List<List<Double>>>. This could
> >> constrain which operations are meaningful (for example, in Spark you
> >> could imagine type checking that verifies dimension alignment for
> >> matrix multiplication).
> >>
> >> 3. You could approximate that with a FixedSizeList and metadata about
> >> the tensor shape.
> >>
> >> 4. But I kind of feel like this is generally useful enough that it’s
> >> worth having one implementation of it (well, one for each runtime) in
> >> Arrow.
> >>
> >> 5. Or, maybe everyone here thinks Spark should just do this with
> >> metadata?
> >>
> >> Curious to hear what you all think.
> >>
> >> Thanks,
> >> Leif
> >>
> >> --
> >> --
> >> Cheers,
> >> Leif
> >>
--
--
Cheers,
Leif
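The FixedSizeList-plus-shape-metadata approximation in point 3 of the quoted message can be sketched in plain stdlib Python (no Arrow APIs involved; `memoryview.cast` stands in here for the zero-copy reshape a real implementation would do with numpy):

```python
from array import array

# Sketch, stdlib only: store each 3x3 float64 tensor as a flat run of 9
# values (the FixedSizeList part), keep the shape as column-level metadata,
# and rebuild an n-dimensional view without copying. List<List<Double>>
# can't offer this, because its values aren't contiguous per cell.
flat = array('d', [1.0, 2.0, 3.0,
                   4.0, 5.0, 6.0,
                   7.0, 8.0, 9.0])
shape = (3, 3)  # metadata, fixed for every cell in the column

# memoryview.cast requires going through a byte format, so cast to 'B'
# first, then to float64 with the stored shape -- no data is copied.
view = memoryview(flat).cast('B').cast('d', shape=list(shape))
```

Because `view` aliases the same memory as `flat`, mutating the flat buffer is visible through the shaped view, which is the property the copy-based `List<List<Double>>` round-trip loses.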
