What do people think: should "shape" be included as an optional part of the schema metadata, or as a required part of the schema itself?
I feel having it be required might be too restrictive for interop with other systems.

On Mon, Apr 9, 2018 at 9:13 PM, Leif Walsh <leif.wa...@gmail.com> wrote:
> My gut feeling is that such a column type should specify both the shape and
> primitive type of all values in the column. I can’t think of a common use
> case that requires differently shaped tensors in a single column.
>
> Can anyone here come up with such a use case?
>
> If not, I can try to draft a proposed change to the spec that adds these
> types. The next question is whether such a change can make it in (with C++
> and Java implementations) before 1.0.
>
> On Mon, Apr 9, 2018 at 17:36 Wes McKinney <wesmck...@gmail.com> wrote:
>
> > > As far as I know, there is an implementation of tensor type in
> > > C++/Python already. Should we just finalize the spec and add
> > > implementation to Java?
> >
> > There is nothing specified yet as far as a *column* of
> > ndarrays/tensors. We defined Tensor metadata for the purposes of
> > IPC/serialization but made no effort to incorporate such data into the
> > columnar format.
> >
> > There are likely many ways to implement a column whose values are
> > ndarrays, each cell with its own shape. Whether we would want to
> > permit each cell to have a different ndarray cell type is another
> > question (i.e. would we want to constrain every cell in a column to
> > contain ndarrays of a particular type, like float64).
> >
> > So there are a couple of questions:
> >
> > * How to represent the data using the columnar format
> > * How to incorporate ndarray metadata into columnar schemas
> >
> > - Wes
> >
> > On Mon, Apr 9, 2018 at 5:30 PM, Li Jin <ice.xell...@gmail.com> wrote:
> > > As far as I know, there is an implementation of tensor type in C++/Python
> > > already. Should we just finalize the spec and add implementation to Java?
> > >
> > > On the Spark side, it's probably more complicated, as Vector and Matrix are
> > > not "first class" types in Spark SQL. Spark ML implements them as UDTs
> > > (user-defined types), so it's not clear how to make the Spark/Arrow
> > > converter work with them.
> > >
> > > I wonder if Bryan and Holden have some more thoughts on that?
> > >
> > > Li
> > >
> > > On Mon, Apr 9, 2018 at 5:22 PM, Leif Walsh <leif.wa...@gmail.com> wrote:
> > >
> > >> Hi all,
> > >>
> > >> I’ve been doing some work lately with Spark’s ML interfaces, which include
> > >> sparse and dense Vector and Matrix types, backed on the Scala side by
> > >> Breeze. Using these interfaces, you can construct DataFrames whose column
> > >> types are vectors and matrices, and though the API isn’t terribly rich, it
> > >> is possible to run Python UDFs over such a DataFrame and get numpy ndarrays
> > >> out of each row. However, if you’re using Spark’s Arrow serialization
> > >> between the executor and Python workers, you get this
> > >> UnsupportedOperationException:
> > >> https://github.com/apache/spark/blob/252468a744b95082400ba9e8b2e3b3d9d50ab7fa/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala#L71
> > >>
> > >> I think it would be useful for Arrow to support something like a column of
> > >> tensors, and I’d like to see if anyone else here is interested in such a
> > >> thing. If so, I’d like to propose adding it to the spec and getting it
> > >> implemented in at least Java and C++/Python.
> > >>
> > >> Some initial mildly-scattered thoughts:
> > >>
> > >> 1. You can certainly represent these today as List<Double> and
> > >> List<List<Double>>, but then you need to do some copying to get them back
> > >> into numpy ndarrays.
> > >>
> > >> 2. In some cases it might be useful to know that a column contains 3x3x4
> > >> tensors, for example, and not just that there are three dimensions as you’d
> > >> get with List<List<List<Double>>>. This could constrain what operations
> > >> are meaningful (for example, in Spark you could imagine type checking that
> > >> verifies dimension alignment for matrix multiplication).
> > >>
> > >> 3. You could approximate that with a FixedSizeList and metadata about the
> > >> tensor shape.
> > >>
> > >> 4. But I kind of feel like this is generally useful enough that it’s worth
> > >> having one implementation of it (well, one for each runtime) in Arrow.
> > >>
> > >> 5. Or, maybe everyone here thinks Spark should just do this with metadata?
> > >>
> > >> Curious to hear what you all think.
> > >>
> > >> Thanks,
> > >> Leif
> > >>
> > >> --
> > >> Cheers,
> > >> Leif
>
> --
> Cheers,
> Leif
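[Editor's note] Point 3 in Leif's list (a FixedSizeList plus shape metadata) can be illustrated without any Arrow APIs. Below is a minimal plain-Python sketch of that layout, assuming every cell in the column shares one fixed shape: the cells live in a single flat buffer of width prod(shape), and each cell is a constant-size slice of it. `TensorColumn` and its methods are hypothetical names for illustration only, not part of Arrow or Spark; a real implementation would presumably store the flat values in an Arrow FixedSizeList array and carry the shape in field metadata.

```python
# Sketch of a fixed-shape "tensor column": one flat value buffer plus
# shape metadata shared by every cell. Names are illustrative, not an
# Arrow API. Handles 2-D cells only, to keep the sketch short.

from dataclasses import dataclass, field
from functools import reduce
from operator import mul
from typing import List, Tuple


@dataclass
class TensorColumn:
    shape: Tuple[int, int]                              # shape of every cell, e.g. (2, 3)
    values: List[float] = field(default_factory=list)   # flat buffer of all cells

    @property
    def cell_size(self) -> int:
        # scalars per cell; this is the FixedSizeList width
        return reduce(mul, self.shape, 1)

    def append(self, tensor: List[List[float]]) -> None:
        # flatten one cell into the shared buffer, checking its shape
        flat = [x for row in tensor for x in row]
        if len(flat) != self.cell_size:
            raise ValueError("tensor does not match column shape")
        self.values.extend(flat)

    def cell(self, i: int) -> List[List[float]]:
        # reconstruct cell i from a constant-size slice of the buffer;
        # with a real ndarray this slice-and-reshape would be zero-copy,
        # which is the appeal of this layout over List<List<Double>>
        rows, cols = self.shape
        start = i * self.cell_size
        flat = self.values[start:start + self.cell_size]
        return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


col = TensorColumn(shape=(2, 3))
col.append([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
col.append([[7.0, 8.0, 9.0], [10.0, 11.0, 12.0]])
print(col.cell(1))  # [[7.0, 8.0, 9.0], [10.0, 11.0, 12.0]]
```

The per-column (rather than per-cell) shape is exactly the design question being debated above: this sketch takes the "required shape" side, since a single width and shape for the whole column is what makes the flat, copy-free layout possible.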