What do people think: should "shape" be included as an optional part of the schema metadata, or as a required part of the schema itself?
I feel having it be required might be too restrictive for interop with other systems.

On Mon, Apr 9, 2018 at 9:13 PM, Leif Walsh <leif.wa...@gmail.com> wrote:
> My gut feeling is that such a column type should specify both the shape and
> primitive type of all values in the column. I can’t think of a common use
> case that requires differently shaped tensors in a single column.
>
> Can anyone here come up with such a use case?
>
> If not, I can try to draft a proposed change to the spec that adds these
> types. The next question is whether such a change can make it in (with C++
> and Java implementations) before 1.0.
>
> On Mon, Apr 9, 2018 at 17:36 Wes McKinney <wesmck...@gmail.com> wrote:
>
> > > As far as I know, there is an implementation of tensor type in
> > > C++/Python already. Should we just finalize the spec and add
> > > implementation to Java?
> >
> > There is nothing specified yet as far as a *column* of
> > ndarrays/tensors. We defined Tensor metadata for the purposes of
> > IPC/serialization but made no effort to incorporate such data into the
> > columnar format.
> >
> > There are likely many ways to implement a column whose values are
> > ndarrays, each cell with its own shape. Whether we would want to
> > permit each cell to have a different ndarray cell type is another
> > question (i.e. would we want to constrain every cell in a column to
> > contain ndarrays of a particular type, like float64).
> >
> > So there are a couple of questions:
> >
> > * How to represent the data using the columnar format
> > * How to incorporate ndarray metadata into columnar schemas
> >
> > - Wes
> >
> > On Mon, Apr 9, 2018 at 5:30 PM, Li Jin <ice.xell...@gmail.com> wrote:
> > > As far as I know, there is an implementation of tensor type in C++/Python
> > > already. Should we just finalize the spec and add implementation to Java?
> > >
> > > On the Spark side, it's probably more complicated, as Vector and Matrix are
> > > not "first class" types in Spark SQL. Spark ML implements them as UDTs
> > > (user-defined types), so it's not clear how to make the Spark/Arrow
> > > converter work with them.
> > >
> > > I wonder if Bryan and Holden have some more thoughts on that?
> > >
> > > Li
> > >
> > > On Mon, Apr 9, 2018 at 5:22 PM, Leif Walsh <leif.wa...@gmail.com> wrote:
> > >
> > >> Hi all,
> > >>
> > >> I’ve been doing some work lately with Spark’s ML interfaces, which include
> > >> sparse and dense Vector and Matrix types, backed on the Scala side by
> > >> Breeze. Using these interfaces, you can construct DataFrames whose column
> > >> types are vectors and matrices, and though the API isn’t terribly rich, it
> > >> is possible to run Python UDFs over such a DataFrame and get numpy ndarrays
> > >> out of each row. However, if you’re using Spark’s Arrow serialization
> > >> between the executor and Python workers, you get this
> > >> UnsupportedOperationException:
> > >> https://github.com/apache/spark/blob/252468a744b95082400ba9e8b2e3b3d9d50ab7fa/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala#L71
> > >>
> > >> I think it would be useful for Arrow to support something like a column of
> > >> tensors, and I’d like to see if anyone else here is interested in such a
> > >> thing. If so, I’d like to propose adding it to the spec and getting it
> > >> implemented in at least Java and C++/Python.
> > >>
> > >> Some initial mildly-scattered thoughts:
> > >>
> > >> 1. You can certainly represent these today as List<Double> and
> > >> List<List<Double>>, but then you need to do some copying to get them back
> > >> into numpy ndarrays.
> > >>
> > >> 2. In some cases it might be useful to know that a column contains 3x3x4
> > >> tensors, for example, and not just that there are three dimensions as you’d
> > >> get with List<List<List<Double>>>. This could constrain what operations
> > >> are meaningful (for example, in Spark you could imagine type checking that
> > >> verifies dimension alignment for matrix multiplication).
> > >>
> > >> 3. You could approximate that with a FixedSizeList and metadata about the
> > >> tensor shape.
> > >>
> > >> 4. But I kind of feel like this is generally useful enough that it’s worth
> > >> having one implementation of it (well, one for each runtime) in Arrow.
> > >>
> > >> 5. Or, maybe everyone here thinks Spark should just do this with metadata?
> > >>
> > >> Curious to hear what you all think.
> > >>
> > >> Thanks,
> > >> Leif
> > >>
> > >> --
> > >> Cheers,
> > >> Leif
>
> --
> Cheers,
> Leif
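[Editor's note] Point 3 in Leif's list (a FixedSizeList plus shape metadata) can be illustrated without any Arrow APIs. Below is a minimal plain-Python sketch of that layout, assuming every cell in the column shares one fixed shape: the cells live in a single flat buffer of width prod(shape), and each cell is a constant-size slice of it. `TensorColumn` and its methods are hypothetical names for illustration only, not part of Arrow or Spark; a real implementation would presumably store the flat values in an Arrow FixedSizeList array and carry the shape in field metadata.

```python
# Sketch of a fixed-shape "tensor column": one flat value buffer plus
# shape metadata shared by every cell. Names are illustrative, not an
# Arrow API. Handles 2-D cells only, to keep the sketch short.

from dataclasses import dataclass, field
from functools import reduce
from operator import mul
from typing import List, Tuple


@dataclass
class TensorColumn:
    shape: Tuple[int, int]                              # shape of every cell, e.g. (2, 3)
    values: List[float] = field(default_factory=list)   # flat buffer of all cells

    @property
    def cell_size(self) -> int:
        # scalars per cell; this is the FixedSizeList width
        return reduce(mul, self.shape, 1)

    def append(self, tensor: List[List[float]]) -> None:
        # flatten one cell into the shared buffer, checking its shape
        flat = [x for row in tensor for x in row]
        if len(flat) != self.cell_size:
            raise ValueError("tensor does not match column shape")
        self.values.extend(flat)

    def cell(self, i: int) -> List[List[float]]:
        # reconstruct cell i from a constant-size slice of the buffer;
        # with a real ndarray this slice-and-reshape would be zero-copy,
        # which is the appeal of this layout over List<List<Double>>
        rows, cols = self.shape
        start = i * self.cell_size
        flat = self.values[start:start + self.cell_size]
        return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


col = TensorColumn(shape=(2, 3))
col.append([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
col.append([[7.0, 8.0, 9.0], [10.0, 11.0, 12.0]])
print(col.cell(1))  # [[7.0, 8.0, 9.0], [10.0, 11.0, 12.0]]
```

The per-column (rather than per-cell) shape is exactly the design question being debated above: this sketch takes the "required shape" side, since a single width and shape for the whole column is what makes the flat, copy-free layout possible.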