Thanks, I’ll create a Jira issue and a Google Doc. I agree those are the main
questions to iron out.

If there’s a desire to avoid scope-creeping this in before 1.0, I think I’ll
start a parallel conversation with the Spark community about using the
existing FixedSizeBinary type plus some custom metadata to provide
serialization for their ML UDTs, and let them know that if this is added to
Arrow in the future, they could switch that implementation to use the Arrow
types instead.


On Tue, Apr 10, 2018 at 19:18 Wes McKinney <wesmck...@gmail.com> wrote:

> The simplest thing would be to have a "tensor" or "ndarray" type where
> each cell has the same shape. This would amount to adding the current
> "Tensor" Flatbuffers table to the Type union in
>
> https://github.com/apache/arrow/blob/master/format/Schema.fbs#L194
>
> The benefit of each cell having the same shape is that the physical
> representation is FixedSizeBinary.
>
> Some caveats / notes:
>
> * We have a prior unresolved discussion about our approach to logical
> types. I could argue that this might fall into the same bucket of
> logical types. I don't think we should merge any patches related to
> this issue until we resolve that discussion
>
> * Using FixedSizeBinary as the physical representation constrains
> value sizes to 2GB (product of shape) because the FixedSizeBinary
> metadata uses int for the byteWidth. We might consider changing this
> to long (64 bits), but that's a separate discussion
>
> * If we permitted each cell to have a different shape, then we would
> need to use Binary (vs. FixedSizeBinary), which would limit the entire
> size of a column to 2GB of total tensor data. This could be mitigated
> by introducing LargeBinary (64 bit offsets), but this requires
> additional discussion (there is a JIRA about this already from some
> time ago)
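The arithmetic behind the per-cell caveat is easy to check; a quick sketch (the 2 GB figure comes from byteWidth being a signed 32-bit int):

```python
# byteWidth in the FixedSizeBinary metadata is a signed 32-bit int,
# so one cell is capped at 2**31 - 1 bytes (~2 GB).
INT32_MAX = 2**31 - 1

def fits_fixed_size_binary(shape, itemsize):
    """Check whether one cell of this shape/itemsize fits in FixedSizeBinary."""
    byte_width = itemsize
    for dim in shape:
        byte_width *= dim
    return byte_width <= INT32_MAX

# A 512x512x3 float64 tensor is ~6 MB per cell: fine.
print(fits_fixed_size_binary((512, 512, 3), 8))   # True

# A 20000x20000 float64 matrix is ~3.2 GB per cell: over the limit.
print(fits_fixed_size_binary((20000, 20000), 8))  # False
```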
>
> Given that we are still falling short of a complete implementation of
> other Arrow types (unions, intervals, fixed size lists), I urge all to
> be deliberate about not piling on more technical debt / format
> implementation shortfall if it can be avoided -- so a solution to this
> might be to have a patch for initial Tensor/Ndarray value support that
> is implemented in Java and/or C++
>
> How about creating a JIRA about this broad topic and creating a Google
> doc with a proposed implementation approach for discussion?
>
> Thanks
> Wes
>
> On Tue, Apr 10, 2018 at 5:48 PM, Li Jin <ice.xell...@gmail.com> wrote:
> > What do people think about whether "shape" should be included as an
> > optional part of schema metadata or a required part of the schema itself?
> >
> > I feel having it be required might be too restrictive for interop with
> > other systems.
> >
> > On Mon, Apr 9, 2018 at 9:13 PM, Leif Walsh <leif.wa...@gmail.com> wrote:
> >
> >> My gut feeling is that such a column type should specify both the shape
> >> and primitive type of all values in the column. I can’t think of a
> >> common use case that requires differently shaped tensors in a single
> >> column.
> >>
> >> Can anyone here come up with such a use case?
> >>
> >> If not, I can try to draft a proposed change to the spec that adds these
> >> types. The next question is whether such a change can make it in (with
> >> C++ and Java implementations) before 1.0.
> >> On Mon, Apr 9, 2018 at 17:36 Wes McKinney <wesmck...@gmail.com> wrote:
> >>
> >> > > As far as I know, there is an implementation of tensor type in
> >> > > C++/Python already. Should we just finalize the spec and add
> >> > > implementation to Java?
> >> >
> >> > There is nothing specified yet as far as a *column* of
> >> > ndarrays/tensors. We defined Tensor metadata for the purposes of
> >> > IPC/serialization but made no effort to incorporate such data into the
> >> > columnar format.
> >> >
> >> > There are likely many ways to implement a column whose values are
> >> > ndarrays, each cell with its own shape. Whether we would want to
> >> > permit each cell to have a different ndarray cell type is another
> >> > question (i.e. would we want to constrain every cell in a column to
> >> > contain ndarrays of a particular type, like float64).
> >> >
> >> > So there are a couple of questions:
> >> >
> >> > * How to represent the data using the columnar format
> >> > * How to incorporate ndarray metadata into columnar schemas
> >> >
> >> > - Wes
> >> >
> >> > On Mon, Apr 9, 2018 at 5:30 PM, Li Jin <ice.xell...@gmail.com> wrote:
> >> > > As far as I know, there is an implementation of a tensor type in
> >> > > C++/Python already. Should we just finalize the spec and add an
> >> > > implementation to Java?
> >> > >
> >> > > On the Spark side, it's probably more complicated, as Vector and
> >> > > Matrix are not "first class" types in Spark SQL. Spark ML implements
> >> > > them as UDTs (user-defined types), so it's not clear how to make the
> >> > > Spark/Arrow converter work with them.
> >> > >
> >> > > I wonder if Bryan and Holden have some more thoughts on that?
> >> > >
> >> > > Li
> >> > >
> >> > > On Mon, Apr 9, 2018 at 5:22 PM, Leif Walsh <leif.wa...@gmail.com> wrote:
> >> > >
> >> > >> Hi all,
> >> > >>
> >> > >> I’ve been doing some work lately with Spark’s ML interfaces, which
> >> > >> include sparse and dense Vector and Matrix types, backed on the
> >> > >> Scala side by Breeze. Using these interfaces, you can construct
> >> > >> DataFrames whose column types are vectors and matrices, and though
> >> > >> the API isn’t terribly rich, it is possible to run Python UDFs over
> >> > >> such a DataFrame and get numpy ndarrays out of each row. However, if
> >> > >> you’re using Spark’s Arrow serialization between the executor and
> >> > >> Python workers, you get this UnsupportedOperationException:
> >> > >>
> >> > >> https://github.com/apache/spark/blob/252468a744b95082400ba9e8b2e3b3d9d50ab7fa/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala#L71
> >> > >>
> >> > >> I think it would be useful for Arrow to support something like a
> >> > >> column of tensors, and I’d like to see if anyone else here is
> >> > >> interested in such a thing. If so, I’d like to propose adding it to
> >> > >> the spec and getting it implemented in at least Java and C++/Python.
> >> > >>
> >> > >> Some initial mildly-scattered thoughts:
> >> > >>
> >> > >> 1. You can certainly represent these today as List<Double> and
> >> > >> List<List<Double>>, but then you need to do some copying to get
> >> > >> them back into numpy ndarrays.
> >> > >>
> >> > >> 2. In some cases it might be useful to know that a column contains
> >> > >> 3x3x4 tensors, for example, and not just that there are three
> >> > >> dimensions as you’d get with List<List<List<Double>>>. This could
> >> > >> constrain what operations are meaningful (for example, in Spark you
> >> > >> could imagine type checking that verifies dimension alignment for
> >> > >> matrix multiplication).
> >> > >>
> >> > >> 3. You could approximate that with a FixedSizeList and metadata
> >> > >> about the tensor shape.
> >> > >>
> >> > >> 4. But I kind of feel like this is generally useful enough that it’s
> >> > >> worth having one implementation of it (well, one for each runtime)
> >> > >> in Arrow.
> >> > >>
> >> > >> 5. Or, maybe everyone here thinks Spark should just do this with
> >> > >> metadata?
> >> > >>
> >> > >> Curious to hear what you all think.
> >> > >>
> >> > >> Thanks,
> >> > >> Leif
> >> > >>
> >> > >> --
> >> > >> --
> >> > >> Cheers,
> >> > >> Leif
> >> > >>
> >> >
> >> --
> >> --
> >> Cheers,
> >> Leif
> >>
>
-- 
-- 
Cheers,
Leif
