Hi all,

I’ve been playing around with the Vector and Matrix UDTs in pyspark.ml and
I’ve found myself wanting more.

One minor issue is that with Arrow serialization enabled, these types
don’t serialize properly in Python UDF calls or in toPandas. There’s a
natural representation for them in numpy.ndarray, and I’ve started a
conversation with the Arrow community about supporting tensor-valued
columns, but that might be a ways out. In the meantime, I think we can fix
this by using the FixedSizeBinary column type in Arrow, together with some
metadata describing the tensor shape (list of dimension sizes).
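
To make the encoding idea concrete, here is a rough sketch of the byte
layout in plain Python. It makes no Arrow or Spark calls; the metadata keys
("tensor_shape", "dtype") and the helper names are made up for
illustration, and row-major little-endian float64 is just one layout we
could pick.

```python
import struct
from math import prod


def encode_tensor(values, shape):
    """Pack a flat, row-major list of floats into fixed-width bytes."""
    if len(values) != prod(shape):
        raise ValueError(f"expected {prod(shape)} values for shape {shape}")
    # "<" = little-endian, "d" = 8-byte float64
    return struct.pack(f"<{len(values)}d", *values)


def decode_tensor(buf, shape):
    """Unpack fixed-width bytes back into a flat, row-major list of floats."""
    return list(struct.unpack(f"<{prod(shape)}d", buf))


shape = (2, 2)
# Hypothetical column-level metadata that would travel with the binary column.
column_metadata = {"tensor_shape": list(shape), "dtype": "float64"}
# Every value in the column occupies the same number of bytes.
byte_width = 8 * prod(shape)

row = encode_tensor([1.0, 2.0, 3.0, 4.0], shape)
roundtrip = decode_tensor(row, shape)
```

Because every value has the same byte width, the column fits a fixed-size
binary type, and the shape only needs to be stored once in the metadata
rather than per row.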

The larger issue, for which I intend to submit an SPIP soon, is that these
types could be better supported at the API layer, regardless of
serialization. In the limit, we could consider the entire numpy ndarray
surface area as a target. At a minimum, I think these types should support
column operations like matrix multiply, transpose, inner and outer product,
etc., and maybe have a more ergonomic construction API like
df.withColumn('feature', Vectors.of('list', 'of', 'cols')), since the
VectorAssembler API is kind of clunky.

One possibility here is to restrict the tensor column types such that every
value must have the same shape, e.g. a 2x2 matrix. This would allow for
operations to check validity before execution, for example, a matrix
multiply could check that the dimensions match and fail fast. However,
there might be use cases for a column to contain variable-shape tensors, so
I’m open to discussion here.
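
As a sketch of what the fail-fast check could look like if the shape lives
in the column type, here is some plain Python. TensorType and
matmul_result_type are hypothetical names, not existing Spark classes; the
point is only that the check runs on types, before any data is touched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TensorType:
    """Hypothetical column type where every value has the same shape."""
    shape: tuple


def matmul_result_type(left, right):
    """Return the result type of left @ right, or raise before execution."""
    if len(left.shape) != 2 or len(right.shape) != 2:
        raise TypeError("matrix multiply needs two 2-d tensor columns")
    if left.shape[1] != right.shape[0]:
        raise TypeError(
            f"dimension mismatch: {left.shape} cannot multiply {right.shape}"
        )
    return TensorType((left.shape[0], right.shape[1]))


# A 2x3 column times a 3x4 column is known to be 2x4 at analysis time.
result = matmul_result_type(TensorType((2, 3)), TensorType((3, 4)))
```

With variable-shape columns, the same check would have to move into
per-row execution, which is the trade-off I’d like feedback on.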

What do you all think?
