For reference, this is what is required to coalesce a vector column in PySpark:
    from pyspark.ml.linalg import SparseVector
    from pyspark.sql import functions as F

    df = spark.createDataFrame(
        [(SparseVector(10, {1: 44}),), (None,),
         (SparseVector(10, {1: 23}),), (None,),
         (SparseVector(10, {1: 35}),)],
        schema=schema)
    empty_vector = spark.createDataFrame(
        [(SparseVector(10, {}),)], schema=schema)
    df = df.crossJoin(empty_vector)
    df = df.withColumn('feature', F.coalesce('feature', '_empty_vector'))

On Thu, Jun 22, 2017 at 11:54 AM, Franklyn D'souza <
franklyn.dso...@shopify.com> wrote:

> We've developed Scala UDFs internally to address some of these issues and
> we'd love to upstream them back to Spark. Just trying to figure out what
> the vector support looks like on the roadmap.
>
> Would it be best to put this functionality into the Imputer or
> VectorAssembler, or maybe try to give it more first-class support in
> DataFrames by having it work with the lit column expression?
>
> On Wed, Jun 21, 2017 at 9:30 PM, Franklyn D'souza <
> franklyn.dso...@shopify.com> wrote:
>
>> From the documentation it states that "The input columns should be of
>> DoubleType or FloatType," so I don't think that is what I'm looking for.
>> Also, in general the API around vectors is highly lacking, especially on
>> the PySpark side.
>>
>> Very common vector operations like addition, subtraction and dot
>> products can't be performed. I'm wondering what the direction is with
>> vector support in Spark.
>>
>> On Wed, Jun 21, 2017 at 9:19 PM, Maciej Szymkiewicz <
>> mszymkiew...@gmail.com> wrote:
>>
>>> Since 2.2 there is Imputer:
>>>
>>> https://github.com/apache/spark/blob/branch-2.2/examples/src/main/python/ml/imputer_example.py
>>>
>>> which should at least partially address the problem.
>>>
>>> On 06/22/2017 03:03 AM, Franklyn D'souza wrote:
>>> > I just wanted to highlight some of the rough edges around using
>>> > vectors in DataFrame columns.
>>> >
>>> > If there is a null in a DataFrame column containing vectors, PySpark
>>> > ML models like logistic regression will completely fail.
>>> >
>>> > However, from what I've read there is no good way to fill in these
>>> > nulls with empty vectors.
>>> >
>>> > It's not possible to create a literal vector column expression and
>>> > coalesce it with the column from PySpark.
>>> >
>>> > So we're left with writing a Python UDF that does this coalesce; this
>>> > is really inefficient on large datasets and becomes a bottleneck for
>>> > ML pipelines working with real-world data.
>>> >
>>> > I'd like to know how other users are dealing with this and what plans
>>> > there are to extend vector support for DataFrames.
>>> >
>>> > Thanks,
>>> >
>>> > Franklyn
>>>
>>> --
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
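For anyone following along without a Spark cluster handy, here is a minimal pure-Python sketch of the per-row coalesce logic that the Python-UDF workaround in the thread performs. The function name `fill_null_vector` and the `(size, {index: value})` tuple used to stand in for a sparse vector are illustrative assumptions, not Spark's API; in real PySpark you would wrap the equivalent lambda with `udf(..., VectorUDT())` and pay the serialization cost the thread complains about.

```python
def fill_null_vector(vec, size=10):
    """Return vec unchanged, or an empty sparse representation if vec is None.

    A sparse vector is modeled here as (size, {index: value}); an empty
    vector of the same size is (size, {}) -- no non-zero entries.
    """
    if vec is None:
        return (size, {})
    return vec

# The same five rows as the example DataFrame above, with Nones to fill.
rows = [(10, {1: 44}), None, (10, {1: 23}), None, (10, {1: 35})]
filled = [fill_null_vector(v) for v in rows]

# In PySpark (hypothetical sketch, requires a running SparkSession):
#   from pyspark.sql.functions import udf
#   from pyspark.ml.linalg import SparseVector, VectorUDT
#   fill = udf(lambda v: v if v is not None else SparseVector(10, {}),
#              VectorUDT())
#   df = df.withColumn('feature', fill('feature'))
```

Because each row passes through a Python function, this runs outside the JVM and serializes every vector both ways, which is why the crossJoin-plus-coalesce trick at the top of this mail is attractive despite its awkwardness.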