Since 2.2 there is Imputer: https://github.com/apache/spark/blob/branch-2.2/examples/src/main/python/ml/imputer_example.py
which should at least partially address the problem.

On 06/22/2017 03:03 AM, Franklyn D'souza wrote:
> I just wanted to highlight some of the rough edges around using
> vectors in columns in dataframes.
>
> If there is a null in a dataframe column containing vectors, pyspark ml
> models like logistic regression will completely fail.
>
> However, from what I've read there is no good way to fill in these
> nulls with empty vectors.
>
> It's not possible to create a literal vector column expression and
> coalesce it with the column from pyspark.
>
> So we're left with writing a python udf which does this coalesce; this
> is really inefficient on large datasets and becomes a bottleneck for
> ml pipelines working with real world data.
>
> I'd like to know how other users are dealing with this and what plans
> there are to extend vector support for dataframes.
>
> Thanks!
>
> Franklyn
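
To make the Imputer suggestion concrete, here is a minimal sketch, not a definitive recipe. Imputer operates on Float/Double columns rather than on vector columns, so this only helps when the missing values can be filled before the vector is assembled; an already-built vector column containing nulls is not covered. The helper names, the `_imputed` suffix, and the toy data are illustrative assumptions, not anything from the Spark examples.

```python
def imputed_names(cols, suffix="_imputed"):
    """Names for the imputed copies of the input columns (suffix is an assumption)."""
    return [c + suffix for c in cols]


def impute_then_assemble(df, feature_cols, output_col="features"):
    """Fill missing scalar values with the column mean, then build the vector column."""
    # Imported lazily so the naming helper above stays usable without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Imputer, VectorAssembler

    out_cols = imputed_names(feature_cols)
    # Imputer treats both null and NaN (the default missingValue) as missing.
    imputer = Imputer(strategy="mean", inputCols=feature_cols, outputCols=out_cols)
    # Assemble only after imputation, so no missing value reaches the vector.
    assembler = VectorAssembler(inputCols=out_cols, outputCol=output_col)
    return Pipeline(stages=[imputer, assembler]).fit(df).transform(df)


def demo():
    """Hypothetical toy run; requires a local Spark 2.2+ installation."""
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]").appName("imputer-sketch").getOrCreate()
    )
    df = spark.createDataFrame(
        [(1.0, float("nan")), (2.0, 4.0), (None, 6.0)],
        ["a", "b"],
    )
    impute_then_assemble(df, ["a", "b"]).show()
    spark.stop()
```

Calling `demo()` on a machine with Spark 2.2 or later should show the original columns, their imputed copies, and the assembled `features` column. Since the imputation runs as a regular pipeline stage, it avoids the Python UDF mentioned in the quoted message, at the cost of only working before assembly.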