Re: Handling nulls in vector columns is non-trivial

Franklyn D'souza Fri, 23 Jun 2017 03:56:27 -0700

For reference, this is what is currently required to coalesce a vector column
in PySpark:


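from pyspark.ml.linalg import SparseVector, VectorUDT
from pyspark.sql import functions as F
from pyspark.sql.types import StructField, StructType

# Assumptions: sc.sql is a SparkSession-like handle; both schemas are reconstructed.
schema = StructType([StructField('feature', VectorUDT())])
empty_schema = StructType([StructField('_empty_vector', VectorUDT())])
df = sc.sql.createDataFrame([(SparseVector(10, {1: 44}),), (None,),
    (SparseVector(10, {1: 23}),), (None,), (SparseVector(10, {1: 35}),)],
    schema=schema)
empty_vector = sc.sql.createDataFrame([(SparseVector(10, {}),)],
    schema=empty_schema)
df = df.crossJoin(empty_vector)  # attach the empty vector to every row
df = df.withColumn('feature', F.coalesce('feature', '_empty_vector'))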

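For comparison, the Python UDF fallback mentioned in the original message
could look roughly like the sketch below. The helper names make_null_filler
and as_spark_udf are hypothetical, and the vector size of 10 in the usage
note is only borrowed from the example above. This is an illustration of the
approach, not the code we actually run.

```python
def make_null_filler(empty):
    """Return a function that maps None to `empty` and passes other values through."""
    def fill(value):
        return empty if value is None else value
    return fill


def as_spark_udf(fill):
    """Wrap `fill` as a PySpark UDF that returns a vector column."""
    # Imported lazily so make_null_filler can be used and tested without Spark.
    from pyspark.ml.linalg import VectorUDT
    from pyspark.sql.functions import udf
    return udf(fill, VectorUDT())
```

Usage would be along the lines of
df.withColumn('feature', as_spark_udf(make_null_filler(SparseVector(10, {})))('feature')).
Every row goes through the Python worker this way, which is the overhead
that the crossJoin and coalesce version avoids.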


On Thu, Jun 22, 2017 at 11:54 AM, Franklyn D'souza <
franklyn.dso...@shopify.com> wrote:

> We've developed Scala UDFs internally to address some of these issues and
> we'd love to upstream them to Spark. I'm just trying to figure out what
> vector support looks like on the roadmap.
>
> Would it be best to put this functionality into the Imputer or
> VectorAssembler, or to give it first-class support in DataFrames by
> making it work with the lit column expression?
>
> On Wed, Jun 21, 2017 at 9:30 PM, Franklyn D'souza <
> franklyn.dso...@shopify.com> wrote:
>
>> The documentation states that "The input columns should be of
>> DoubleType or FloatType.", so I don't think that is what I'm looking for.
>> Also, in general the API around vectors is highly lacking, especially on
>> the PySpark side.
>>
>> Very common vector operations like addition, subtraction and dot
>> products can't be performed. I'm wondering what the direction is for
>> vector support in Spark.
>>
>> On Wed, Jun 21, 2017 at 9:19 PM, Maciej Szymkiewicz <
>> mszymkiew...@gmail.com> wrote:
>>
>>> Since 2.2 there is Imputer:
>>>
>>> https://github.com/apache/spark/blob/branch-2.2/examples/src
>>> /main/python/ml/imputer_example.py
>>>
>>> which should at least partially address the problem.
>>>
>>> On 06/22/2017 03:03 AM, Franklyn D'souza wrote:
>>> > I just wanted to highlight some of the rough edges around using
>>> > vectors in columns in dataframes.
>>> >
>>> > If there is a null in a dataframe column containing vectors, PySpark ML
>>> > models like logistic regression will fail completely.
>>> >
>>> > However, from what I've read there is no good way to fill in these
>>> > nulls with empty vectors.
>>> >
>>> > It's not possible to create a literal vector column expression and
>>> > coalesce it with the column from PySpark.
>>> >
>>> > So we're left with writing a Python UDF which does this coalesce. This
>>> > is really inefficient on large datasets and becomes a bottleneck for
>>> > ML pipelines working with real-world data.
>>> >
>>> > I'd like to know how other users are dealing with this and what plans
>>> > there are to extend vector support for dataframes.
>>> >
>>> > Thanks!
>>> >
>>> > Franklyn
>>>
>>
>
