Re: Handling nulls in vector columns is non-trivial

2017-06-24 Thread Timur Shenkao
Hi Franklyn,

I had the same problem as you with vectors & maps. I tried:
1) UDF --> cumbersome and difficult to maintain. One has to re-write /
re-implement UDFs, provide extensive docs for colleagues, and something
weird may happen when you migrate to a new Spark version.
2) RDD / DataFrame row-by-row approach --> slow but reliable &
understandable, and you can fill in & modify rows as you like.
3) Spark SQL / DataFrame "magic tricks" approach --> like the code you
provided: create DataFrames, join them, drop & modify columns somehow.
Generally speaking, it should be faster than approach #2.
4) Encoders --> raw approach; basic types only. I chose this approach in
the end, but I had to reformulate my problem in terms of basic data types
and POJO classes.
5) Catalyst API expressions --> I believe this is the right way, but it
requires that you & your colleagues dive deep into Spark internals.
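Approach #2's per-row fill can be sketched in plain Python (a stand-in for the logic one would run inside df.rdd.map; the dicts here merely mimic SparseVector's fields and are not Spark API):

```python
# Row-by-row null fill (approach #2), sketched without a live Spark session.
# In real code this function would run inside df.rdd.map(...) over Row objects;
# here a "vector" is just a dict standing in for SparseVector's contents.

EMPTY_VECTOR = {"size": 10, "indices": (), "values": ()}

def fill_null_vector(row):
    """Replace a missing 'feature' vector with an empty one, keep other fields."""
    feature = row.get("feature")
    return {**row, "feature": feature if feature is not None else EMPTY_VECTOR}

rows = [{"id": 1, "feature": {"size": 10, "indices": (1,), "values": (44.0,)}},
        {"id": 2, "feature": None}]
filled = [fill_null_vector(r) for r in rows]
```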

Sincerely yours, Timur



Re: Handling nulls in vector columns is non-trivial

2017-06-23 Thread Franklyn D'souza
For reference, this is what is required to coalesce a vector column in
pyspark:

from pyspark.ml.linalg import SparseVector, VectorUDT
from pyspark.sql import functions as F
from pyspark.sql.types import StructField, StructType

# 'schema' was not shown in the original; assume one nullable vector column.
# 'spark' is the SparkSession.
schema = StructType([StructField('feature', VectorUDT(), True)])
df = spark.createDataFrame([(SparseVector(10, {1: 44}),), (None,),
    (SparseVector(10, {1: 23}),), (None,), (SparseVector(10, {1: 35}),)],
    schema=schema)
# name the single-row column '_empty_vector' so the coalesce can reference it
empty_vector = spark.createDataFrame([(SparseVector(10, {}),)],
    StructType([StructField('_empty_vector', VectorUDT(), True)]))
df = df.crossJoin(empty_vector)
df = df.withColumn('feature', F.coalesce('feature', '_empty_vector'))





Re: Handling nulls in vector columns is non-trivial

2017-06-22 Thread Franklyn D'souza
We've developed Scala UDFs internally to address some of these issues, and
we'd love to upstream them to Spark. We're just trying to figure out what
vector support looks like on the roadmap.

Would it be best to put this functionality into the Imputer or
VectorAssembler, or to give vectors more first-class support in dataframes
by having them work with the lit column expression?



Re: Handling nulls in vector columns is non-trivial

2017-06-21 Thread Franklyn D'souza
From the documentation, it states that `The input columns should be of
DoubleType or FloatType.`, so I don't think that is what I'm looking for.
Also, in general the API around vectors is lacking, especially on the
pyspark side.

Very common vector operations like addition, subtraction, and dot products
can't be performed. I'm wondering what the direction is with vector support
in Spark.
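For reference, the missing operations are simple to state; here is a plain-Python sketch of sparse addition and dot product ({index: value} dicts stand in for SparseVector contents -- this is the kind of logic users currently end up re-implementing in UDFs, not Spark API):

```python
# Sparse vectors as {index: value} dicts, to pin down the semantics of the
# column operations the thread says are missing.

def sparse_add(a, b):
    """Element-wise sum of two sparse vectors; drop explicit zeros."""
    out = dict(a)
    for i, v in b.items():
        out[i] = out.get(i, 0.0) + v
    return {i: v for i, v in out.items() if v != 0.0}

def sparse_dot(a, b):
    """Dot product; iterate over the operand with fewer non-zeros."""
    small, big = (a, b) if len(a) <= len(b) else (b, a)
    return sum(v * big.get(i, 0.0) for i, v in small.items())
```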



Re: Handling nulls in vector columns is non-trivial

2017-06-21 Thread Maciej Szymkiewicz
Since 2.2 there is Imputer:

https://github.com/apache/spark/blob/branch-2.2/examples/src/main/python/ml/imputer_example.py

which should at least partially address the problem.
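Note that Imputer operates on numeric (DoubleType/FloatType) columns rather than vector columns; a rough plain-Python sketch of the mean strategy it applies by default (illustrative only, not the Spark implementation):

```python
# Mean imputation over a numeric column, as ml.feature.Imputer's default
# "mean" strategy does (sketch; the real transformer works on DataFrames).

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]
```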

On 06/22/2017 03:03 AM, Franklyn D'souza wrote:
> I just wanted to highlight some of the rough edges around using
> vectors in columns in dataframes.
>
> If there is a null in a dataframe column containing vectors, pyspark ml
> models like logistic regression will completely fail.
>
> However, from what I've read there is no good way to fill in these
> nulls with empty vectors.
>
> It's not possible to create a literal vector column expression and
> coalesce it with the column from pyspark.
>
> So we're left with writing a python udf which does this coalesce; this
> is really inefficient on large datasets and becomes a bottleneck for
> ml pipelines working with real-world data.
>
> I'd like to know how other users are dealing with this and what plans
> there are to extend vector support for dataframes.
>
> Thanks!
>
> Franklyn

-- 


To unsubscribe e-mail: dev-unsubscr...@spark.apache.org