[jira] [Commented] (SPARK-17727) PySpark SQL arrays are not immutable, .remove and .pop cause issues

Hyukjin Kwon (JIRA) Wed, 02 Nov 2016 19:05:13 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-17727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631228#comment-15631228
 ]


Hyukjin Kwon commented on SPARK-17727:
--------------------------------------

Hi [~srowen], should this JIRA be resolved, and if so, with which action?

> PySpark SQL arrays are not immutable, .remove and .pop cause issues
> -------------------------------------------------------------------
>
>                 Key: SPARK-17727
>                 URL: https://issues.apache.org/jira/browse/SPARK-17727
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.0.0
>         Environment: OS X and Linux (Amazon Linux AMI release 2016.03), 
> Python 2.x
>            Reporter: Ganesh Sivalingam
>
> When one column of a DataFrame is an array, for example:
> {code}
> +-------+---+---------+
> |join_on|  a|        b|
> +-------+---+---------+
> |      1|  1|[1, 2, 3]|
> |      1|  2|[1, 2, 3]|
> |      1|  3|[1, 2, 3]|
> +-------+---+---------+
> {code}
> If I try to remove the value in column a from the array in column b using
> Python's `list.remove(val)` method, it works at first. However, after a
> second manipulation of the DataFrame, it fails with an error saying that the
> item (the value in column a) is not present.
> So PySpark appears to be re-running `list.remove()` on the already altered
> list/array.
> Below is a minimal example that I think should work, but which exhibits this
> issue:
> {code:python}
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> import numpy as np
> cols = ['join_on', 'a']
> vals = [
>     (1, 1),
>     (1, 2),
>     (1, 3)
> ]
> df = sqlContext.createDataFrame(vals, cols)
> df_of_arrays = df\
>     .groupBy('join_on')\
>     .agg(F.collect_list('a').alias('b'))
> df = df\
>     .join(df_of_arrays, on='join_on')
> df.show()
> def rm_element(a, list_a):
>     list_a.remove(a)
>     return list_a
> rm_element_udf = F.udf(rm_element, T.ArrayType(T.LongType()))
> df = df.withColumn('one_removed', rm_element_udf("a", "b"))
> df.show()
> answer = df.withColumn(
>     'av', F.udf(lambda a: float(np.mean(a)))('one_removed'))
> answer.show()
> {code}
> This can then be fixed by changing the rm_element function to work on a copy:
> {code:python}
> import copy
> def rm_element(a, list_a):
>     # Copy first so the input array is left unmodified.
>     tmp_list_a = copy.deepcopy(list_a)
>     tmp_list_a.remove(a)
>     return tmp_list_a
> {code}
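
The behaviour described in the report can be sketched in plain Python, without Spark. This is only an illustration under one assumption: that the same list object is handed to the function each time the column is re-evaluated, which this thread does not confirm. The names rm_element_in_place and rm_element_copy are hypothetical. Because the array elements are integers, a shallow copy via list() is enough here; copy.deepcopy, as in the report, would only matter for nested mutable elements.

```python
def rm_element_in_place(a, list_a):
    # Mutates the caller's list, like the original UDF body in the report.
    list_a.remove(a)
    return list_a


def rm_element_copy(a, list_a):
    # Works on a shallow copy, so the input list is left untouched.
    result = list(list_a)
    result.remove(a)
    return result


# Copying version: evaluating twice on the same input gives the same answer.
row_b = [1, 2, 3]
first = rm_element_copy(2, row_b)
second = rm_element_copy(2, row_b)

# In-place version: the second evaluation sees the already altered list
# and raises ValueError because the value is no longer present.
shared = [1, 2, 3]
rm_element_in_place(2, shared)
try:
    rm_element_in_place(2, shared)
    second_call_failed = False
except ValueError:
    second_call_failed = True
```

With the copying version the function gives the same result however many times it is evaluated on a row, which is the property the workaround in the report relies on.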



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

