[ https://issues.apache.org/jira/browse/SPARK-17727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15631228#comment-15631228 ]
Hyukjin Kwon commented on SPARK-17727: -------------------------------------- Hi [~srowen], would this JIRA be resolved with an action? > PySpark SQL arrays are not immutable, .remove and .pop cause issues > ------------------------------------------------------------------- > > Key: SPARK-17727 > URL: https://issues.apache.org/jira/browse/SPARK-17727 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL > Affects Versions: 2.0.0 > Environment: OS X and Linux (Amazon Linux AMI release 2016.03), > Python 2.x > Reporter: Ganesh Sivalingam > > When having one column of a DataFrame as an array, for example: > {code} > +-------+---+---------+ > |join_on| a| b| > +-------+---+---------+ > | 1| 1|[1, 2, 3]| > | 1| 2|[1, 2, 3]| > | 1| 3|[1, 2, 3]| > +-------+---+---------+ > {code} > If I try to remove the value in column a from the array in column b using > python's `list.remove(val)` function. It works, however, after running a > second manipulation of the dataframe it fails with an error saying that the > item (value in column a) is not present. > So PySpark is re-running the `list.remove()` but on the already altered > list/array. > Below is a minimal example, which I think should work, however exhibits this > issue: > {code:python} > import pyspark.sql.functions as F > import pyspark.sql.types as T > import numpy as np > cols = ['join_on', 'a'] > vals = [ > (1, 1), > (1, 2), > (1, 3) > ] > df = sqlContext.createDataFrame(vals, cols) > df_of_arrays = df\ > .groupBy('join_on')\ > .agg(F.collect_list('a').alias('b')) > df = df\ > .join(df_of_arrays, on='join_on') > df.show() > def rm_element(a, list_a): > list_a.remove(a) > return list_a > rm_element_udf = F.udf(rm_element, T.ArrayType(T.LongType())) > df = df.withColumn('one_removed', rm_element_udf("a", "b")) > df.show() > answer = df.withColumn('av', F.udf(lambda a: > float(np.mean(a)))('one_removed')) > answer.show() > {code} > This can then be fixed by changing the rm_element function to: > {code:python} > def rm_element(a, list_a): > tmp_list_a = copy.deepcopy(list_a) > tmp_list_a.remove(a) > return tmp_list_a > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org