Alberto Bonsanto created SPARK-17760:
----------------------------------------
Summary: DataFrame's pivot doesn't see column created in groupBy
Key: SPARK-17760
URL: https://issues.apache.org/jira/browse/SPARK-17760
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 2.0.0
Environment: Databricks Community Edition, Spark 2.0.0, PySpark,
Python 2.
Reporter: Alberto Bonsanto
Priority: Minor
Related to
[https://stackoverflow.com/questions/39817993/pivoting-with-missing-values].
I'm not completely sure whether this is a bug or expected behavior.
When you {{groupBy}} a column generated inline inside the call, the {{pivot}}
method apparently fails to resolve that column during analysis.
E.g.
{code:python}
from pyspark.sql.functions import col, dayofyear, hour

df = (sc.parallelize([(1.0, "2016-03-30 01:00:00"),
                      (30.2, "2015-01-02 03:00:02")])
      .toDF(["amount", "Date"])
      .withColumn("Date", col("Date").cast("timestamp")))

(df.withColumn("hour", hour("date"))
 .groupBy(dayofyear("date").alias("date"))
 .pivot("hour").sum("amount").show()){code}
It raises the following exception:
{quote}
AnalysisException: u'resolved attribute(s) date#140688 missing from
dayofyear(date)#140994,hour#140977,sum(`amount`)#140995 in operator !Aggregate
\[dayofyear(cast(date#140688 as date))], [dayofyear(cast(date#140688 as date))
AS dayofyear(date)#140994, pivotfirst(hour#140977, sum(`amount`)#140995, 1, 3,
0, 0) AS __pivot_sum(`amount`) AS `sum(``amount``)`#141001\];'
{quote}
As a workaround, you have to materialize the {{date}} column with {{withColumn}} before grouping and pivoting.
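The workaround can be sketched as follows (a minimal sketch, assuming the same {{sc}} context and sample data as the snippet above; the only change is that the {{dayofyear}} expression is materialized as a real column first, so {{pivot}}'s analysis can resolve it):

{code:python}
from pyspark.sql.functions import col, dayofyear, hour

df = (sc.parallelize([(1.0, "2016-03-30 01:00:00"),
                      (30.2, "2015-01-02 03:00:02")])
      .toDF(["amount", "Date"])
      .withColumn("Date", col("Date").cast("timestamp")))

(df.withColumn("hour", hour("Date"))
   .withColumn("date", dayofyear("Date"))  # materialize the grouping column first
   .groupBy("date")
   .pivot("hour").sum("amount").show()){code}

With the column created up front, the same groupBy/pivot/sum runs without the {{AnalysisException}}.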
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)