[ https://issues.apache.org/jira/browse/SPARK-13946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275807#comment-15275807 ]
Wes McKinney commented on SPARK-13946:
--------------------------------------

The expression {{F.count(sdf2.foo)}} derives from a different logical table ({{sdf2}}) than {{sdf}}. In my opinion this should result in an analysis error.

> PySpark DataFrames allows you to silently use aggregate expressions derived
> from different table expressions
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-13946
>                 URL: https://issues.apache.org/jira/browse/SPARK-13946
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>            Reporter: Wes McKinney
>
> In my opinion, this code should raise an exception rather than silently
> discarding the predicate:
> {code}
> import numpy as np
> import pandas as pd
> from pyspark.sql import functions as F
> df = pd.DataFrame({'foo': np.random.randn(1000000),
>                    'bar': np.random.randn(1000000)})
> sdf = sqlContext.createDataFrame(df)
> sdf2 = sdf[sdf.bar > 0]
> sdf.agg(F.count(sdf2.foo)).show()
> +----------+
> |count(foo)|
> +----------+
> |   1000000|
> +----------+
> {code}
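
To make the requested behavior concrete, here is a toy sketch in plain Python of the kind of lineage check the comment argues for. It is not Spark's analyzer and uses none of Spark's API: the names `Table`, `Column`, `Count`, and `AnalysisError` are hypothetical, invented only for this illustration. The assumption is that each column remembers the logical table it was derived from, and that an aggregate over a table rejects columns from any other table.

```python
class AnalysisError(Exception):
    """Hypothetical error: an expression refers to a table it was not derived from."""


class Column:
    """A column reference that remembers its source table (hypothetical)."""

    def __init__(self, name, table):
        self.name = name
        self.table = table


class Count:
    """A count aggregate over a single column (hypothetical)."""

    def __init__(self, column):
        self.column = column


class Table:
    """A minimal in-memory table: a list of dict rows (hypothetical)."""

    def __init__(self, rows):
        self.rows = list(rows)

    def col(self, name):
        return Column(name, self)

    def filter(self, predicate):
        # Filtering yields a new logical table, distinct from its parent.
        return Table(r for r in self.rows if predicate(r))

    def agg(self, expr):
        # The check the comment asks for: refuse columns derived from
        # another table instead of silently resolving them by name.
        if expr.column.table is not self:
            raise AnalysisError(
                "column %r derives from a different table" % expr.column.name
            )
        return sum(1 for r in self.rows if r[expr.column.name] is not None)


sdf = Table([
    {"foo": 1.0, "bar": -1.0},
    {"foo": 2.0, "bar": 3.0},
    {"foo": 3.0, "bar": 0.5},
])
sdf2 = sdf.filter(lambda r: r["bar"] > 0)

# Aggregating the filtered table over its own column honors the predicate.
filtered_count = sdf2.agg(Count(sdf2.col("foo")))

# Mixing tables, as in the report, is rejected rather than silently accepted.
try:
    sdf.agg(Count(sdf2.col("foo")))
    mixed_rejected = False
except AnalysisError:
    mixed_rejected = True
```

In this sketch the predicate survives only when the aggregate runs on the table that carries it, which mirrors the report's point: the call on {{sdf}} with a column from {{sdf2}} should fail loudly instead of counting every row.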