I'm trying to do some simple counting and aggregation in an IPython notebook with Spark 1.4.0, and I have encountered behavior that looks like a bug.
When I try to filter rows out of a DataFrame with a column named "count", I get a large error message. I would just avoid naming things "count", except that it is the default column name produced by the count() operation in pyspark.sql.GroupedData. The small example program below demonstrates the issue:

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)
    dataFrame = sc.parallelize([("foo",), ("foo",), ("bar",)]).toDF(["title"])
    counts = dataFrame.groupBy('title').count()
    counts.filter("title = 'foo'").show()  # Works
    counts.filter("count > 1").show()      # Errors out

I can also reproduce the issue in a PySpark shell session by entering these commands. I suspect that the error has something to do with Spark trying to call the count() function in place of looking at the "count" column. The error message is as follows:

    Py4JJavaError                             Traceback (most recent call last)
    <ipython-input-29-62a1b7c71f21> in <module>()
    ----> 1 counts.filter("count > 1").show()  # Errors out

    C:\Users\User\Downloads\spark-1.4.0-bin-hadoop2.6\python\pyspark\sql\dataframe.pyc in filter(self, condition)
        774         """
        775         if isinstance(condition, basestring):
    --> 776             jdf = self._jdf.filter(condition)
        777         elif isinstance(condition, Column):
        778             jdf = self._jdf.filter(condition._jc)

    C:\Python27\lib\site-packages\py4j\java_gateway.pyc in __call__(self, *args)
        536         answer = self.gateway_client.send_command(command)
        537         return_value = get_return_value(answer, self.gateway_client,
    --> 538                 self.target_id, self.name)
        539
        540         for temp_arg in temp_args:

    C:\Python27\lib\site-packages\py4j\protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
        298                 raise Py4JJavaError(
        299                     'An error occurred while calling {0}{1}{2}.\n'.
    --> 300                     format(target_id, '.', name), value)
        301             else:
        302                 raise Py4JError(

    Py4JJavaError: An error occurred while calling o229.filter.
    : java.lang.RuntimeException: [1.7] failure: ``('' expected but `>' found

    count > 1
          ^
        at scala.sys.package$.error(package.scala:27)
        at org.apache.spark.sql.catalyst.SqlParser.parseExpression(SqlParser.scala:45)
        at org.apache.spark.sql.DataFrame.filter(DataFrame.scala:652)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Unknown Source)

Is there a recommended workaround for the inability to filter on a column named "count"? Do I have to create a new DataFrame and rename the column just to work around this bug? If so, what's the best way to do that?

Thanks,

-- Matthew Young