I'm trying to do some simple counting and aggregation in an IPython notebook 
with Spark 1.4.0 and I have encountered behavior that looks like a bug.

When I try to filter rows of a DataFrame on a column named count, I get a 
large error message. I would just avoid naming columns count, except that 
count is the default column name produced by the count() aggregation in 
pyspark.sql.GroupedData.

The small example program below demonstrates the issue.

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
dataFrame = sc.parallelize([("foo",), ("foo",), ("bar",)]).toDF(["title"])
counts = dataFrame.groupBy('title').count()
counts.filter("title = 'foo'").show() # Works
counts.filter("count > 1").show()     # Errors out


I can also reproduce the issue in a plain PySpark shell session by entering 
the same commands.

I suspect the error has something to do with Spark parsing count as a call to 
the count() function instead of a reference to the count column.

The error message is as follows:


Py4JJavaError                             Traceback (most recent call last)
<ipython-input-29-62a1b7c71f21> in <module>()
----> 1 counts.filter("count > 1").show() # Errors Out

C:\Users\User\Downloads\spark-1.4.0-bin-hadoop2.6\python\pyspark\sql\dataframe.pyc
 in filter(self, condition)
    774         """
    775         if isinstance(condition, basestring):
--> 776             jdf = self._jdf.filter(condition)
    777         elif isinstance(condition, Column):
    778             jdf = self._jdf.filter(condition._jc)

C:\Python27\lib\site-packages\py4j\java_gateway.pyc in __call__(self, *args)
    536         answer = self.gateway_client.send_command(command)
    537         return_value = get_return_value(answer, self.gateway_client,
--> 538                 self.target_id, self.name)
    539
    540         for temp_arg in temp_args:

C:\Python27\lib\site-packages\py4j\protocol.pyc in get_return_value(answer, 
gateway_client, target_id, name)
    298                 raise Py4JJavaError(
    299                     'An error occurred while calling {0}{1}{2}.\n'.
--> 300                     format(target_id, '.', name), value)
    301             else:
    302                 raise Py4JError(

Py4JJavaError: An error occurred while calling o229.filter.
: java.lang.RuntimeException: [1.7] failure: ``('' expected but `>' found

count > 1
      ^
        at scala.sys.package$.error(package.scala:27)
        at 
org.apache.spark.sql.catalyst.SqlParser.parseExpression(SqlParser.scala:45)
        at org.apache.spark.sql.DataFrame.filter(DataFrame.scala:652)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Unknown Source)



Is there a recommended workaround for the inability to filter on a column 
named count? Do I have to create a new DataFrame with the column renamed just 
to work around this bug? If so, what's the best way to do that?
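
For reference, here are the two workarounds I've been sketching: passing a 
Column expression to filter() so the SQL parser never sees the bare word 
count, and renaming the column with withColumnRenamed first. I haven't 
confirmed both behave identically on 1.4.0, so treat this as a sketch (it 
assumes an existing SparkContext sc, as in the example above):

```python
from pyspark.sql import SQLContext
from pyspark.sql.functions import col

sqlContext = SQLContext(sc)  # assumes an existing SparkContext `sc`
counts = sc.parallelize([("foo",), ("foo",), ("bar",)]) \
           .toDF(["title"]).groupBy("title").count()

# Workaround 1: build a Column expression instead of a SQL string,
# so the string parser is bypassed entirely.
counts.filter(counts["count"] > 1).show()
counts.filter(col("count") > 1).show()  # equivalent

# Workaround 2: rename the column, then string filters work again.
renamed = counts.withColumnRenamed("count", "cnt")
renamed.filter("cnt > 1").show()
```

Both show() calls in workaround 1 should print only the ("foo", 2) row, since 
"foo" is the only title appearing more than once.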

Thanks,

-- Matthew Young

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
