Re: [PySpark DataFrame] When a Row is not a Row
We finally fixed this in 1.5 (the next release); see https://github.com/apache/spark/pull/7301

On Sat, Jul 11, 2015 at 10:32 PM, Jerry Lam chiling...@gmail.com wrote:
> Hi guys, I just hit the same problem. It is very confusing when Row is not
> the same Row type at runtime. Worse, when I use Spark in local mode, the
> Row *is* the same Row type, so my test cases pass but the application fails
> when I deploy it. Can someone suggest a workaround?
>
> Best Regards,
> Jerry
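Until that fix lands, one workaround is to duck-type on the `asDict()` method instead of comparing against `pyspark.sql.types.Row`, since both the public Row class and the classes Spark SQL creates on the fly expose it. A sketch (`FakeRow` is a hypothetical stand-in used here only so the example runs without a SparkContext):

```python
def is_row_like(obj):
    """Duck-type check: treat anything exposing a callable asDict() as a Row.

    This sidesteps type(obj) == pyspark.sql.types.Row, which fails because
    Spark SQL creates a distinct namedtuple-based Row class at runtime.
    """
    return callable(getattr(obj, "asDict", None))


class FakeRow(object):
    """Hypothetical stand-in for a Spark SQL Row, for illustration only."""
    def asDict(self):
        return {"name": "Jerry"}


print(is_row_like(FakeRow()))          # True
print(is_row_like({"name": "Jerry"}))  # False: plain dicts have no asDict()
```

Because the check is structural rather than nominal, it behaves the same in local mode and on a cluster.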
Re: [PySpark DataFrame] When a Row is not a Row
Hi guys, I just hit the same problem. It is very confusing when Row is not the same Row type at runtime. Worse, when I use Spark in local mode, the Row *is* the same Row type, so my test cases pass but the application fails when I deploy it. Can someone suggest a workaround?

Best Regards,
Jerry
Re: [PySpark DataFrame] When a Row is not a Row
Is there some way around this? For example, can Row just be an implementation of namedtuple throughout?

    from collections import namedtuple

    class Row(namedtuple(...)):
        ...

From a user perspective, it's confusing that there are two different implementations of the Row class with the same name. In my case, I was writing a method to recursively convert a Row to a dict (since a Row can contain other Rows). I couldn't directly check type(obj) == pyspark.sql.types.Row, so I ended up having to do it like this:

    def row_to_dict(obj):
        """Take a PySpark Row and convert it, and any of its nested Row
        objects, into Python dictionaries."""
        if isinstance(obj, list):
            return [row_to_dict(x) for x in obj]
        else:
            try:
                # We can't reliably check that this is a Row object
                # due to some weird bug.
                d = obj.asDict()
                return {k: row_to_dict(v) for k, v in d.iteritems()}
            except AttributeError:
                return obj

That comment about a "weird bug" was my initial reaction, though now I understand that we have two implementations of Row. Isn't this worth fixing? It's just going to confuse people, IMO.

Nick

On Tue, May 12, 2015 at 10:22 PM, Davies Liu dav...@databricks.com wrote:
> The class (called Row) for rows from Spark SQL is created on the fly and is
> different from pyspark.sql.Row (the public API for users to create Rows).
> We did it this way to get better performance when accessing the columns.
> Basically, the rows are just named tuples (called `Row`).
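For what it's worth, the conversion above can be exercised without a cluster against a minimal stand-in (`FakeRow` is hypothetical, not the real PySpark class; this sketch uses Python 3's `items()` where the original used `iteritems()`):

```python
class FakeRow(object):
    """Hypothetical stand-in for a PySpark Row: exposes asDict()."""
    def __init__(self, **fields):
        self._fields = fields

    def asDict(self):
        return dict(self._fields)


def row_to_dict(obj):
    """Recursively convert Row-like objects (and lists of them) to dicts."""
    if isinstance(obj, list):
        return [row_to_dict(x) for x in obj]
    try:
        d = obj.asDict()
    except AttributeError:
        return obj  # a plain value: leave it alone
    return {k: row_to_dict(v) for k, v in d.items()}


nested = FakeRow(name="Nick", runs=[FakeRow(test="perf", value=42)])
print(row_to_dict(nested))
# {'name': 'Nick', 'runs': [{'test': 'perf', 'value': 42}]}
```

The `except AttributeError` (rather than a bare `except`) keeps the duck-typing narrow: only a missing `asDict` falls through to the plain-value branch.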
Re: [PySpark DataFrame] When a Row is not a Row
The class (called Row) for rows from Spark SQL is created on the fly and is different from pyspark.sql.Row (the public API for users to create Rows). We did it this way to get better performance when accessing the columns. Basically, the rows are just named tuples (called `Row`).

--
Davies Liu

On Tuesday, May 12, 2015 at 4:49 AM, Nicholas Chammas wrote:
> This is really strange.
>
>     # Spark 1.3.1
>     print type(results)
>     <class 'pyspark.sql.dataframe.DataFrame'>
>
>     a = results.take(1)[0]
>     print type(a)
>     <class 'pyspark.sql.types.Row'>
>
>     print pyspark.sql.types.Row
>     <class 'pyspark.sql.types.Row'>
>
>     print type(a) == pyspark.sql.types.Row
>     False
>
>     print isinstance(a, pyspark.sql.types.Row)
>     False
>
> If I set a as follows, then the type checks pass fine.
>
>     a = pyspark.sql.types.Row('name')('Nick')
>
> Is this a bug? What can I do to narrow down the source? results is a
> massive DataFrame of spark-perf results.
>
> Nick
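Davies's explanation can be reproduced with plain namedtuples (a stand-in illustration; `PublicRow` and `DynamicRow` are hypothetical names, not Spark's):

```python
from collections import namedtuple

# Stand-in for pyspark.sql.types.Row (the public class)...
PublicRow = namedtuple("Row", ["name"])
# ...and for the Row class Spark SQL creates on the fly per schema.
DynamicRow = namedtuple("Row", ["name"])

a = DynamicRow(name="Nick")
print(type(a).__name__)             # 'Row': same name as PublicRow...
print(type(a) == PublicRow)         # False: ...but a different class object
print(isinstance(a, PublicRow))     # False
print(a == PublicRow(name="Nick"))  # True: tuple *equality* still works
```

This is exactly the behavior in the original report: the class names (and their reprs) match, so the two classes look identical when printed, yet `type(...) ==` and `isinstance` compare the class objects themselves and fail.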
[PySpark DataFrame] When a Row is not a Row
This is really strange.

    # Spark 1.3.1
    print type(results)
    <class 'pyspark.sql.dataframe.DataFrame'>

    a = results.take(1)[0]
    print type(a)
    <class 'pyspark.sql.types.Row'>

    print pyspark.sql.types.Row
    <class 'pyspark.sql.types.Row'>

    print type(a) == pyspark.sql.types.Row
    False

    print isinstance(a, pyspark.sql.types.Row)
    False

If I set a as follows, then the type checks pass fine.

    a = pyspark.sql.types.Row('name')('Nick')

Is this a bug? What can I do to narrow down the source? results is a massive DataFrame of spark-perf results.

Nick
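One generic way to narrow down a "same name, different class" situation like this (nothing here is PySpark-specific; `RowOne`/`RowTwo` are hypothetical stand-ins) is to compare class identity rather than class name:

```python
from collections import namedtuple

# Two distinct classes that happen to share the name 'Row'.
RowOne = namedtuple("Row", ["name"])
RowTwo = namedtuple("Row", ["name"])

a = RowTwo(name="Nick")

# The reprs of the two classes look identical, so inspect identity instead:
print(type(a) is RowOne)                      # False: different class objects
print(id(type(a)) == id(RowOne))              # False: distinct objects in memory
print(type(a).__module__, RowOne.__module__)  # where each class was defined
```

If `type(a) is SomeClass` is False while the names match, you know two class objects are in play, which is precisely what `==` and `isinstance` are reacting to.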
Re: [PySpark DataFrame] When a Row is not a Row
In Row#equals():

    while (i < len) {
      if (apply(i) != that.apply(i)) {

Shouldn't '!=' be !apply(i).equals(that.apply(i))?

Cheers

On Mon, May 11, 2015 at 1:49 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:
> This is really strange.
>
>     # Spark 1.3.1
>     print type(results)
>     <class 'pyspark.sql.dataframe.DataFrame'>
>
>     a = results.take(1)[0]
>     print type(a)
>     <class 'pyspark.sql.types.Row'>
>
>     print pyspark.sql.types.Row
>     <class 'pyspark.sql.types.Row'>
>
>     print type(a) == pyspark.sql.types.Row
>     False
>
>     print isinstance(a, pyspark.sql.types.Row)
>     False
>
> If I set a as follows, then the type checks pass fine.
>
>     a = pyspark.sql.types.Row('name')('Nick')
>
> Is this a bug? What can I do to narrow down the source? results is a
> massive DataFrame of spark-perf results.
>
> Nick