Re: [PySpark DataFrame] When a Row is not a Row

2015-07-12 Thread Davies Liu
We finally fixed this in 1.5 (the next release); see
https://github.com/apache/spark/pull/7301
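
After that change, the row class a DataFrame hands back should be the same
class as the public API, so ordinary type checks work. A minimal sketch of
the expected behavior (assuming Spark 1.5+ and the sqlContext from the
PySpark shell):

from pyspark.sql import Row

df = sqlContext.createDataFrame([Row(name='Nick')])
a = df.take(1)[0]

# With the fix, the runtime class and the public Row class agree, so
# isinstance checks no longer fail the way they did on 1.3/1.4.
print isinstance(a, Row)   # expected: True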

On Sat, Jul 11, 2015 at 10:32 PM, Jerry Lam chiling...@gmail.com wrote:
 Hi guys,

 I just hit the same problem. It is very confusing when Row is not the same
 Row type at runtime. The worst part is that when I use Spark in local mode,
 the Row *is* the same Row type, so the test cases pass, but the application
 fails when I deploy it.

 Can someone suggest a workaround?

 Best Regards,

 Jerry






Re: [PySpark DataFrame] When a Row is not a Row

2015-07-11 Thread Jerry Lam
Hi guys,

I just hit the same problem. It is very confusing when Row is not the same
Row type at runtime. The worst part is that when I use Spark in local mode,
the Row *is* the same Row type, so the test cases pass, but the application
fails when I deploy it.

Can someone suggest a workaround?
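
In the meantime, the best I have come up with is a duck-typing check (a
sketch; it assumes both Row variants subclass tuple and expose asDict(),
which matches what I see at runtime):

def looks_like_row(obj):
    # Heuristic: both the public pyspark.sql.Row and the Row class that
    # Spark SQL creates on the fly behave like named tuples with an
    # asDict() method, so check behavior instead of class identity.
    return isinstance(obj, tuple) and hasattr(obj, 'asDict')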

Best Regards,

Jerry






Re: [PySpark DataFrame] When a Row is not a Row

2015-05-13 Thread Nicholas Chammas
Is there some way around this? For example, can Row just be an
implementation of namedtuple throughout?

from collections import namedtuple

class Row(namedtuple('Row', ...)):  # a sketch: one shared namedtuple-based Row
    ...

From a user perspective, it’s confusing that there are 2 different
implementations of the Row class with the same name.

In my case, I was writing a method to recursively convert a Row to a dict
(since a Row can contain other Rows).

I couldn’t directly check type(obj) == pyspark.sql.types.Row, so I ended up
having to do it like this:

def row_to_dict(obj):
    """
    Take a PySpark Row and convert it, and any of its nested Row
    objects, into Python dictionaries.
    """
    if isinstance(obj, list):
        return [row_to_dict(x) for x in obj]
    else:
        try:
            # We can't reliably check that this is a Row object
            # due to some weird bug.
            d = obj.asDict()
            return {k: row_to_dict(v) for k, v in d.iteritems()}
        except:
            return obj
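
For example, on a hypothetical nested record it produces plain dicts all the
way down (key order may vary):

from pyspark.sql import Row

address = Row(city='Toronto', country='CA')
person = Row(name='Nick', address=address, tags=['dev', 'spark'])

print row_to_dict(person)
# {'address': {'city': 'Toronto', 'country': 'CA'},
#  'name': 'Nick', 'tags': ['dev', 'spark']}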

That comment about a “weird bug” was my initial reaction, though now I
understand that we have 2 implementations of Row.

Isn’t this worth fixing? It’s just going to confuse people, IMO.

Nick

On Tue, May 12, 2015 at 10:22 PM Davies Liu dav...@databricks.com wrote:

 The class (called Row) for rows from Spark SQL is created on the fly and is
 different from pyspark.sql.Row (which is the public API for users to create
 Rows).

 The reason we did it this way is that we want better performance when
 accessing the columns. Basically, the rows are just named tuples (called
 `Row`).

 --
 Davies Liu

 On Tuesday, May 12, 2015, at 4:49 AM, Nicholas Chammas wrote:

 This is really strange.

 # Spark 1.3.1
 >>> print type(results)
 <class 'pyspark.sql.dataframe.DataFrame'>

 >>> a = results.take(1)[0]

 >>> print type(a)
 <class 'pyspark.sql.types.Row'>

 >>> print pyspark.sql.types.Row
 <class 'pyspark.sql.types.Row'>

 >>> print type(a) == pyspark.sql.types.Row
 False

 >>> print isinstance(a, pyspark.sql.types.Row)
 False

 If I set a as follows, then the type checks pass fine.

 a = pyspark.sql.types.Row('name')('Nick')

 Is this a bug? What can I do to narrow down the source?

 results is a massive DataFrame of spark-perf results.

 Nick


Re: [PySpark DataFrame] When a Row is not a Row

2015-05-12 Thread Davies Liu
The class (called Row) for rows from Spark SQL is created on the fly and is
different from pyspark.sql.Row (which is the public API for users to create
Rows).

The reason we did it this way is that we want better performance when
accessing the columns. Basically, the rows are just named tuples (called
`Row`).
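
To illustrate the idea (a sketch of the mechanism only, not Spark's actual
code): building a class per schema means column access is plain attribute
access on a named tuple, which is cheap.

from collections import namedtuple

def make_row_class(field_names):
    # Build a fresh named-tuple class for this schema; instances are
    # ordinary tuples, so column access is a fast index/attribute lookup.
    return namedtuple('Row', field_names)

PersonRow = make_row_class(['name', 'age'])
row = PersonRow('Nick', 30)
print row.name   # 'Nick'
print row[1]     # 30

Each generated class is a distinct object, which is exactly why
type(a) == pyspark.sql.types.Row comes back False.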

--  
Davies Liu

On Tuesday, May 12, 2015, at 4:49 AM, Nicholas Chammas wrote:

 This is really strange.

 # Spark 1.3.1
 >>> print type(results)
 <class 'pyspark.sql.dataframe.DataFrame'>

 >>> a = results.take(1)[0]

 >>> print type(a)
 <class 'pyspark.sql.types.Row'>

 >>> print pyspark.sql.types.Row
 <class 'pyspark.sql.types.Row'>

 >>> print type(a) == pyspark.sql.types.Row
 False

 >>> print isinstance(a, pyspark.sql.types.Row)
 False

 If I set a as follows, then the type checks pass fine.

 a = pyspark.sql.types.Row('name')('Nick')

 Is this a bug? What can I do to narrow down the source?

 results is a massive DataFrame of spark-perf results.

 Nick




[PySpark DataFrame] When a Row is not a Row

2015-05-11 Thread Nicholas Chammas
This is really strange.

# Spark 1.3.1
>>> print type(results)
<class 'pyspark.sql.dataframe.DataFrame'>

>>> a = results.take(1)[0]

>>> print type(a)
<class 'pyspark.sql.types.Row'>

>>> print pyspark.sql.types.Row
<class 'pyspark.sql.types.Row'>

>>> print type(a) == pyspark.sql.types.Row
False

>>> print isinstance(a, pyspark.sql.types.Row)
False

If I set a as follows, then the type checks pass fine.

a = pyspark.sql.types.Row('name')('Nick')

Is this a bug? What can I do to narrow down the source?
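
The closest I have come to narrowing it down is comparing the class objects
directly rather than their printed names (a sketch, continuing the session
above):

>>> type(a) is pyspark.sql.types.Row    # False: two distinct class objects
>>> id(type(a)), id(pyspark.sql.types.Row)    # two different ids
>>> type(a).__module__, type(a).__name__    # the same-looking name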

results is a massive DataFrame of spark-perf results.

Nick


Re: [PySpark DataFrame] When a Row is not a Row

2015-05-11 Thread Ted Yu
In Row#equals():

  while (i < len) {
    if (apply(i) != that.apply(i)) {

Should '!=' be !apply(i).equals(that.apply(i)) instead?

Cheers

On Mon, May 11, 2015 at 1:49 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 This is really strange.

 # Spark 1.3.1
 >>> print type(results)
 <class 'pyspark.sql.dataframe.DataFrame'>

 >>> a = results.take(1)[0]

 >>> print type(a)
 <class 'pyspark.sql.types.Row'>

 >>> print pyspark.sql.types.Row
 <class 'pyspark.sql.types.Row'>

 >>> print type(a) == pyspark.sql.types.Row
 False

 >>> print isinstance(a, pyspark.sql.types.Row)
 False

 If I set a as follows, then the type checks pass fine.

 a = pyspark.sql.types.Row('name')('Nick')

 Is this a bug? What can I do to narrow down the source?

 results is a massive DataFrame of spark-perf results.

 Nick