Is there some way around this? For example, can Row just be an
implementation of namedtuple throughout?
from collections import namedtuple
class Row(namedtuple('Row', field_names)):
...
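Something like this rough sketch is what I have in mind (make_row_class and the field names are made up for illustration, not actual PySpark internals):

from collections import namedtuple

def make_row_class(field_names):
    """Build a Row class on top of namedtuple for the given column names."""
    base = namedtuple('Row', field_names)

    class Row(base):
        def asDict(self):
            return dict(zip(self._fields, self))

    return Row

PersonRow = make_row_class(['name', 'age'])
nick = PersonRow('Nick', 30)
print nick.asDict()  # a plain dict with keys 'name' and 'age'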
From a user perspective, it’s confusing that there are 2 different
implementations of the Row class with the same name.
In my case, I was writing a method to recursively convert a Row to a dict
(since a Row can contain other Rows).
I couldn’t directly check type(obj) == pyspark.sql.types.Row, so I ended up
having to do it like this:
def row_to_dict(obj):
    """
    Take a PySpark Row and convert it, and any of its nested Row
    objects, into Python dictionaries.
    """
    if isinstance(obj, list):
        return [row_to_dict(x) for x in obj]
    else:
        try:
            # We can't reliably check that this is a row object
            # due to some weird bug.
            d = obj.asDict()
            return {k: row_to_dict(v) for k, v in d.iteritems()}
        except AttributeError:
            return obj
That comment about a “weird bug” was my initial reaction, though now I
understand that we have 2 implementations of Row.
Isn’t this worth fixing? It’s just going to confuse people, IMO.
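A duck-typing check on asDict() might be a safer workaround in the meantime. Rough sketch (is_row_like is just a name I made up; this assumes both Row implementations expose asDict()):

import pyspark.sql.types

def is_row_like(obj):
    # isinstance() doesn't work here because the rows returned by
    # DataFrame.take()/collect() come from a different class than
    # pyspark.sql.types.Row, so look for the asDict() method instead.
    return hasattr(obj, 'asDict')

a = results.take(1)[0]  # results is the spark-perf DataFrame from below
print is_row_like(a)                                      # True
print is_row_like(pyspark.sql.types.Row('name')('Nick'))  # True
print is_row_like({'name': 'Nick'})                       # False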
Nick
On Tue, May 12, 2015 at 10:22 PM Davies Liu <[email protected]> wrote:
> The class (called Row) for rows from Spark SQL is created on the fly and is
> different from pyspark.sql.Row (which is a public API for users to create
> Rows).
>
> The reason we did it this way is that we wanted better performance when
> accessing the columns. Basically, the rows are just named tuples (called
> `Row`).
>
> --
> Davies Liu
> Sent with Sparrow <http://www.sparrowmailapp.com/?sig>
>
> On Tuesday, May 12, 2015, at 4:49 AM, Nicholas Chammas wrote:
>
> This is really strange.
>
> # Spark 1.3.1
> print type(results)
>
> <class 'pyspark.sql.dataframe.DataFrame'>
>
> a = results.take(1)[0]
>
>
> print type(a)
>
> <class 'pyspark.sql.types.Row'>
>
> print pyspark.sql.types.Row
>
> <class 'pyspark.sql.types.Row'>
>
> print type(a) == pyspark.sql.types.Row
>
> False
>
> print isinstance(a, pyspark.sql.types.Row)
>
> False
>
> If I set a as follows, then the type checks pass fine.
>
> a = pyspark.sql.types.Row('name')('Nick')
>
> Is this a bug? What can I do to narrow down the source?
>
> results is a massive DataFrame of spark-perf results.
>
> Nick