Is there some way around this? For example, can Row just be an
implementation of namedtuple throughout?
from collections import namedtuple
class Row(namedtuple('Row', field_names)):
...
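Something like this rough sketch is what I have in mind (make_row_class and the field names are made up for illustration, not actual PySpark internals):

from collections import namedtuple

def make_row_class(field_names):
    """Build a Row class on top of namedtuple for the given column names."""
    base = namedtuple('Row', field_names)

    class Row(base):
        def asDict(self):
            return dict(zip(self._fields, self))

    return Row

PersonRow = make_row_class(['name', 'age'])
nick = PersonRow('Nick', 30)
print nick.asDict()  # a plain dict with keys 'name' and 'age'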
From a user perspective, it’s confusing that there are 2 different
implementations of the Row class with the same name.
In my case, I was writing a method to recursively convert a Row to a dict
(since a Row can contain other Rows).
I couldn’t directly check type(obj) == pyspark.sql.types.Row, so I ended up
having to do it like this:
def row_to_dict(obj):
    """
    Take a PySpark Row and convert it, and any of its nested Row
    objects, into Python dictionaries.
    """
    if isinstance(obj, list):
        return [row_to_dict(x) for x in obj]
    else:
        try:
            # We can't reliably check that this is a row object
            # due to some weird bug.
            d = obj.asDict()
            return {k: row_to_dict(v) for k, v in d.iteritems()}
        except AttributeError:
            return obj
That comment about a “weird bug” was my initial reaction, though now I
understand that we have 2 implementations of Row.
Isn’t this worth fixing? It’s just going to confuse people, IMO.
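A duck-typing check on asDict() might be a safer workaround in the meantime. Rough sketch (is_row_like is just a name I made up; this assumes both Row implementations expose asDict()):

import pyspark.sql.types

def is_row_like(obj):
    # isinstance() doesn't work here because the rows returned by
    # DataFrame.take()/collect() come from a different class than
    # pyspark.sql.types.Row, so look for the asDict() method instead.
    return hasattr(obj, 'asDict')

a = results.take(1)[0]  # results is the spark-perf DataFrame from below
print is_row_like(a)                                      # True
print is_row_like(pyspark.sql.types.Row('name')('Nick'))  # True
print is_row_like({'name': 'Nick'})                       # False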
Nick
On Tue, May 12, 2015 at 10:22 PM Davies Liu <[email protected]> wrote:
> The class (called Row) for rows from Spark SQL is created on the fly and is
> different from pyspark.sql.Row (which is a public API for users to create
> Rows).
>
> The reason we did it this way is that we wanted better performance when
> accessing the columns. Basically, the rows are just named tuples (called
> `Row`).
>
> --
> Davies Liu
> Sent with Sparrow <http://www.sparrowmailapp.com/?sig>
>
> On Tuesday, May 12, 2015, at 4:49 AM, Nicholas Chammas wrote:
>
> This is really strange.
>
> # Spark 1.3.1
> print type(results)
>
> <class 'pyspark.sql.dataframe.DataFrame'>
>
> a = results.take(1)[0]
>
>
> print type(a)
>
> <class 'pyspark.sql.types.Row'>
>
> print pyspark.sql.types.Row
>
> <class 'pyspark.sql.types.Row'>
>
> print type(a) == pyspark.sql.types.Row
>
> False
>
> print isinstance(a, pyspark.sql.types.Row)
>
> False
>
> If I set a as follows, then the type checks pass fine.
>
> a = pyspark.sql.types.Row('name')('Nick')
>
> Is this a bug? What can I do to narrow down the source?
>
> results is a massive DataFrame of spark-perf results.
>
> Nick