Operations on DataFrames with User Defined Types in pyspark

Franklyn D'souza Thu, 11 Feb 2016 13:42:47 -0800

I'm using the UDT API to work with a custom Money datatype in DataFrames.
Here's how I have it set up:


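from pyspark.sql.types import StringType, UserDefinedType

# Money is my own class (not shown here); Money(s) builds one from a string.


class StringUDT(UserDefinedType):

    @classmethod
    def sqlType(cls):
        # Values are stored in the DataFrame as plain strings.
        return StringType()

    @classmethod
    def module(cls):
        return cls.__module__

    @classmethod
    def scalaUDT(cls):
        # No Scala counterpart is declared for this UDT.
        return ''

    def serialize(self, obj):
        return str(obj)

    def deserialize(self, datum):
        return Money(datum)


class MoneyUDT(StringUDT):
    pass

Money.__UDT__ = MoneyUDT()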
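
The Money class itself is not shown here. As a minimal sketch of the round trip that serialize and deserialize are meant to perform, assuming a hypothetical Decimal-backed Money (the real class may well differ) and no Spark at all:

```python
from decimal import Decimal


class Money(object):
    """Hypothetical stand-in for the Money class, which the post does not show."""

    def __init__(self, amount):
        # Accept a string such as "25.0" and keep it as an exact decimal.
        self.amount = Decimal(amount)

    def __str__(self):
        return str(self.amount)

    def __eq__(self, other):
        return isinstance(other, Money) and self.amount == other.amount

    def __hash__(self):
        return hash(self.amount)


def serialize(obj):
    # Mirrors StringUDT.serialize: the stored value is a plain string.
    return str(obj)


def deserialize(datum):
    # Mirrors StringUDT.deserialize: rebuild a Money from the stored string.
    return Money(datum)


stored = serialize(Money("25.0"))
restored = deserialize(stored)
```

If the real Money behaves like this stand-in, stored is an ordinary string and restored compares equal to the original value.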

I then create a DataFrame like so:

df = sc.sql.createDataFrame([[Money("25.0")], [Money("100.0")]], spark_schema)

However, I've run into a few snags with this. DataFrames created using
this UDT cannot be ordered (orderBy) by the UDT column, and I can't
union two DataFrames that have this UDT on one of their columns.
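
One related point, offered as a sketch rather than as an explanation of the failures: because sqlType is StringType, the stored values are strings, so any ordering applied to the underlying representation would presumably be lexicographic rather than numeric. The difference shows up in plain Python, with no Spark involved:

```python
from decimal import Decimal

# The two amounts from the createDataFrame call, in serialized (string) form.
serialized = ["25.0", "100.0"]

# Sorting the strings compares character by character, so "100.0" sorts first.
lexicographic = sorted(serialized)

# Sorting by numeric value gives the order one would expect for money amounts.
numeric = sorted(serialized, key=Decimal)
```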

Is this expected behaviour, or is my UDT setup wrong?
