GitHub user HyukjinKwon opened a pull request:
https://github.com/apache/spark/pull/22655
[SPARK-25666][PYTHON] Internally document type conversion between Python
data and SQL types in normal UDFs
## What changes were proposed in this pull request?
We are facing some problems with type conversions between Python data and
SQL types in UDFs (Pandas UDFs as well).
It is even difficult to identify the problems (see
https://github.com/apache/spark/pull/20163 and
https://github.com/apache/spark/pull/22610).
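One concrete source of ambiguity, for example, is that Python's `bool` is a subclass of `int`, so a single returned value can plausibly match several SQL types (an illustrative aside, not part of the PR itself):

```python
# In Python, bool is a subclass of int, so True is also a valid int value.
# A UDF returning True could therefore plausibly map to BooleanType or to
# any of the integral SQL types -- one reason the value-to-SQL-type
# conversion table needs to be documented explicitly.
print(isinstance(True, int))  # True
print(True == 1)              # True
print(int(True))              # 1
```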
This PR aims to internally document the type conversion table. Some of the
conversions look buggy and should be fixed.
```python
import array
import datetime
from decimal import Decimal
from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark.sql.functions import udf
data = [
    None,
    True,
    1,
    1L,  # Python 2 only
    "a",
    u"a",
    datetime.date(1970, 1, 1),
    datetime.datetime(1970, 1, 1, 0, 0),
    1.0,
    array.array("i", [1]),
    [1],
    (1,),
    bytearray([65, 66, 67]),
    Decimal(1),
    {"a": 1},
    Row(a=1),
    Row("a")(1),
]

types = [
    NullType(),
    BooleanType(),
    ByteType(),
    ShortType(),
    IntegerType(),
    LongType(),
    StringType(),
    DateType(),
    TimestampType(),
    FloatType(),
    DoubleType(),
    ArrayType(IntegerType()),
    BinaryType(),
    DecimalType(10, 0),
    MapType(StringType(), IntegerType()),
    StructType([StructField("_1", IntegerType())]),
]
# Run in a PySpark (Python 2) shell, where `spark` is predefined.
df = spark.range(1)
results = []
for t in types:
    result = []
    for v in data:
        try:
            row = df.select(udf(lambda: v, t)()).first()
            result.append(row[0])
        except Exception:
            result.append("X")
    results.append([t.simpleString()] + map(str, result))

schema = ["SQL Type \\ Python Value(Type)"] + map(
    lambda v: "%s(%s)" % (str(v), type(v).__name__), data)
strings = spark.createDataFrame(results, schema=schema)._jdf.showString(
    20, 20, False)
print("\n".join(map(lambda line: " # %s # noqa" % line,
                    strings.strip().split("\n"))))
```
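Note that the snippet above is Python 2 only: besides the `1L` literal, it relies on `map` returning a list that can be concatenated to another list. A Python 3 port would need explicit `list(map(...))` calls; a minimal sketch of the difference:

```python
# In Python 2, map returns a list; in Python 3 it returns an iterator,
# so `list + map(...)` raises TypeError and needs an explicit list() call.
data = [None, True, 1]
header = ["int"]
try:
    combined = header + map(str, data)        # works on Python 2 only
except TypeError:                             # Python 3 path
    combined = header + list(map(str, data))
print(combined)  # ['int', 'None', 'True', '1']
```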
## How was this patch tested?
Manually tested, and ran lint checks.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/HyukjinKwon/spark SPARK-25666
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22655.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #22655
----
commit 3084be1de3ff58a9258dacfb8d7cf575df3fb3c9
Author: hyukjinkwon <gurwls223@...>
Date: 2018-10-06T10:59:46Z
Internally document type conversion between Python data and SQL types in
UDFs
----