spark git commit: [SPARK-25798][PYTHON] Internally document type conversion between Pandas data and SQL types in Pandas UDFs

cutlerb Wed, 24 Oct 2018 10:04:43 -0700

Repository: spark
Updated Branches:
  refs/heads/master b19a28dea -> 7251be0c0



[SPARK-25798][PYTHON] Internally document type conversion between Pandas data 
and SQL types in Pandas UDFs

## What changes were proposed in this pull request?

We are facing some problems about type conversions between Pandas data and SQL 
types in Pandas UDFs.
It's even difficult to identify the problems (see #20163 and #22610).

This PR targets to internally document the type conversion table. Some of them 
looks buggy and we should fix them.

Table can be generated via the codes below:

```python
from pyspark.sql.types import *
from pyspark.sql.functions import pandas_udf

columns = [
    ('none', 'object(NoneType)'),
    ('bool', 'bool'),
    ('int8', 'int8'),
    ('int16', 'int16'),
    ('int32', 'int32'),
    ('int64', 'int64'),
    ('uint8', 'uint8'),
    ('uint16', 'uint16'),
    ('uint32', 'uint32'),
    ('uint64', 'uint64'),
    ('float64', 'float16'),
    ('float64', 'float32'),
    ('float64', 'float64'),
    ('date', 'datetime64[ns]'),
    ('tz_aware_dates', 'datetime64[ns, US/Eastern]'),
    ('string', 'object(string)'),
    ('decimal', 'object(Decimal)'),
    ('array', 'object(array[int32])'),
    ('float128', 'float128'),
    ('complex64', 'complex64'),
    ('complex128', 'complex128'),
    ('category', 'category'),
    ('tdeltas', 'timedelta64[ns]'),
]

def create_dataframe():
    import pandas as pd
    import numpy as np
    import decimal
    pdf = pd.DataFrame({
        'none': [None, None],
        'bool': [True, False],
        'int8': np.arange(1, 3).astype('int8'),
        'int16': np.arange(1, 3).astype('int16'),
        'int32': np.arange(1, 3).astype('int32'),
        'int64': np.arange(1, 3).astype('int64'),
        'uint8': np.arange(1, 3).astype('uint8'),
        'uint16': np.arange(1, 3).astype('uint16'),
        'uint32': np.arange(1, 3).astype('uint32'),
        'uint64': np.arange(1, 3).astype('uint64'),
        'float16': np.arange(1, 3).astype('float16'),
        'float32': np.arange(1, 3).astype('float32'),
        'float64': np.arange(1, 3).astype('float64'),
        'float128': np.arange(1, 3).astype('float128'),
        'complex64': np.arange(1, 3).astype('complex64'),
        'complex128': np.arange(1, 3).astype('complex128'),
        'string': list('ab'),
        'array': pd.Series([np.array([1, 2, 3], dtype=np.int32), np.array([1, 
2, 3], dtype=np.int32)]),
        'decimal': pd.Series([decimal.Decimal('1'), decimal.Decimal('2')]),
        'date': pd.date_range('19700101', periods=2).values,
        'category': pd.Series(list("AB")).astype('category')})
    pdf['tdeltas'] = [pdf.date.diff()[1], pdf.date.diff()[0]]
    pdf['tz_aware_dates'] = pd.date_range('19700101', periods=2, 
tz='US/Eastern')
    return pdf

types =  [
    BooleanType(),
    ByteType(),
    ShortType(),
    IntegerType(),
    LongType(),
    FloatType(),
    DoubleType(),
    DateType(),
    TimestampType(),
    StringType(),
    DecimalType(10, 0),
    ArrayType(IntegerType()),
    MapType(StringType(), IntegerType()),
    StructType([StructField("_1", IntegerType())]),
    BinaryType(),
]

df = spark.range(2).repartition(1)
results = []
count = 0
total = len(types) * len(columns)
values = []
spark.sparkContext.setLogLevel("FATAL")
for t in types:
    result = []
    for column, pandas_t in columns:
        v = create_dataframe()[column][0]
        values.append(v)
        try:
            row = df.select(pandas_udf(lambda _: create_dataframe()[column], 
t)(df.id)).first()
            ret_str = repr(row[0])
        except Exception:
            ret_str = "X"
        result.append(ret_str)
        progress = "SQL Type: [%s]\n  Pandas Value(Type): %s(%s)]\n  Result 
Python Value: [%s]" % (
            t.simpleString(), v, pandas_t, ret_str)
        count += 1
        print("%s/%s:\n  %s" % (count, total, progress))
    results.append([t.simpleString()] + list(map(str, result)))

schema = ["SQL Type \\ Pandas Value(Type)"] + list(map(lambda values_column: 
"%s(%s)" % (values_column[0], values_column[1][1]), zip(values, columns)))
strings = spark.createDataFrame(results, schema=schema)._jdf.showString(20, 20, 
False)
print("\n".join(map(lambda line: "    # %s  # noqa" % line, 
strings.strip().split("\n"))))

```

This code is compatible with both Python 2 and 3 but the table was generated 
under Python 2.

## How was this patch tested?

Manually tested and lint check.

Closes #22795 from HyukjinKwon/SPARK-25798.

Authored-by: hyukjinkwon <[email protected]>
Signed-off-by: Bryan Cutler <[email protected]>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7251be0c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7251be0c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7251be0c

Branch: refs/heads/master
Commit: 7251be0c04f0380208e0197e559158a9e1400868
Parents: b19a28d
Author: hyukjinkwon <[email protected]>
Authored: Wed Oct 24 10:04:17 2018 -0700
Committer: Bryan Cutler <[email protected]>
Committed: Wed Oct 24 10:04:17 2018 -0700

----------------------------------------------------------------------
 python/pyspark/sql/functions.py | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/7251be0c/python/pyspark/sql/functions.py
----------------------------------------------------------------------
diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index 2694e77..8b2e423 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -3023,6 +3023,42 @@ def pandas_udf(f=None, returnType=None, 
functionType=None):
         conversion on returned data. The conversion is not guaranteed to be 
correct and results
         should be checked for accuracy by users.
     """
+
+    # The following table shows most of Pandas data and SQL type conversions 
in Pandas UDFs that
+    # are not yet visible to the user. Some of behaviors are buggy and might 
be changed in the near
+    # future. The table might have to be eventually documented externally.
+    # Please see SPARK-25798's PR to see the codes in order to generate the 
table below.
+    #
+    # 
+-----------------------------+----------------------+----------+-------+--------+--------------------+--------------------+--------+---------+---------+---------+------------+------------+------------+-----------------------------------+-----------------------------------------------------+-----------------+--------------------+-----------------------------+-------------+-----------------+------------------+-----------+--------------------------------+
  # noqa
+    # |SQL Type \ Pandas 
Value(Type)|None(object(NoneType))|True(bool)|1(int8)|1(int16)|            
1(int32)|            
1(int64)|1(uint8)|1(uint16)|1(uint32)|1(uint64)|1.0(float16)|1.0(float32)|1.0(float64)|1970-01-01
 00:00:00(datetime64[ns])|1970-01-01 00:00:00-05:00(datetime64[ns, 
US/Eastern])|a(object(string))|  1(object(Decimal))|[1 2 
3](object(array[int32]))|1.0(float128)|(1+0j)(complex64)|(1+0j)(complex128)|A(category)|1
 days 00:00:00(timedelta64[ns])|  # noqa
+    # 
+-----------------------------+----------------------+----------+-------+--------+--------------------+--------------------+--------+---------+---------+---------+------------+------------+------------+-----------------------------------+-----------------------------------------------------+-----------------+--------------------+-----------------------------+-------------+-----------------+------------------+-----------+--------------------------------+
  # noqa
+    # |                      boolean|                  None|      True|   
True|    True|                True|                True|    True|     True|     
True|     True|       False|       False|       False|                          
    False|                                                False|                
X|                   X|                            X|        False|            
False|             False|          X|                           False|  # noqa
+    # |                      tinyint|                  None|         1|      
1|       1|                   1|                   1|       X|        X|        
X|        X|           1|           1|           1|                             
     X|                                                    X|                X| 
                  X|                            X|            X|                
X|                 X|          0|                               X|  # noqa
+    # |                     smallint|                  None|         1|      
1|       1|                   1|                   1|       1|        X|        
X|        X|           1|           1|           1|                             
     X|                                                    X|                X| 
                  X|                            X|            X|                
X|                 X|          X|                               X|  # noqa
+    # |                          int|                  None|         1|      
1|       1|                   1|                   1|       1|        1|        
X|        X|           1|           1|           1|                             
     X|                                                    X|                X| 
                  X|                            X|            X|                
X|                 X|          X|                               X|  # noqa
+    # |                       bigint|                  None|         1|      
1|       1|                   1|                   1|       1|        1|        
1|        X|           1|           1|           1|                             
     0|                                       18000000000000|                X| 
                  X|                            X|            X|                
X|                 X|          X|                               X|  # noqa
+    # |                        float|                  None|       1.0|    
1.0|     1.0|                 1.0|                 1.0|     1.0|      1.0|      
1.0|      1.0|         1.0|         1.0|         1.0|                           
       X|                                                    X|                
X|1.401298464324817...|                            X|            X|             
   X|                 X|          X|                               X|  # noqa
+    # |                       double|                  None|       1.0|    
1.0|     1.0|                 1.0|                 1.0|     1.0|      1.0|      
1.0|      1.0|         1.0|         1.0|         1.0|                           
       X|                                                    X|                
X|                   X|                            X|            X|             
   X|                 X|          X|                               X|  # noqa
+    # |                         date|                  None|         X|      
X|       X|datetime.date(197...|                   X|       X|        X|        
X|        X|           X|           X|           X|               
datetime.date(197...|                                                    X|     
           X|                   X|                            X|            X|  
              X|                 X|          X|                               
X|  # noqa
+    # |                    timestamp|                  None|         X|      
X|       X|                   X|datetime.datetime...|       X|        X|        
X|        X|           X|           X|           X|               
datetime.datetime...|                                 datetime.datetime...|     
           X|                   X|                            X|            X|  
              X|                 X|          X|                               
X|  # noqa
+    # |                       string|                  None|       
u''|u'\x01'| u'\x01'|             u'\x01'|             u'\x01'| u'\x01'|  
u'\x01'|  u'\x01'|  u'\x01'|         u''|         u''|         u''|             
                     X|                                                    X|   
          u'a'|                   X|                            X|          
u''|              u''|               u''|          X|                           
    X|  # noqa
+    # |                decimal(10,0)|                  None|         X|      
X|       X|                   X|                   X|       X|        X|        
X|        X|           X|           X|           X|                             
     X|                                                    X|                X| 
       Decimal('1')|                            X|            X|                
X|                 X|          X|                               X|  # noqa
+    # |                   array<int>|                  None|         X|      
X|       X|                   X|                   X|       X|        X|        
X|        X|           X|           X|           X|                             
     X|                                                    X|                X| 
                  X|                    [1, 2, 3]|            X|                
X|                 X|          X|                               X|  # noqa
+    # |              map<string,int>|                     X|         X|      
X|       X|                   X|                   X|       X|        X|        
X|        X|           X|           X|           X|                             
     X|                                                    X|                X| 
                  X|                            X|            X|                
X|                 X|          X|                               X|  # noqa
+    # |               struct<_1:int>|                     X|         X|      
X|       X|                   X|                   X|       X|        X|        
X|        X|           X|           X|           X|                             
     X|                                                    X|                X| 
                  X|                            X|            X|                
X|                 X|          X|                               X|  # noqa
+    # |                       binary|                     X|         X|      
X|       X|                   X|                   X|       X|        X|        
X|        X|           X|           X|           X|                             
     X|                                                    X|                X| 
                  X|                            X|            X|                
X|                 X|          X|                               X|  # noqa
+    # 
+-----------------------------+----------------------+----------+-------+--------+--------------------+--------------------+--------+---------+---------+---------+------------+------------+------------+-----------------------------------+-----------------------------------------------------+-----------------+--------------------+-----------------------------+-------------+-----------------+------------------+-----------+--------------------------------+
  # noqa
+    #
+    # Note: DDL formatted string is used for 'SQL Type' for simplicity. This 
string can be
+    #       used in `returnType`.
+    # Note: The values inside of the table are generated by `repr`.
+    # Note: Python 2 is used to generate this table since it is used to check 
the backward
+    #       compatibility often in practice.
+    # Note: Pandas 0.19.2 and PyArrow 0.9.0 are used.
+    # Note: Timezone is Singapore timezone.
+    # Note: 'X' means it throws an exception during the conversion.
+    # Note: 'binary' type is only supported with PyArrow 0.10.0+ (SPARK-23555).
+
     # decorator @pandas_udf(returnType, functionType)
     is_decorator = f is None or isinstance(f, (str, DataType))
 


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

spark git commit: [SPARK-25798][PYTHON] Internally document type conversion between Pandas data and SQL types in Pandas UDFs

Reply via email to