[GitHub] [spark] BryanCutler commented on a change in pull request #24095: [SPARK-27163][PYTHON] Cleanup and consolidate Pandas UDF functionality

GitBox Wed, 20 Mar 2019 13:48:22 -0700

BryanCutler commented on a change in pull request #24095: [SPARK-27163][PYTHON] Cleanup and consolidate Pandas UDF functionality
URL: https://github.com/apache/spark/pull/24095#discussion_r267539397


 ##########
 File path: python/pyspark/sql/session.py
 ##########
 @@ -530,15 +530,29 @@ def _create_from_pandas_with_arrow(self, pdf, schema, timezone):
         to Arrow data, then sending to the JVM to parallelize. If a schema is passed in, the
         data types will be used to coerce the data in Pandas to Arrow conversion.
         """
-        from pyspark.serializers import ArrowStreamSerializer, _create_batch
-        from pyspark.sql.types import from_arrow_schema, to_arrow_type, TimestampType
+        from distutils.version import LooseVersion
+        from pyspark.serializers import ArrowStreamPandasSerializer
+        from pyspark.sql.types import from_arrow_type, to_arrow_type, TimestampType
         from pyspark.sql.utils import require_minimum_pandas_version, \
             require_minimum_pyarrow_version
 
         require_minimum_pandas_version()
         require_minimum_pyarrow_version()
 
         from pandas.api.types import is_datetime64_dtype, is_datetime64tz_dtype
+        import pyarrow as pa
+
+        # Create the Spark schema from list of names passed in with Arrow types
+        if isinstance(schema, (list, tuple)):
+            if LooseVersion(pa.__version__) < LooseVersion("0.12.0"):
+                temp_batch = pa.RecordBatch.from_pandas(pdf[0:100], preserve_index=False)
 
 Review comment:
   I'm not too thrilled with creating a record batch just to get the Arrow
schema, but this was the most reliable way I could find to do it before
pyarrow v0.12.0. I will propose bumping the minimum pyarrow version soon, and
then this workaround could be removed.
