[GitHub] spark pull request #20487: [SPARK-23319][TESTS] Explicitly specify Pandas an...

HyukjinKwon Wed, 07 Feb 2018 00:36:34 -0800

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20487#discussion_r166547502
  
    --- Diff: python/pyspark/sql/session.py ---
    @@ -646,6 +646,9 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr
             except Exception:
                 has_pandas = False
             if has_pandas and isinstance(data, pandas.DataFrame):
    +            from pyspark.sql.utils import require_minimum_pandas_version
    +            require_minimum_pandas_version()
    --- End diff --
    
    I don't think I know all the places exactly. For now, I can think of: 
`createDataFrame` with a Pandas DataFrame input, `toPandas` and `pandas_udf` 
for APIs, and some places in `session.py` / `types.py` for internal methods 
such as the `_check*` family, `*arrow*`, or `*pandas*`.
    
    I was thinking of working on putting those into a single module (file) 
after 2.3.0. Will cc you and @ueshin there.



---

