Github user BryanCutler commented on the issue:
https://github.com/apache/spark/pull/19459
After incorporating date and timestamp types for this, I had to refactor a
little to use `_create_batch` from serializers so that Arrow batches are built
from Columns even when the user doesn't specify a schema, which makes it
possible to apply the casts for these types. Based on the initial benchmark,
this doesn't seem to affect performance.
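For context, the cast in question is roughly the idea sketched below: a pandas
`datetime64[ns]` column converted to an Arrow timestamp column at microsecond
precision, which is what Spark's TimestampType uses. This is only an
illustration with plain pyarrow; the actual conversion lives in
`_create_batch`, and the column name and precision here are just for the
example.
```python
import pandas as pd
import pyarrow as pa
from datetime import datetime

# A pandas column of datetime64[ns] values, as createDataFrame would receive it.
s = pd.Series([datetime(2017, 10, 31, 1, 1, 1)], name="ts")

# Build an Arrow array and cast it from nanosecond to microsecond precision,
# matching Spark's TimestampType. safe=False allows the truncating cast.
arr = pa.Array.from_pandas(s).cast(pa.timestamp("us"), safe=False)
batch = pa.RecordBatch.from_arrays([arr], ["ts"])

print(batch.schema)  # ts: timestamp[us]
```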
I came across an issue when using a pandas DataFrame with timestamps without
Arrow: Spark reads the values as longs rather than datetimes, so a test for
this currently fails
```
In [1]: spark.conf.set("spark.sql.execution.arrow.enabled", "false")

In [2]: import pandas as pd
   ...: from datetime import datetime
   ...:

In [3]: pdf = pd.DataFrame({"ts": [datetime(2017, 10, 31, 1, 1, 1)]})

In [4]: df = spark.createDataFrame(pdf)

In [5]: df.show()
+-------------------+
|                 ts|
+-------------------+
|1509411661000000000|
+-------------------+

In [6]: df.schema
Out[6]: StructType(List(StructField(ts,LongType,true)))

In [7]: pdf
Out[7]:
                   ts
0 2017-10-31 01:01:01

In [9]: pdf.dtypes
Out[9]:
ts    datetime64[ns]
dtype: object
```
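For reference, a minimal sketch of the kind of round-trip test that currently
fails when Arrow is disabled (the helper name and assertion are illustrative,
not the PR's actual test):
```python
import pandas as pd
from datetime import datetime

def check_timestamp_roundtrip(spark):
    # With Arrow disabled, the collected value comes back as a long
    # (nanoseconds since epoch) instead of a datetime, so this assertion fails.
    spark.conf.set("spark.sql.execution.arrow.enabled", "false")
    pdf = pd.DataFrame({"ts": [datetime(2017, 10, 31, 1, 1, 1)]})
    df = spark.createDataFrame(pdf)
    [row] = df.collect()
    assert row.ts == pdf["ts"][0], "expected a datetime, got %r" % (row.ts,)
```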
@HyukjinKwon or @ueshin, could you confirm that you see the same? And do you
consider this a bug?