Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18664#discussion_r146419125
  
    --- Diff: python/pyspark/serializers.py ---
    @@ -224,7 +225,13 @@ def _create_batch(series):
         # If a nullable integer series has been promoted to floating point with NaNs, need to cast
         # NOTE: this is not necessary with Arrow >= 0.7
         def cast_series(s, t):
    -        if t is None or s.dtype == t.to_pandas_dtype():
    +        if type(t) == pa.TimestampType:
    +            # NOTE: convert to 'us' with astype here, unit ignored in `from_pandas` see ARROW-1680
    +            return _series_convert_timestamps_internal(s).values.astype('datetime64[us]')
    +        elif t == pa.date32():
    +            # TODO: ValueError: Cannot cast DatetimeIndex to dtype datetime64[D]
    --- End diff --
    
    I came across an issue when using a `pandas_udf` that returns a `DateType`.  
Dates are passed into the udf as a Series with dtype `datetime64[ns]`, and 
using that Series in `pa.Array.from_pandas` with `type=pa.date32()` fails with 
an error.  Calling `series.dt.values.astype('datetime64[D]')` also raises an 
error.  Without specifying the type, pyarrow reads the values as timestamps.  
I filed https://issues.apache.org/jira/browse/ARROW-1718 to look into this; 
aside from this being fixed in a new version, any ideas for a workaround 
@ueshin and @HyukjinKwon ?
    cc @wesm 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]