Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/18664#discussion_r146641097
--- Diff: python/pyspark/serializers.py ---
@@ -224,7 +225,13 @@ def _create_batch(series):
# If a nullable integer series has been promoted to floating point with NaNs, need to cast
# NOTE: this is not necessary with Arrow >= 0.7
def cast_series(s, t):
- if t is None or s.dtype == t.to_pandas_dtype():
+ if type(t) == pa.TimestampType:
+ # NOTE: convert to 'us' with astype here, unit ignored in `from_pandas` see ARROW-1680
+ return _series_convert_timestamps_internal(s).values.astype('datetime64[us]')
--- End diff --
Why is that? We did that for integers that were promoted to floats, to get
rid of the NaNs, but here we are converting datetime64[ns] to datetime64[us],
and both support missing values:
```
In [28]: s = pd.Series([pd.datetime.now(), None])
In [29]: s
Out[29]:
0 2017-10-24 10:44:51.483694
1 NaT
dtype: datetime64[ns]
In [33]: s.values.astype('datetime64[us]')
Out[33]: array(['2017-10-24T10:44:51.483694', 'NaT'],
dtype='datetime64[us]')
```
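For contrast, here is a minimal sketch of the two cases side by side, assuming a reasonably recent pandas and NumPy. The variable names are only illustrative and are not part of the PR:

```
import numpy as np
import pandas as pd

# Integer data with a missing value: pandas promotes the series to
# float64 and represents the missing entry as NaN.
ints = pd.Series([1, None])
int_dtype = ints.dtype

# Timestamp data with a missing value: the series stays datetime64[ns]
# and the missing entry is NaT.
ts = pd.Series([pd.Timestamp("2017-10-24 10:44:51.483694"), None])

# Changing the unit to microseconds keeps the missing entry as NaT.
ts_us = ts.values.astype("datetime64[us]")
nat_preserved = bool(np.isnat(ts_us[1]))

print(int_dtype, ts_us.dtype, nat_preserved)
```

So the integer case needs special handling because the dtype itself changes, while in the timestamp case only the unit changes.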