Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/20507#discussion_r166018470
--- Diff: python/pyspark/serializers.py ---
@@ -230,6 +230,9 @@ def create_array(s, t):
                s = _check_series_convert_timestamps_internal(s.fillna(0), timezone)
                # TODO: need cast after Arrow conversion, ns values cause error with pandas 0.19.2
                return pa.Array.from_pandas(s, mask=mask).cast(t, safe=False)
+            elif t is not None and pa.types.is_string(t) and sys.version < '3':
+                # TODO: need decode before converting to Arrow in Python 2
+                return pa.Array.from_pandas(s.str.decode('utf-8'), mask=mask, type=t)
--- End diff --
@ueshin, actually, how about `s.apply(lambda v: v.decode("utf-8") if
isinstance(v, str) else v)` to allow non-ASCII unicode values like `u"ì"`
too? I was worried about performance, so I ran a simple perf test against
`s.str.decode('utf-8')` to be sure. It seems actually fine.
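
To illustrate the idea (a minimal Python 3 sketch, since the PR targets
Python 2 where plain `str` is a byte string): the point of the
`isinstance` check is to decode only raw byte strings and pass
already-decoded unicode values through untouched. The helper name below
is hypothetical, not from the PR:

```python
def decode_if_bytes(v):
    # Decode only raw byte strings; values that are already unicode
    # (e.g. u"ì") are passed through unchanged, which a blanket
    # str.decode over the whole series would not do.
    return v.decode("utf-8") if isinstance(v, bytes) else v

# Mixed series of byte strings and unicode, as can occur in Python 2:
values = [b"ascii", u"ì", b"caf\xc3\xa9"]
decoded = [decode_if_bytes(v) for v in values]
# decoded == ["ascii", "ì", "café"]
```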
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]