Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20507#discussion_r166018470
  
    --- Diff: python/pyspark/serializers.py ---
    @@ -230,6 +230,9 @@ def create_array(s, t):
                 s = _check_series_convert_timestamps_internal(s.fillna(0), timezone)
                 # TODO: need cast after Arrow conversion, ns values cause error with pandas 0.19.2
                 return pa.Array.from_pandas(s, mask=mask).cast(t, safe=False)
    +        elif t is not None and pa.types.is_string(t) and sys.version < '3':
    +            # TODO: need decode before converting to Arrow in Python 2
    +            return pa.Array.from_pandas(s.str.decode('utf-8'), mask=mask, type=t)
    --- End diff ---
    
    @ueshin, actually, how about `s.apply(lambda v: v.decode("utf-8") if 
isinstance(v, str) else v)` so that non-ASCII unicode values like `u"아"` pass 
through too? I was worried about performance, but I ran a simple perf test 
against `s.str.decode('utf-8')` to be sure, and it seems fine.
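    For reference, a minimal Python 3 sketch of the issue (using `bytes` to 
stand in for Python 2 `str`, since Python 2 `str` is a byte string): with a 
mixed Series, the `apply`-based variant decodes only the byte-string elements 
and passes already-decoded text through unchanged, which is the behavior 
proposed above.

    ```python
    import pandas as pd

    # A Series mixing raw byte strings (Python 2 `str`) and already-decoded
    # text (Python 2 `unicode`); b"\xec\x95\x84" is UTF-8 for u"아".
    s = pd.Series([b"abc", u"\uc544", b"\xec\x95\x84"])

    # Decode only the bytes elements, leaving text elements as-is
    # (isinstance check mirrors the `isinstance(v, str)` test in Python 2).
    decoded = s.apply(lambda v: v.decode("utf-8") if isinstance(v, bytes) else v)

    print(decoded.tolist())  # ['abc', '아', '아']
    ```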

