moskvax commented on pull request #28743:
URL: https://github.com/apache/spark/pull/28743#issuecomment-642700256


   > Thanks @moskvax , adding support for extension types would be great! I'm
   > not sure using `pa.infer_type` is the way to go though, I think it's better
   > to handle these cases explicitly by getting the `pa.ExtensionType` from
   > `pa.Schema.from_pandas` and then extracting the `storage_type` from there.
   > Would that be possible?
   
   The goal of this PR was to allow conversion of `__arrow_array__`-implementing 
arrays of `ExtensionDtype` values whose underlying type can be directly 
converted to primitive Arrow and Spark types, so I wasn't focusing on this case 
at first. However, I've looked into it today following the approach you 
described.
   
   The `storage_type` of `PeriodArray`'s `pa.ExtensionType` is `int64`, which 
can be converted to a Spark column using the `PeriodArray`'s `_ndarray_values`. 
However, without `PeriodDtype.freq`, the period information cannot be 
reconstructed, and the result in Spark is an arbitrary-looking sequence of 
integers:
   
   ```pycon
   >>> periods = pd.period_range('2020-01-01', freq='M', periods=6)
   >>> pdf = pd.DataFrame({'A': pd.Series(periods)})
   >>> pdf
            A
   0  2020-01
   1  2020-02
   2  2020-03
   3  2020-04
   4  2020-05
   5  2020-06
   >>> pdf.dtypes
   A    period[M]
   dtype: object
   >>> df = spark.createDataFrame(pdf)
   >>> df.show()
   +---+
   |  A|
   +---+
   |600|
   |601|
   |602|
   |603|
   |604|
   |605|
   +---+
   
   >>> df.schema
   StructType(List(StructField(A,LongType,true)))
   ```
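
   For reference, here is roughly how the extension type surfaces through 
`pa.Schema.from_pandas` (a sketch, assuming a pandas version that registers 
Arrow extension types for period data; the exact extension type class is 
internal to pandas):
   
   ```python
   import pandas as pd
   import pyarrow as pa

   pdf = pd.DataFrame(
       {'A': pd.Series(pd.period_range('2020-01-01', freq='M', periods=6))})

   # pa.Schema.from_pandas returns the extension type that pandas registers
   # for period data, rather than a plain primitive type
   field_type = pa.Schema.from_pandas(pdf).field('A').type
   assert isinstance(field_type, pa.ExtensionType)

   # The storage type is int64; the freq ('M' here) lives only on the
   # extension type, so converting via the storage type alone discards it
   print(field_type.storage_type)  # int64
   ```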
   
   `IntervalArray` has an Arrow extension type with a `storage_type` of 
`StructType(struct<left: timestamp[ns], right: timestamp[ns]>)`, which could be 
converted to a Spark `StructType` column if `StructType` conversion were 
supported by the Arrow conversion path. However, the `closed` information would 
still be missing from this schema.
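
   The same extraction for `IntervalArray` (again a sketch; here `closed` is 
assumed to be a property of the pandas-registered extension type, not of the 
storage type):
   
   ```python
   import pandas as pd
   import pyarrow as pa

   pdf = pd.DataFrame(
       {'B': pd.interval_range(pd.Timestamp('2020-01-01'), periods=3)})
   t = pa.Schema.from_pandas(pdf).field('B').type

   # struct<left: timestamp[ns], right: timestamp[ns]> -- convertible to a
   # Spark StructType column only if the Arrow path supported struct columns
   print(t.storage_type)

   # Whether the intervals are closed on the left/right is carried by the
   # extension type and would be dropped along with it
   print(t.closed)
   ```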
   
   So, in cases where conversion using the `storage_type` is possible, I think 
there should be a warning that the results may be unexpected, since any type 
metadata required to meaningfully interpret the values is discarded. 
Additionally, the round trip back to pandas won't be possible for these types.
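
   Something along these lines at the point where the extension type is 
unwrapped (a hypothetical sketch; the helper name, wording and placement are 
all up for discussion):
   
   ```python
   import warnings

   import pyarrow as pa

   def _unwrap_extension_type(field_type):
       # Hypothetical helper: warn when an Arrow extension type is reduced
       # to its storage type on the way to a Spark type
       if isinstance(field_type, pa.ExtensionType):
           warnings.warn(
               "Converting a column of extension type %s using its storage "
               "type %s; type metadata will be discarded and the resulting "
               "values may not be meaningful."
               % (field_type, field_type.storage_type), UserWarning)
           return field_type.storage_type
       return field_type
   ```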
   
   As for `pa.Schema.from_pandas`, it is most useful over `pa.infer_type` for 
the purposes of Spark conversion when the array being processed implements 
`__arrow_array__` and can thus immediately and unambiguously return its own 
Arrow type. I've updated the PR to first try `__arrow_array__` to determine a 
type, falling back on `pa.infer_type` otherwise (sketched below). What do you 
think of this approach?
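
   In sketch form, the updated type determination is roughly the following 
(simplified to show the logic, not the exact diff):
   
   ```python
   import pyarrow as pa

   def _infer_arrow_type(values):
       # An array implementing __arrow_array__ can report its own Arrow type
       # directly, which avoids ambiguity in element-based inference
       if hasattr(values, '__arrow_array__'):
           return values.__arrow_array__().type
       # Otherwise fall back on pyarrow's inference over the elements
       return pa.infer_type(values)
   ```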

