[ https://issues.apache.org/jira/browse/ARROW-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16308324#comment-16308324 ]
ASF GitHub Bot commented on ARROW-1941:
---------------------------------------
wesm closed pull request #1449: ARROW-1941: [Python] Fix empty list roundtrip
in to_pandas
URL: https://github.com/apache/arrow/pull/1449
This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:
diff --git a/cpp/src/arrow/python/arrow_to_pandas.cc b/cpp/src/arrow/python/arrow_to_pandas.cc
index 08ce37cda..e21bbda05 100644
--- a/cpp/src/arrow/python/arrow_to_pandas.cc
+++ b/cpp/src/arrow/python/arrow_to_pandas.cc
@@ -90,6 +90,7 @@ struct WrapBytes<FixedSizeBinaryArray> {
 
 static inline bool ListTypeSupported(const DataType& type) {
   switch (type.id()) {
+    case Type::NA:
     case Type::UINT8:
     case Type::INT8:
     case Type::UINT16:
@@ -695,6 +696,7 @@ class ObjectBlock : public PandasBlock {
     } else if (type == Type::LIST) {
       auto list_type = std::static_pointer_cast<ListType>(col->type());
       switch (list_type->value_type()->id()) {
+        CONVERTLISTSLIKE_CASE(FloatType, NA)
         CONVERTLISTSLIKE_CASE(UInt8Type, UINT8)
         CONVERTLISTSLIKE_CASE(Int8Type, INT8)
         CONVERTLISTSLIKE_CASE(UInt16Type, UINT16)
diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py
index 7609d3488..76b55cf90 100644
--- a/python/pyarrow/tests/test_convert_pandas.py
+++ b/python/pyarrow/tests/test_convert_pandas.py
@@ -1317,6 +1317,18 @@ def test_table_column_subset_metadata(self):
         result = table_subset2.to_pandas()
         tm.assert_frame_equal(result, df[['a']].reset_index(drop=True))
 
+    def test_empty_list_roundtrip(self):
+        empty_list_array = np.empty((3,), dtype=object)
+        empty_list_array.fill([])
+
+        df = pd.DataFrame({'a': np.array(['1', '2', '3']),
+                           'b': empty_list_array})
+        tbl = pa.Table.from_pandas(df)
+
+        result = tbl.to_pandas()
+
+        tm.assert_frame_equal(result, df)
+
 
 def _fully_loaded_dataframe_example():
     from distutils.version import LooseVersion
----------------------------------------------------------------
> Table <-> DataFrame roundtrip failing
> -------------------------------------
>
> Key: ARROW-1941
> URL: https://issues.apache.org/jira/browse/ARROW-1941
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.8.0
> Reporter: Thomas Buhrmann
> Assignee: Phillip Cloud
> Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Although it is possible to create an Arrow table with a column containing
> only empty lists (cast to a particular type, e.g. string), the original type
> is lost in a roundtrip through pandas, and subsequent attempts to convert to
> pandas then fail.
> To reproduce in PyArrow 0.8.0:
> {code}
> import pyarrow as pa
> # Create table with array of empty lists, forced to have type list(string)
> arrays = {
>     'c1': pa.array([["test"], ["a", "b"], None], type=pa.list_(pa.string())),
>     'c2': pa.array([[], [], []], type=pa.list_(pa.string())),
> }
> rb = pa.RecordBatch.from_arrays(list(arrays.values()), list(arrays.keys()))
> tbl = pa.Table.from_batches([rb])
> print("Schema 1 (correct):\n{}".format(tbl.schema))
> # First roundtrip changes schema
> df = tbl.to_pandas()
> tbl2 = pa.Table.from_pandas(df)
> print("\nSchema 2 (wrong):\n{}".format(tbl2.schema))
> # Second roundtrip explodes
> df2 = tbl2.to_pandas()
> {code}
> This results in the following output:
> {code}
> Schema 1 (correct):
> c1: list<item: string>
> child 0, item: string
> c2: list<item: string>
> child 0, item: string
> Schema 2 (wrong):
> c1: list<item: string>
> child 0, item: string
> c2: list<item: null>
> child 0, item: null
> __index_level_0__: int64
> metadata
> --------
> {b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
> b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
> b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
> b' [{"name": "c1", "field_name": "c1", "pandas_type": "list[unicod'
> b'e]", "numpy_type": "object", "metadata": null}, {"name": "c2", "'
> b'field_name": "c2", "pandas_type": "list[float64]", "numpy_type":'
> b' "object", "metadata": null}, {"name": null, "field_name": "__in'
> b'dex_level_0__", "pandas_type": "int64", "numpy_type": "int64", "'
> b'metadata": null}], "pandas_version": "0.21.1"}'}
> ...
> > ArrowNotImplementedError: Not implemented type for list in DataFrameBlock: null
> {code}
> I.e., the array of empty lists of strings gets converted into an array of
> lists of type null, and in the pandas schema to lists of type float64.
> If one changes the empty lists to values of None in the creation of the
> record batches, the roundtrip doesn't explode, but it will silently convert
> the column to a simple column of type float (i.e. I lose the list type) in
> pandas. This doesn't help, since other batches from the same source might
> have non-empty lists and would end up with a different inferred schema, and
> so can't be concatenated into a single table.
> (If this attempt at a double roundtrip seems weird, in my use case I receive
> data from a server in RecordBatches, which I convert to pandas for
> manipulation. I then serialize this data to disk using Arrow, and later need
> to read it back into pandas again for further manipulation. So I need to be
> able to go through various rounds of table->df->table->df->table etc., where
> at any time a record batch may have columns that contain only empty lists).