ghiggi commented on issue #35802:
URL: https://github.com/apache/arrow/issues/35802#issuecomment-3927466464
I think this can be closed. Just for future me or others:
```python
import pandas as pd
import pyarrow as pa
import dask.dataframe as dd
# Create example table
table = pa.table({'a': [1, 2, 3], 'b': ["a", "b", "c"]})
# This raises ArrowInvalid: Cannot do zero copy conversion into
# multi-column DataFrame block
df = table.to_pandas(types_mapper=pd.ArrowDtype, zero_copy_only=True)
# Setting split_blocks=True allows converting to pandas with zero copy
df = table.to_pandas(types_mapper=pd.ArrowDtype, zero_copy_only=True,
                     split_blocks=True)
# If working with a dask dataframe ...
# - Let's write a parquet file
filepath = "/tmp/example.parquet"
df.to_parquet(filepath)
# This fails with ArrowInvalid: Cannot do zero copy conversion into
# multi-column DataFrame block
arrow_to_pandas = {
    "types_mapper": pd.ArrowDtype,
    "zero_copy_only": True,
}
df = dd.read_parquet(filepath,
                     engine="pyarrow",
                     dtype_backend="pyarrow",
                     arrow_to_pandas=arrow_to_pandas)
# This works
arrow_to_pandas = {
    "types_mapper": pd.ArrowDtype,
    "zero_copy_only": False,
    "split_blocks": True,  # !!!
}
df = dd.read_parquet(filepath,
                     engine="pyarrow",
                     dtype_backend="pyarrow",
                     arrow_to_pandas=arrow_to_pandas)
df.compute()
```
These related issues also address the problem:
- https://github.com/apache/arrow/issues/38644
- https://github.com/apache/arrow/issues/39194