mgorny opened a new issue, #40153:
URL: https://github.com/apache/arrow/issues/40153
### Describe the bug, including details regarding any error messages, version, and platform.
When running the test suite on 32-bit x86, I'm getting the following test
failures:
```
FAILED tests/test_array.py::test_dictionary_to_numpy - TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe'
FAILED tests/test_io.py::test_python_file_large_seeks - assert 5 == ((2 ** 32) + 5)
FAILED tests/test_io.py::test_memory_map_large_seeks - OSError: Read out of bounds (offset = 4294967301, size = 5) in file of size 10
FAILED tests/test_pandas.py::TestConvertStructTypes::test_from_numpy_nested - AssertionError: assert 8 == 12
FAILED tests/test_schema.py::test_schema_sizeof - assert 28 > 30
FAILED tests/interchange/test_conversion.py::test_pandas_roundtrip_string - OverflowError: Python int too large to convert to C ssize_t
FAILED tests/interchange/test_conversion.py::test_pandas_roundtrip_large_string - OverflowError: Python int too large to convert to C ssize_t
FAILED tests/interchange/test_conversion.py::test_pandas_roundtrip_string_with_missing - OverflowError: Python int too large to convert to C ssize_t
FAILED tests/interchange/test_conversion.py::test_pandas_roundtrip_categorical - OverflowError: Python int too large to convert to C ssize_t
FAILED tests/interchange/test_conversion.py::test_empty_dataframe - OverflowError: Python int too large to convert to C ssize_t
```
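The first failure is consistent with NumPy's index handling on 32-bit: `ndarray.take` converts indices to the platform `intp` type (4 bytes here), and int64 dictionary indices cannot be cast down under the `'safe'` rule named in the error. A minimal, platform-independent sketch of the casting rule (illustrative only, not a reproduction of the test):

```python
import numpy as np

# int64 -> int32 is not a 'safe' cast, which is exactly the rule
# cited by the TypeError in test_dictionary_to_numpy.
assert not np.can_cast(np.int64, np.int32, casting="safe")

# The reverse direction widens, so it is safe.
assert np.can_cast(np.int32, np.int64, casting="safe")

# np.intp tracks the pointer width (4 bytes on 32-bit x86, 8 on
# 64-bit), which is why this only surfaces on 32-bit builds.
print(np.dtype(np.intp).itemsize)
```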
<details>
<summary>Tracebacks</summary>

```pytb
============================================================== FAILURES
===============================================================
______________________________________________________
test_dictionary_to_numpy _______________________________________________________
obj = array([13.7, 11. ]), method = 'take', args = (array([0, 1, 1, 0],
dtype=int64),)
kwds = {'axis': None, 'mode': 'raise', 'out': None}, bound = <built-in
method take of numpy.ndarray object at 0xeaca6ad0>
def _wrapfunc(obj, method, *args, **kwds):
bound = getattr(obj, method, None)
if bound is None:
return _wrapit(obj, method, *args, **kwds)
try:
> return bound(*args, **kwds)
E TypeError: Cannot cast array data from dtype('int64') to
dtype('int32') according to the rule 'safe'
args = (array([0, 1, 1, 0], dtype=int64),)
bound = <built-in method take of numpy.ndarray object at 0xeaca6ad0>
kwds = {'axis': None, 'mode': 'raise', 'out': None}
method = 'take'
obj = array([13.7, 11. ])
/usr/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: TypeError
During handling of the above exception, another exception occurred:
def test_dictionary_to_numpy():
expected = pa.array(
["foo", "bar", None, "foo"]
).to_numpy(zero_copy_only=False)
a = pa.DictionaryArray.from_arrays(
pa.array([0, 1, None, 0]),
pa.array(['foo', 'bar'])
)
np.testing.assert_array_equal(a.to_numpy(zero_copy_only=False),
expected)
with pytest.raises(pa.ArrowInvalid):
# If this would be changed to no longer raise in the future,
# ensure to test the actual result because, currently, to_numpy takes
# for granted that when zero_copy_only=True there will be no nulls
# (it's the decoding of the DictionaryArray that handles the nulls and
# this is only activated with zero_copy_only=False)
a.to_numpy(zero_copy_only=True)
anonulls = pa.DictionaryArray.from_arrays(
pa.array([0, 1, 1, 0]),
pa.array(['foo', 'bar'])
)
expected = pa.array(
["foo", "bar", "bar", "foo"]
).to_numpy(zero_copy_only=False)
np.testing.assert_array_equal(anonulls.to_numpy(zero_copy_only=False),
expected)
with pytest.raises(pa.ArrowInvalid):
anonulls.to_numpy(zero_copy_only=True)
afloat = pa.DictionaryArray.from_arrays(
pa.array([0, 1, 1, 0]),
pa.array([13.7, 11.0])
)
expected = pa.array([13.7, 11.0, 11.0, 13.7]).to_numpy()
> np.testing.assert_array_equal(afloat.to_numpy(zero_copy_only=True),
expected)
a = <pyarrow.lib.DictionaryArray object at 0xeafe6ed0>
-- dictionary:
[
"foo",
"bar"
]
-- indices:
[
0,
1,
null,
0
]
afloat = <pyarrow.lib.DictionaryArray object at 0xeafe6fb0>
-- dictionary:
[
13.7,
11
]
-- indices:
[
0,
1,
1,
0
]
anonulls = <pyarrow.lib.DictionaryArray object at 0xeafe6e60>
-- dictionary:
[
"foo",
"bar"
]
-- indices:
[
0,
1,
1,
0
]
expected = array([13.7, 11. , 11. , 13.7])
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/test_array.py:823:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyarrow/array.pxi:1590: in pyarrow.lib.Array.to_numpy
???
/usr/lib/python3.11/site-packages/numpy/core/fromnumeric.py:192: in take
return _wrapfunc(a, 'take', indices, axis=axis, out=out, mode=mode)
a = array([13.7, 11. ])
axis = None
indices = array([0, 1, 1, 0], dtype=int64)
mode = 'raise'
out = None
/usr/lib/python3.11/site-packages/numpy/core/fromnumeric.py:68: in _wrapfunc
return _wrapit(obj, method, *args, **kwds)
args = (array([0, 1, 1, 0], dtype=int64),)
bound = <built-in method take of numpy.ndarray object at
0xeaca6ad0>
kwds = {'axis': None, 'mode': 'raise', 'out': None}
method = 'take'
obj = array([13.7, 11. ])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
obj = array([13.7, 11. ]), method = 'take', args = (array([0, 1, 1, 0],
dtype=int64),)
kwds = {'axis': None, 'mode': 'raise', 'out': None}, wrap = <built-in method
__array_wrap__ of numpy.ndarray object at 0xeaca6ad0>
def _wrapit(obj, method, *args, **kwds):
try:
wrap = obj.__array_wrap__
except AttributeError:
wrap = None
> result = getattr(asarray(obj), method)(*args, **kwds)
E TypeError: Cannot cast array data from dtype('int64') to
dtype('int32') according to the rule 'safe'
args = (array([0, 1, 1, 0], dtype=int64),)
kwds = {'axis': None, 'mode': 'raise', 'out': None}
method = 'take'
obj = array([13.7, 11. ])
wrap = <built-in method __array_wrap__ of numpy.ndarray object at
0xeaca6ad0>
/usr/lib/python3.11/site-packages/numpy/core/fromnumeric.py:45: TypeError
____________________________________________________
test_python_file_large_seeks
_____________________________________________________
def test_python_file_large_seeks():
def factory(filename):
return pa.PythonFile(open(filename, 'rb'))
> check_large_seeks(factory)
factory = <function test_python_file_large_seeks.<locals>.factory at
0xe13b6de8>
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/test_io.py:262:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
file_factory = <function test_python_file_large_seeks.<locals>.factory at
0xe13b6de8>
def check_large_seeks(file_factory):
if sys.platform in ('win32', 'darwin'):
pytest.skip("need sparse file support")
try:
filename = tempfile.mktemp(prefix='test_io')
with open(filename, 'wb') as f:
f.truncate(2 ** 32 + 10)
f.seek(2 ** 32 + 5)
f.write(b'mark\n')
with file_factory(filename) as f:
> assert f.seek(2 ** 32 + 5) == 2 ** 32 + 5
E assert 5 == ((2 ** 32) + 5)
E + where 5 = <bound method NativeFile.seek of
<pyarrow.PythonFile closed=False own_file=False is_seekable=True
is_writable=False is_readable=True>>(((2 ** 32) + 5))
E + where <bound method NativeFile.seek of
<pyarrow.PythonFile closed=False own_file=False is_seekable=True
is_writable=False is_readable=True>> = <pyarrow.PythonFile closed=False
own_file=False is_seekable=True is_writable=False is_readable=True>.seek
f = <pyarrow.PythonFile closed=True own_file=False is_seekable=True
is_writable=False is_readable=True>
file_factory = <function test_python_file_large_seeks.<locals>.factory at
0xe13b6de8>
filename =
'/var/tmp/portage/dev-python/pyarrow-15.0.0/temp/test_ioj_p6zuld'
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/test_io.py:49:
AssertionError
_____________________________________________________
test_memory_map_large_seeks
_____________________________________________________
def test_memory_map_large_seeks():
> check_large_seeks(pa.memory_map)
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/test_io.py:1140:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/test_io.py:51:
in check_large_seeks
assert f.read(5) == b'mark\n'
f = <pyarrow.MemoryMappedFile closed=True own_file=False
is_seekable=True is_writable=False is_readable=True>
file_factory = <cyfunction memory_map at 0xf228e778>
filename =
'/var/tmp/portage/dev-python/pyarrow-15.0.0/temp/test_iozl2wxbou'
pyarrow/io.pxi:409: in pyarrow.lib.NativeFile.read
???
pyarrow/error.pxi:154: in pyarrow.lib.pyarrow_internal_check_status
???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E OSError: Read out of bounds (offset = 4294967301, size = 5) in file of
size 10
pyarrow/error.pxi:91: OSError
____________________________________________
TestConvertStructTypes.test_from_numpy_nested
____________________________________________
self = <pyarrow.tests.test_pandas.TestConvertStructTypes object at
0xeb535d90>
def test_from_numpy_nested(self):
# Note: an object field inside a struct
dt = np.dtype([('x', np.dtype([('xx', np.int8),
('yy', np.bool_)])),
('y', np.int16),
('z', np.object_)])
# Note: itemsize is not a multiple of sizeof(object)
> assert dt.itemsize == 12
E AssertionError: assert 8 == 12
E + where 8 = dtype([('x', [('xx', 'i1'), ('yy', '?')]), ('y',
'<i2'), ('z', 'O')]).itemsize
dt = dtype([('x', [('xx', 'i1'), ('yy', '?')]), ('y', '<i2'), ('z',
'O')])
self = <pyarrow.tests.test_pandas.TestConvertStructTypes object at
0xeb535d90>
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/test_pandas.py:2604:
AssertionError
_________________________________________________________ test_schema_sizeof
__________________________________________________________
def test_schema_sizeof():
schema = pa.schema([
pa.field('foo', pa.int32()),
pa.field('bar', pa.string()),
])
> assert sys.getsizeof(schema) > 30
E assert 28 > 30
E + where 28 = <built-in function getsizeof>(foo: int32\nbar: string)
E + where <built-in function getsizeof> = sys.getsizeof
schema = foo: int32
bar: string
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/test_schema.py:684:
AssertionError
____________________________________________________
test_pandas_roundtrip_string
_____________________________________________________
@pytest.mark.pandas
def test_pandas_roundtrip_string():
# See https://github.com/pandas-dev/pandas/issues/50554
if Version(pd.__version__) < Version("1.6"):
pytest.skip("Column.size() bug in pandas")
arr = ["a", "", "c"]
table = pa.table({"a": pa.array(arr)})
from pandas.api.interchange import (
from_dataframe as pandas_from_dataframe
)
pandas_df = pandas_from_dataframe(table)
> result = pi.from_dataframe(pandas_df)
arr = ['a', '', 'c']
pandas_df = a
0 a
1
2 c
pandas_from_dataframe = <function from_dataframe at 0xebbaa398>
table = pyarrow.Table
a: string
----
a: [["a","","c"]]
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/interchange/test_conversion.py:159:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:113:
in from_dataframe
return _from_dataframe(df.__dataframe__(allow_copy=allow_copy),
allow_copy = True
df = a
0 a
1
2 c
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:136:
in _from_dataframe
batch = protocol_df_chunk_to_pyarrow(chunk, allow_copy)
allow_copy = True
batches = []
chunk = <pandas.core.interchange.dataframe.PandasDataFrameXchg
object at 0xda1c61f0>
df = <pandas.core.interchange.dataframe.PandasDataFrameXchg
object at 0xda1c61f0>
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:182:
in protocol_df_chunk_to_pyarrow
columns[name] = column_to_array(col, allow_copy)
allow_copy = True
col = <pandas.core.interchange.column.PandasColumn object at
0xda1c65b0>
columns = {}
df = <pandas.core.interchange.dataframe.PandasDataFrameXchg
object at 0xda1c61f0>
dtype = <DtypeKind.STRING: 21>
name = 'a'
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:214:
in column_to_array
data = buffers_to_array(buffers, data_type,
allow_copy = True
buffers = {'data': (PandasBuffer({'bufsize': 2, 'ptr':
3879523528, 'device': 'CPU'}),
(<DtypeKind.STRING: 21>, 8, 'u', '=')),
'offsets': (PandasBuffer({'bufsize': 32, 'ptr': 1530035680, 'device':
'CPU'}),
(<DtypeKind.INT: 0>, 64, 'l', '=')),
'validity': (PandasBuffer({'bufsize': 3, 'ptr': 1529980112, 'device':
'CPU'}),
(<DtypeKind.BOOL: 20>, 8, 'b', '='))}
col = <pandas.core.interchange.column.PandasColumn object at
0xda1c65b0>
data_type = (<DtypeKind.STRING: 21>, 8, 'u', '=')
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:396:
in buffers_to_array
data_pa_buffer = pa.foreign_buffer(data_buff.ptr, data_buff.bufsize,
_ = (<DtypeKind.STRING: 21>, 8, 'u', '=')
allow_copy = True
buffers = {'data': (PandasBuffer({'bufsize': 2, 'ptr':
3879523528, 'device': 'CPU'}),
(<DtypeKind.STRING: 21>, 8, 'u', '=')),
'offsets': (PandasBuffer({'bufsize': 32, 'ptr': 1530035680, 'device':
'CPU'}),
(<DtypeKind.INT: 0>, 64, 'l', '=')),
'validity': (PandasBuffer({'bufsize': 3, 'ptr': 1529980112, 'device':
'CPU'}),
(<DtypeKind.BOOL: 20>, 8, 'b', '='))}
data_buff = PandasBuffer({'bufsize': 2, 'ptr': 3879523528,
'device': 'CPU'})
data_type = (<DtypeKind.STRING: 21>, 8, 'u', '=')
describe_null = (<ColumnNullType.USE_BYTEMASK: 4>, 0)
length = 3
offset = 0
offset_buff = PandasBuffer({'bufsize': 32, 'ptr': 1530035680,
'device': 'CPU'})
offset_dtype = (<DtypeKind.INT: 0>, 64, 'l', '=')
validity_buff = PandasBuffer({'bufsize': 3, 'ptr': 1529980112,
'device': 'CPU'})
validity_dtype = (<DtypeKind.BOOL: 20>, 8, 'b', '=')
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E OverflowError: Python int too large to convert to C ssize_t
pyarrow/io.pxi:1990: OverflowError
_________________________________________________
test_pandas_roundtrip_large_string
__________________________________________________
@pytest.mark.pandas
def test_pandas_roundtrip_large_string():
# See https://github.com/pandas-dev/pandas/issues/50554
if Version(pd.__version__) < Version("1.6"):
pytest.skip("Column.size() bug in pandas")
arr = ["a", "", "c"]
table = pa.table({"a_large": pa.array(arr, type=pa.large_string())})
from pandas.api.interchange import (
from_dataframe as pandas_from_dataframe
)
if Version(pd.__version__) >= Version("2.0.1"):
pandas_df = pandas_from_dataframe(table)
> result = pi.from_dataframe(pandas_df)
arr = ['a', '', 'c']
pandas_df = a_large
0 a
1
2 c
pandas_from_dataframe = <function from_dataframe at 0xebbaa398>
table = pyarrow.Table
a_large: large_string
----
a_large: [["a","","c"]]
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/interchange/test_conversion.py:189:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:113:
in from_dataframe
return _from_dataframe(df.__dataframe__(allow_copy=allow_copy),
allow_copy = True
df = a_large
0 a
1
2 c
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:136:
in _from_dataframe
batch = protocol_df_chunk_to_pyarrow(chunk, allow_copy)
allow_copy = True
batches = []
chunk = <pandas.core.interchange.dataframe.PandasDataFrameXchg
object at 0xda103a10>
df = <pandas.core.interchange.dataframe.PandasDataFrameXchg
object at 0xda103a10>
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:182:
in protocol_df_chunk_to_pyarrow
columns[name] = column_to_array(col, allow_copy)
allow_copy = True
col = <pandas.core.interchange.column.PandasColumn object at
0xda1033d0>
columns = {}
df = <pandas.core.interchange.dataframe.PandasDataFrameXchg
object at 0xda103a10>
dtype = <DtypeKind.STRING: 21>
name = 'a_large'
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:214:
in column_to_array
data = buffers_to_array(buffers, data_type,
allow_copy = True
buffers = {'data': (PandasBuffer({'bufsize': 2, 'ptr':
3879522800, 'device': 'CPU'}),
(<DtypeKind.STRING: 21>, 8, 'u', '=')),
'offsets': (PandasBuffer({'bufsize': 32, 'ptr': 1480303312, 'device':
'CPU'}),
(<DtypeKind.INT: 0>, 64, 'l', '=')),
'validity': (PandasBuffer({'bufsize': 3, 'ptr': 1478277616, 'device':
'CPU'}),
(<DtypeKind.BOOL: 20>, 8, 'b', '='))}
col = <pandas.core.interchange.column.PandasColumn object at
0xda1033d0>
data_type = (<DtypeKind.STRING: 21>, 8, 'u', '=')
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:396:
in buffers_to_array
data_pa_buffer = pa.foreign_buffer(data_buff.ptr, data_buff.bufsize,
_ = (<DtypeKind.STRING: 21>, 8, 'u', '=')
allow_copy = True
buffers = {'data': (PandasBuffer({'bufsize': 2, 'ptr':
3879522800, 'device': 'CPU'}),
(<DtypeKind.STRING: 21>, 8, 'u', '=')),
'offsets': (PandasBuffer({'bufsize': 32, 'ptr': 1480303312, 'device':
'CPU'}),
(<DtypeKind.INT: 0>, 64, 'l', '=')),
'validity': (PandasBuffer({'bufsize': 3, 'ptr': 1478277616, 'device':
'CPU'}),
(<DtypeKind.BOOL: 20>, 8, 'b', '='))}
data_buff = PandasBuffer({'bufsize': 2, 'ptr': 3879522800,
'device': 'CPU'})
data_type = (<DtypeKind.STRING: 21>, 8, 'u', '=')
describe_null = (<ColumnNullType.USE_BYTEMASK: 4>, 0)
length = 3
offset = 0
offset_buff = PandasBuffer({'bufsize': 32, 'ptr': 1480303312,
'device': 'CPU'})
offset_dtype = (<DtypeKind.INT: 0>, 64, 'l', '=')
validity_buff = PandasBuffer({'bufsize': 3, 'ptr': 1478277616,
'device': 'CPU'})
validity_dtype = (<DtypeKind.BOOL: 20>, 8, 'b', '=')
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E OverflowError: Python int too large to convert to C ssize_t
pyarrow/io.pxi:1990: OverflowError
______________________________________________
test_pandas_roundtrip_string_with_missing
______________________________________________
@pytest.mark.pandas
def test_pandas_roundtrip_string_with_missing():
# See https://github.com/pandas-dev/pandas/issues/50554
if Version(pd.__version__) < Version("1.6"):
pytest.skip("Column.size() bug in pandas")
arr = ["a", "", "c", None]
table = pa.table({"a": pa.array(arr),
"a_large": pa.array(arr, type=pa.large_string())})
from pandas.api.interchange import (
from_dataframe as pandas_from_dataframe
)
if Version(pd.__version__) >= Version("2.0.2"):
pandas_df = pandas_from_dataframe(table)
> result = pi.from_dataframe(pandas_df)
arr = ['a', '', 'c', None]
pandas_df = a a_large
0 a a
1
2 c c
3 NaN NaN
pandas_from_dataframe = <function from_dataframe at 0xebbaa398>
table = pyarrow.Table
a: string
a_large: large_string
----
a: [["a","","c",null]]
a_large: [["a","","c",null]]
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/interchange/test_conversion.py:227:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:113:
in from_dataframe
return _from_dataframe(df.__dataframe__(allow_copy=allow_copy),
allow_copy = True
df = a a_large
0 a a
1
2 c c
3 NaN NaN
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:136:
in _from_dataframe
batch = protocol_df_chunk_to_pyarrow(chunk, allow_copy)
allow_copy = True
batches = []
chunk = <pandas.core.interchange.dataframe.PandasDataFrameXchg
object at 0xda15b850>
df = <pandas.core.interchange.dataframe.PandasDataFrameXchg
object at 0xda15b850>
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:182:
in protocol_df_chunk_to_pyarrow
columns[name] = column_to_array(col, allow_copy)
allow_copy = True
col = <pandas.core.interchange.column.PandasColumn object at
0xda103210>
columns = {}
df = <pandas.core.interchange.dataframe.PandasDataFrameXchg
object at 0xda15b850>
dtype = <DtypeKind.STRING: 21>
name = 'a'
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:214:
in column_to_array
data = buffers_to_array(buffers, data_type,
allow_copy = True
buffers = {'data': (PandasBuffer({'bufsize': 2, 'ptr':
3879523744, 'device': 'CPU'}),
(<DtypeKind.STRING: 21>, 8, 'u', '=')),
'offsets': (PandasBuffer({'bufsize': 40, 'ptr': 1469510752, 'device':
'CPU'}),
(<DtypeKind.INT: 0>, 64, 'l', '=')),
'validity': (PandasBuffer({'bufsize': 4, 'ptr': 1475420176, 'device':
'CPU'}),
(<DtypeKind.BOOL: 20>, 8, 'b', '='))}
col = <pandas.core.interchange.column.PandasColumn object at
0xda103210>
data_type = (<DtypeKind.STRING: 21>, 8, 'u', '=')
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:396:
in buffers_to_array
data_pa_buffer = pa.foreign_buffer(data_buff.ptr, data_buff.bufsize,
_ = (<DtypeKind.STRING: 21>, 8, 'u', '=')
allow_copy = True
buffers = {'data': (PandasBuffer({'bufsize': 2, 'ptr':
3879523744, 'device': 'CPU'}),
(<DtypeKind.STRING: 21>, 8, 'u', '=')),
'offsets': (PandasBuffer({'bufsize': 40, 'ptr': 1469510752, 'device':
'CPU'}),
(<DtypeKind.INT: 0>, 64, 'l', '=')),
'validity': (PandasBuffer({'bufsize': 4, 'ptr': 1475420176, 'device':
'CPU'}),
(<DtypeKind.BOOL: 20>, 8, 'b', '='))}
data_buff = PandasBuffer({'bufsize': 2, 'ptr': 3879523744,
'device': 'CPU'})
data_type = (<DtypeKind.STRING: 21>, 8, 'u', '=')
describe_null = (<ColumnNullType.USE_BYTEMASK: 4>, 0)
length = 4
offset = 0
offset_buff = PandasBuffer({'bufsize': 40, 'ptr': 1469510752,
'device': 'CPU'})
offset_dtype = (<DtypeKind.INT: 0>, 64, 'l', '=')
validity_buff = PandasBuffer({'bufsize': 4, 'ptr': 1475420176,
'device': 'CPU'})
validity_dtype = (<DtypeKind.BOOL: 20>, 8, 'b', '=')
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E OverflowError: Python int too large to convert to C ssize_t
pyarrow/io.pxi:1990: OverflowError
__________________________________________________
test_pandas_roundtrip_categorical
__________________________________________________
@pytest.mark.pandas
def test_pandas_roundtrip_categorical():
if Version(pd.__version__) < Version("2.0.2"):
pytest.skip("Bitmasks not supported in pandas interchange implementation")
arr = ["Mon", "Tue", "Mon", "Wed", "Mon", "Thu", "Fri", "Sat", None]
table = pa.table(
{"weekday": pa.array(arr).dictionary_encode()}
)
from pandas.api.interchange import (
from_dataframe as pandas_from_dataframe
)
pandas_df = pandas_from_dataframe(table)
> result = pi.from_dataframe(pandas_df)
arr = ['Mon', 'Tue', 'Mon', 'Wed', 'Mon', 'Thu', 'Fri', 'Sat', None]
pandas_df = weekday
0 Mon
1 Tue
2 Mon
3 Wed
4 Mon
5 Thu
6 Fri
7 Sat
8 NaN
pandas_from_dataframe = <function from_dataframe at 0xebbaa398>
table = pyarrow.Table
weekday: dictionary<values=string, indices=int32, ordered=0>
----
weekday: [ -- dictionary:
["Mon","Tue","Wed","Thu","Fri","Sat"] -- indices:
[0,1,0,2,0,3,4,5,null]]
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/interchange/test_conversion.py:257:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:113:
in from_dataframe
return _from_dataframe(df.__dataframe__(allow_copy=allow_copy),
allow_copy = True
df = weekday
0 Mon
1 Tue
2 Mon
3 Wed
4 Mon
5 Thu
6 Fri
7 Sat
8 NaN
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:136:
in _from_dataframe
batch = protocol_df_chunk_to_pyarrow(chunk, allow_copy)
allow_copy = True
batches = []
chunk = <pandas.core.interchange.dataframe.PandasDataFrameXchg
object at 0xd9e217f0>
df = <pandas.core.interchange.dataframe.PandasDataFrameXchg
object at 0xd9e217f0>
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:186:
in protocol_df_chunk_to_pyarrow
columns[name] = categorical_column_to_dictionary(col, allow_copy)
allow_copy = True
col = <pandas.core.interchange.column.PandasColumn object at
0xda180550>
columns = {}
df = <pandas.core.interchange.dataframe.PandasDataFrameXchg
object at 0xd9e217f0>
dtype = <DtypeKind.CATEGORICAL: 23>
name = 'weekday'
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:293:
in categorical_column_to_dictionary
dictionary = column_to_array(cat_column)
allow_copy = True
cat_column = <pandas.core.interchange.column.PandasColumn object at
0xda1801d0>
categorical = {'categories':
<pandas.core.interchange.column.PandasColumn object at 0xda1801d0>,
'is_dictionary': True,
'is_ordered': False}
col = <pandas.core.interchange.column.PandasColumn object at
0xda180550>
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:214:
in column_to_array
data = buffers_to_array(buffers, data_type,
allow_copy = True
buffers = {'data': (PandasBuffer({'bufsize': 18, 'ptr':
3659006432, 'device': 'CPU'}),
(<DtypeKind.STRING: 21>, 8, 'u', '=')),
'offsets': (PandasBuffer({'bufsize': 56, 'ptr': 1466456352, 'device':
'CPU'}),
(<DtypeKind.INT: 0>, 64, 'l', '=')),
'validity': (PandasBuffer({'bufsize': 6, 'ptr': 1477427216, 'device':
'CPU'}),
(<DtypeKind.BOOL: 20>, 8, 'b', '='))}
col = <pandas.core.interchange.column.PandasColumn object at
0xda1801d0>
data_type = (<DtypeKind.STRING: 21>, 8, 'u', '=')
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:396:
in buffers_to_array
data_pa_buffer = pa.foreign_buffer(data_buff.ptr, data_buff.bufsize,
_ = (<DtypeKind.STRING: 21>, 8, 'u', '=')
allow_copy = True
buffers = {'data': (PandasBuffer({'bufsize': 18, 'ptr':
3659006432, 'device': 'CPU'}),
(<DtypeKind.STRING: 21>, 8, 'u', '=')),
'offsets': (PandasBuffer({'bufsize': 56, 'ptr': 1466456352, 'device':
'CPU'}),
(<DtypeKind.INT: 0>, 64, 'l', '=')),
'validity': (PandasBuffer({'bufsize': 6, 'ptr': 1477427216, 'device':
'CPU'}),
(<DtypeKind.BOOL: 20>, 8, 'b', '='))}
data_buff = PandasBuffer({'bufsize': 18, 'ptr': 3659006432,
'device': 'CPU'})
data_type = (<DtypeKind.STRING: 21>, 8, 'u', '=')
describe_null = (<ColumnNullType.USE_BYTEMASK: 4>, 0)
length = 6
offset = 0
offset_buff = PandasBuffer({'bufsize': 56, 'ptr': 1466456352,
'device': 'CPU'})
offset_dtype = (<DtypeKind.INT: 0>, 64, 'l', '=')
validity_buff = PandasBuffer({'bufsize': 6, 'ptr': 1477427216,
'device': 'CPU'})
validity_dtype = (<DtypeKind.BOOL: 20>, 8, 'b', '=')
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E OverflowError: Python int too large to convert to C ssize_t
pyarrow/io.pxi:1990: OverflowError
________________________________________________________
test_empty_dataframe _________________________________________________________
def test_empty_dataframe():
schema = pa.schema([('col1', pa.int8())])
df = pa.table([[]], schema=schema)
dfi = df.__dataframe__()
> assert pi.from_dataframe(dfi) == df
df = pyarrow.Table
col1: int8
----
col1: [[]]
dfi = <pyarrow.interchange.dataframe._PyArrowDataFrame object at
0xd98381d0>
schema = col1: int8
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/interchange/test_conversion.py:522:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:113:
in from_dataframe
return _from_dataframe(df.__dataframe__(allow_copy=allow_copy),
allow_copy = True
df = <pyarrow.interchange.dataframe._PyArrowDataFrame object
at 0xd98381d0>
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:140:
in _from_dataframe
batch = protocol_df_chunk_to_pyarrow(df)
allow_copy = True
batches = []
df = <pyarrow.interchange.dataframe._PyArrowDataFrame object
at 0xd96e41b0>
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:182:
in protocol_df_chunk_to_pyarrow
columns[name] = column_to_array(col, allow_copy)
allow_copy = True
col = <pyarrow.interchange.column._PyArrowColumn object at
0xd96a6650>
columns = {}
df = <pyarrow.interchange.dataframe._PyArrowDataFrame object
at 0xd96e41b0>
dtype = <DtypeKind.INT: 0>
name = 'col1'
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:214:
in column_to_array
data = buffers_to_array(buffers, data_type,
allow_copy = True
buffers = {'data': (PyArrowBuffer({'bufsize': 0, 'ptr':
4122363392, 'device': 'CPU'}),
(<DtypeKind.INT: 0>, 8, 'c', '=')),
'offsets': None,
'validity': None}
col = <pyarrow.interchange.column._PyArrowColumn object at
0xd96a6650>
data_type = (<DtypeKind.INT: 0>, 8, 'c', '=')
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:396:
in buffers_to_array
data_pa_buffer = pa.foreign_buffer(data_buff.ptr, data_buff.bufsize,
_ = (<DtypeKind.INT: 0>, 8, 'c', '=')
allow_copy = True
buffers = {'data': (PyArrowBuffer({'bufsize': 0, 'ptr':
4122363392, 'device': 'CPU'}),
(<DtypeKind.INT: 0>, 8, 'c', '=')),
'offsets': None,
'validity': None}
data_buff = PyArrowBuffer({'bufsize': 0, 'ptr': 4122363392,
'device': 'CPU'})
data_type = (<DtypeKind.INT: 0>, 8, 'c', '=')
describe_null = (<ColumnNullType.NON_NULLABLE: 0>, None)
length = 0
offset = 0
offset_buff = None
validity_buff = None
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E OverflowError: Python int too large to convert to C ssize_t
pyarrow/io.pxi:1990: OverflowError
```
</details>
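Several of these failures look like pointer/offset-width issues rather than logic bugs. A quick sketch of the arithmetic behind three of them (pure NumPy/stdlib, no PyArrow needed; this is a diagnosis hypothesis, not a confirmed root cause):

```python
import numpy as np

# test_python_file_large_seeks: seek(2**32 + 5) returning 5 is exactly
# what truncation of the offset to an unsigned 32-bit value would give.
offset = 2**32 + 5
assert offset & 0xFFFFFFFF == 5

# test_memory_map_large_seeks: here the full 64-bit offset survived
# (4294967301 == 2**32 + 5), but the 10-byte file cannot satisfy it.
assert 2**32 + 5 == 4294967301

# test_from_numpy_nested: the object field 'z' stores a PyObject*, so
# the structured dtype's itemsize shrinks with the pointer width
# (2 + 2 + 8 = 12 on 64-bit, 2 + 2 + 4 = 8 on 32-bit).
assert np.dtype(object).itemsize == np.dtype(np.uintp).itemsize
```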
Full build & test log (2.5M):
[pyarrow.txt](https://github.com/apache/arrow/files/14343715/pyarrow.txt)
This is arrow 15.0.0 on Gentoo, running in an x86 systemd-nspawn container. I've used `-O2 -march=pentium-m -mfpmath=sse -pipe` as compiler flags to rule out i387-specific issues.
```
>>> pyarrow.show_info()
pyarrow version info
--------------------
Package kind : not indicated
Arrow C++ library version : 15.0.0
Arrow C++ compiler : GNU 13.2.1
Arrow C++ compiler flags : -O2 -march=pentium-m -mfpmath=sse -pipe
Arrow C++ git revision :
Arrow C++ git description :
Arrow C++ build type : relwithdebinfo
Platform:
OS / Arch : Linux x86_64
SIMD Level : avx2
Detected SIMD Level : avx2
Memory:
Default backend : system
Bytes allocated : 0 bytes
Max memory : 0 bytes
Supported Backends : system
Optional modules:
csv : Enabled
cuda : -
dataset : Enabled
feather : Enabled
flight : -
fs : Enabled
gandiva : -
json : Enabled
orc : -
parquet : Enabled
Filesystems:
GcsFileSystem : -
HadoopFileSystem : Enabled
S3FileSystem : -
Compression Codecs:
brotli : Enabled
bz2 : Enabled
gzip : Enabled
lz4_frame : Enabled
lz4 : Enabled
snappy : Enabled
zstd : Enabled
```
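The repeated `OverflowError: Python int too large to convert to C ssize_t` raised from `pa.foreign_buffer` would be consistent with buffer addresses landing in the upper 2 GiB of the 32-bit address space, where they no longer fit in a signed 32-bit type. A sanity check using a `ptr` value copied from the tracebacks above (the threshold is an assumption about where the conversion is signed):

```python
# A signed 32-bit ssize_t tops out at 2**31 - 1.
SSIZE_MAX_32BIT = 2**31 - 1

# 'ptr' value taken from the failing buffers dict in the tracebacks.
ptr = 3879523528

# Too large for a signed 32-bit conversion...
assert ptr > SSIZE_MAX_32BIT

# ...but a perfectly valid unsigned 32-bit address.
assert ptr < 2**32
```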
Some of these might be problems inside pandas. I'm going to file a bug about
the test failures there in a minute, and link it here afterwards.
### Component(s)
Python