mgorny opened a new issue, #40153:
URL: https://github.com/apache/arrow/issues/40153

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   When running the test suite on 32-bit x86, I'm getting the following test failures:
   
   ```
   FAILED tests/test_array.py::test_dictionary_to_numpy - TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe'
   FAILED tests/test_io.py::test_python_file_large_seeks - assert 5 == ((2 ** 32) + 5)
   FAILED tests/test_io.py::test_memory_map_large_seeks - OSError: Read out of bounds (offset = 4294967301, size = 5) in file of size 10
   FAILED tests/test_pandas.py::TestConvertStructTypes::test_from_numpy_nested - AssertionError: assert 8 == 12
   FAILED tests/test_schema.py::test_schema_sizeof - assert 28 > 30
   FAILED tests/interchange/test_conversion.py::test_pandas_roundtrip_string - OverflowError: Python int too large to convert to C ssize_t
   FAILED tests/interchange/test_conversion.py::test_pandas_roundtrip_large_string - OverflowError: Python int too large to convert to C ssize_t
   FAILED tests/interchange/test_conversion.py::test_pandas_roundtrip_string_with_missing - OverflowError: Python int too large to convert to C ssize_t
   FAILED tests/interchange/test_conversion.py::test_pandas_roundtrip_categorical - OverflowError: Python int too large to convert to C ssize_t
   FAILED tests/interchange/test_conversion.py::test_empty_dataframe - OverflowError: Python int too large to convert to C ssize_t
   ```
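
   As a sanity check (my own illustration, not code from the Arrow test suite), the first two failures match what 32-bit limits would predict: NumPy's `'safe'` casting rule never allows narrowing `int64` to `int32` (and on 32-bit x86 `np.intp` is int32, so the 64-bit dictionary indices hit that rule), and the position `5` returned by the large seek is exactly `2 ** 32 + 5` truncated to an unsigned 32-bit offset:

   ```python
   import numpy as np

   # 'safe' casting never permits int64 -> int32 narrowing; on 32-bit
   # builds np.intp is int32, which is what to_numpy() runs into:
   print(np.can_cast(np.int64, np.int32, casting="safe"))  # False

   # The failing assertion "assert 5 == ((2 ** 32) + 5)": reducing the
   # offset modulo 2**32 yields exactly the observed position:
   print((2 ** 32 + 5) % 2 ** 32)  # 5
   ```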
   
   <details>
   <summary>Tracebacks</summary>
   
   ```pytb
   ================================================ FAILURES ================================================
   ________________________________________ test_dictionary_to_numpy ________________________________________
   
   obj = array([13.7, 11. ]), method = 'take', args = (array([0, 1, 1, 0], dtype=int64),)
   kwds = {'axis': None, 'mode': 'raise', 'out': None}, bound = <built-in method take of numpy.ndarray object at 0xeaca6ad0>
   
       def _wrapfunc(obj, method, *args, **kwds):
           bound = getattr(obj, method, None)
           if bound is None:
               return _wrapit(obj, method, *args, **kwds)
       
           try:
   >           return bound(*args, **kwds)
   E           TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe'
   
   args       = (array([0, 1, 1, 0], dtype=int64),)
   bound      = <built-in method take of numpy.ndarray object at 0xeaca6ad0>
   kwds       = {'axis': None, 'mode': 'raise', 'out': None}
   method     = 'take'
   obj        = array([13.7, 11. ])
   
   /usr/lib/python3.11/site-packages/numpy/core/fromnumeric.py:59: TypeError
   
   During handling of the above exception, another exception occurred:
   
       def test_dictionary_to_numpy():
           expected = pa.array(
               ["foo", "bar", None, "foo"]
           ).to_numpy(zero_copy_only=False)
           a = pa.DictionaryArray.from_arrays(
               pa.array([0, 1, None, 0]),
               pa.array(['foo', 'bar'])
           )
           np.testing.assert_array_equal(a.to_numpy(zero_copy_only=False),
                                         expected)
       
           with pytest.raises(pa.ArrowInvalid):
               # If this would be changed to no longer raise in the future,
            # ensure to test the actual result because, currently, to_numpy takes
            # for granted that when zero_copy_only=True there will be no nulls
            # (it's the decoding of the DictionaryArray that handles the nulls and
            # this is only activated with zero_copy_only=False)
               a.to_numpy(zero_copy_only=True)
       
           anonulls = pa.DictionaryArray.from_arrays(
               pa.array([0, 1, 1, 0]),
               pa.array(['foo', 'bar'])
           )
           expected = pa.array(
               ["foo", "bar", "bar", "foo"]
           ).to_numpy(zero_copy_only=False)
           
        np.testing.assert_array_equal(anonulls.to_numpy(zero_copy_only=False),
                                      expected)
       
           with pytest.raises(pa.ArrowInvalid):
               anonulls.to_numpy(zero_copy_only=True)
       
           afloat = pa.DictionaryArray.from_arrays(
               pa.array([0, 1, 1, 0]),
               pa.array([13.7, 11.0])
           )
           expected = pa.array([13.7, 11.0, 11.0, 13.7]).to_numpy()
   >       np.testing.assert_array_equal(afloat.to_numpy(zero_copy_only=True),
                                         expected)
   
   a          = <pyarrow.lib.DictionaryArray object at 0xeafe6ed0>
   
   -- dictionary:
     [
       "foo",
       "bar"
     ]
   -- indices:
     [
       0,
       1,
       null,
       0
     ]
   afloat     = <pyarrow.lib.DictionaryArray object at 0xeafe6fb0>
   
   -- dictionary:
     [
       13.7,
       11
     ]
   -- indices:
     [
       0,
       1,
       1,
       0
     ]
   anonulls   = <pyarrow.lib.DictionaryArray object at 0xeafe6e60>
   
   -- dictionary:
     [
       "foo",
       "bar"
     ]
   -- indices:
     [
       0,
       1,
       1,
       0
     ]
   expected   = array([13.7, 11. , 11. , 13.7])
   
   
   ../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/test_array.py:823:
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
   pyarrow/array.pxi:1590: in pyarrow.lib.Array.to_numpy
       ???
   /usr/lib/python3.11/site-packages/numpy/core/fromnumeric.py:192: in take
       return _wrapfunc(a, 'take', indices, axis=axis, out=out, mode=mode)
           a          = array([13.7, 11. ])
           axis       = None
           indices    = array([0, 1, 1, 0], dtype=int64)
           mode       = 'raise'
           out        = None
   /usr/lib/python3.11/site-packages/numpy/core/fromnumeric.py:68: in _wrapfunc
       return _wrapit(obj, method, *args, **kwds)
           args       = (array([0, 1, 1, 0], dtype=int64),)
           bound      = <built-in method take of numpy.ndarray object at 
0xeaca6ad0>
           kwds       = {'axis': None, 'mode': 'raise', 'out': None}
           method     = 'take'
           obj        = array([13.7, 11. ])
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
   
   obj = array([13.7, 11. ]), method = 'take', args = (array([0, 1, 1, 0], dtype=int64),)
   kwds = {'axis': None, 'mode': 'raise', 'out': None}, wrap = <built-in method __array_wrap__ of numpy.ndarray object at 0xeaca6ad0>
   
       def _wrapit(obj, method, *args, **kwds):
           try:
               wrap = obj.__array_wrap__
           except AttributeError:
               wrap = None
   >       result = getattr(asarray(obj), method)(*args, **kwds)
   E       TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe'
   
   args       = (array([0, 1, 1, 0], dtype=int64),)
   kwds       = {'axis': None, 'mode': 'raise', 'out': None}
   method     = 'take'
   obj        = array([13.7, 11. ])
    wrap       = <built-in method __array_wrap__ of numpy.ndarray object at 0xeaca6ad0>
   
   /usr/lib/python3.11/site-packages/numpy/core/fromnumeric.py:45: TypeError
   _______________________________________ test_python_file_large_seeks _______________________________________
   
       def test_python_file_large_seeks():
           def factory(filename):
               return pa.PythonFile(open(filename, 'rb'))
       
   >       check_large_seeks(factory)
   
   factory    = <function test_python_file_large_seeks.<locals>.factory at 0xe13b6de8>
   
   
   ../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/test_io.py:262:
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
   
   file_factory = <function test_python_file_large_seeks.<locals>.factory at 0xe13b6de8>
   
       def check_large_seeks(file_factory):
           if sys.platform in ('win32', 'darwin'):
               pytest.skip("need sparse file support")
           try:
               filename = tempfile.mktemp(prefix='test_io')
               with open(filename, 'wb') as f:
                   f.truncate(2 ** 32 + 10)
                   f.seek(2 ** 32 + 5)
                   f.write(b'mark\n')
               with file_factory(filename) as f:
   >               assert f.seek(2 ** 32 + 5) == 2 ** 32 + 5
   E               assert 5 == ((2 ** 32) + 5)
   E                +  where 5 = <bound method NativeFile.seek of <pyarrow.PythonFile closed=False own_file=False is_seekable=True is_writable=False is_readable=True>>(((2 ** 32) + 5))
   E                +    where <bound method NativeFile.seek of <pyarrow.PythonFile closed=False own_file=False is_seekable=True is_writable=False is_readable=True>> = <pyarrow.PythonFile closed=False own_file=False is_seekable=True is_writable=False is_readable=True>.seek
   
   f          = <pyarrow.PythonFile closed=True own_file=False is_seekable=True is_writable=False is_readable=True>
   file_factory = <function test_python_file_large_seeks.<locals>.factory at 0xe13b6de8>
   filename   = '/var/tmp/portage/dev-python/pyarrow-15.0.0/temp/test_ioj_p6zuld'
   
   
   ../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/test_io.py:49: AssertionError
   _______________________________________ test_memory_map_large_seeks _______________________________________
   
       def test_memory_map_large_seeks():
   >       check_large_seeks(pa.memory_map)
   
   
   
   ../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/test_io.py:1140:
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
   
   ../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/test_io.py:51: in check_large_seeks
       assert f.read(5) == b'mark\n'
        f          = <pyarrow.MemoryMappedFile closed=True own_file=False is_seekable=True is_writable=False is_readable=True>
        file_factory = <cyfunction memory_map at 0xf228e778>
        filename   = '/var/tmp/portage/dev-python/pyarrow-15.0.0/temp/test_iozl2wxbou'
   pyarrow/io.pxi:409: in pyarrow.lib.NativeFile.read
       ???
   pyarrow/error.pxi:154: in pyarrow.lib.pyarrow_internal_check_status
       ???
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
   
   >   ???
   E   OSError: Read out of bounds (offset = 4294967301, size = 5) in file of size 10
   
   
   pyarrow/error.pxi:91: OSError
   __________________________________ TestConvertStructTypes.test_from_numpy_nested __________________________________
   
   self = <pyarrow.tests.test_pandas.TestConvertStructTypes object at 0xeb535d90>
   
       def test_from_numpy_nested(self):
           # Note: an object field inside a struct
           dt = np.dtype([('x', np.dtype([('xx', np.int8),
                                          ('yy', np.bool_)])),
                          ('y', np.int16),
                          ('z', np.object_)])
           # Note: itemsize is not a multiple of sizeof(object)
   >       assert dt.itemsize == 12
   E       AssertionError: assert 8 == 12
   E        +  where 8 = dtype([('x', [('xx', 'i1'), ('yy', '?')]), ('y', '<i2'), ('z', 'O')]).itemsize

   dt         = dtype([('x', [('xx', 'i1'), ('yy', '?')]), ('y', '<i2'), ('z', 'O')])
   self       = <pyarrow.tests.test_pandas.TestConvertStructTypes object at 0xeb535d90>
   
   
   ../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/test_pandas.py:2604: AssertionError
   ___________________________________________ test_schema_sizeof ___________________________________________
   
       def test_schema_sizeof():
           schema = pa.schema([
               pa.field('foo', pa.int32()),
               pa.field('bar', pa.string()),
           ])
       
   >       assert sys.getsizeof(schema) > 30
   E       assert 28 > 30
   E        +  where 28 = <built-in function getsizeof>(foo: int32\nbar: string)
   E        +    where <built-in function getsizeof> = sys.getsizeof
   
   schema     = foo: int32
   bar: string
   
   
   ../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/test_schema.py:684: AssertionError
   _______________________________________ test_pandas_roundtrip_string _______________________________________
   
       @pytest.mark.pandas
       def test_pandas_roundtrip_string():
           # See https://github.com/pandas-dev/pandas/issues/50554
           if Version(pd.__version__) < Version("1.6"):
               pytest.skip("Column.size() bug in pandas")
       
           arr = ["a", "", "c"]
           table = pa.table({"a": pa.array(arr)})
       
           from pandas.api.interchange import (
               from_dataframe as pandas_from_dataframe
           )
       
           pandas_df = pandas_from_dataframe(table)
   >       result = pi.from_dataframe(pandas_df)
   
   arr        = ['a', '', 'c']
   pandas_df  =    a
   0  a
   1   
   2  c
   pandas_from_dataframe = <function from_dataframe at 0xebbaa398>
   table      = pyarrow.Table
   a: string
   ----
   a: [["a","","c"]]
   
   
   ../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/interchange/test_conversion.py:159:
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
   
   ../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:113: in from_dataframe
       return _from_dataframe(df.__dataframe__(allow_copy=allow_copy),
           allow_copy = True
           df         =    a
   0  a
   1   
   2  c
   
   ../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:136: in _from_dataframe
       batch = protocol_df_chunk_to_pyarrow(chunk, allow_copy)
           allow_copy = True
           batches    = []
           chunk      = <pandas.core.interchange.dataframe.PandasDataFrameXchg 
object at 0xda1c61f0>
           df         = <pandas.core.interchange.dataframe.PandasDataFrameXchg 
object at 0xda1c61f0>
   
   ../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:182: in protocol_df_chunk_to_pyarrow
       columns[name] = column_to_array(col, allow_copy)
           allow_copy = True
           col        = <pandas.core.interchange.column.PandasColumn object at 
0xda1c65b0>
           columns    = {}
           df         = <pandas.core.interchange.dataframe.PandasDataFrameXchg 
object at 0xda1c61f0>
           dtype      = <DtypeKind.STRING: 21>
           name       = 'a'
   
   ../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:214: in column_to_array
       data = buffers_to_array(buffers, data_type,
           allow_copy = True
           buffers    = {'data': (PandasBuffer({'bufsize': 2, 'ptr': 
3879523528, 'device': 'CPU'}),
             (<DtypeKind.STRING: 21>, 8, 'u', '=')),
    'offsets': (PandasBuffer({'bufsize': 32, 'ptr': 1530035680, 'device': 
'CPU'}),
                (<DtypeKind.INT: 0>, 64, 'l', '=')),
    'validity': (PandasBuffer({'bufsize': 3, 'ptr': 1529980112, 'device': 
'CPU'}),
                 (<DtypeKind.BOOL: 20>, 8, 'b', '='))}
           col        = <pandas.core.interchange.column.PandasColumn object at 
0xda1c65b0>
           data_type  = (<DtypeKind.STRING: 21>, 8, 'u', '=')
   
   ../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:396: in buffers_to_array
       data_pa_buffer = pa.foreign_buffer(data_buff.ptr, data_buff.bufsize,
           _          = (<DtypeKind.STRING: 21>, 8, 'u', '=')
           allow_copy = True
           buffers    = {'data': (PandasBuffer({'bufsize': 2, 'ptr': 
3879523528, 'device': 'CPU'}),
             (<DtypeKind.STRING: 21>, 8, 'u', '=')),
    'offsets': (PandasBuffer({'bufsize': 32, 'ptr': 1530035680, 'device': 
'CPU'}),
                (<DtypeKind.INT: 0>, 64, 'l', '=')),
    'validity': (PandasBuffer({'bufsize': 3, 'ptr': 1529980112, 'device': 
'CPU'}),
                 (<DtypeKind.BOOL: 20>, 8, 'b', '='))}
           data_buff  = PandasBuffer({'bufsize': 2, 'ptr': 3879523528, 
'device': 'CPU'})
           data_type  = (<DtypeKind.STRING: 21>, 8, 'u', '=')
           describe_null = (<ColumnNullType.USE_BYTEMASK: 4>, 0)
           length     = 3
           offset     = 0
           offset_buff = PandasBuffer({'bufsize': 32, 'ptr': 1530035680, 
'device': 'CPU'})
           offset_dtype = (<DtypeKind.INT: 0>, 64, 'l', '=')
           validity_buff = PandasBuffer({'bufsize': 3, 'ptr': 1529980112, 
'device': 'CPU'})
           validity_dtype = (<DtypeKind.BOOL: 20>, 8, 'b', '=')
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
   
   >   ???
   E   OverflowError: Python int too large to convert to C ssize_t
   
   
   pyarrow/io.pxi:1990: OverflowError
   ____________________________________ test_pandas_roundtrip_large_string ____________________________________
   
       @pytest.mark.pandas
       def test_pandas_roundtrip_large_string():
           # See https://github.com/pandas-dev/pandas/issues/50554
           if Version(pd.__version__) < Version("1.6"):
               pytest.skip("Column.size() bug in pandas")
       
           arr = ["a", "", "c"]
           table = pa.table({"a_large": pa.array(arr, type=pa.large_string())})
       
           from pandas.api.interchange import (
               from_dataframe as pandas_from_dataframe
           )
       
           if Version(pd.__version__) >= Version("2.0.1"):
               pandas_df = pandas_from_dataframe(table)
   >           result = pi.from_dataframe(pandas_df)
   
   arr        = ['a', '', 'c']
   pandas_df  =   a_large
   0       a
   1        
   2       c
   pandas_from_dataframe = <function from_dataframe at 0xebbaa398>
   table      = pyarrow.Table
   a_large: large_string
   ----
   a_large: [["a","","c"]]
   
   
   ../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/interchange/test_conversion.py:189:
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
   
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:113:
 in from_dataframe
       return _from_dataframe(df.__dataframe__(allow_copy=allow_copy),
           allow_copy = True
           df         =   a_large
   0       a
   1        
   2       c
   
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:136:
 in _from_dataframe
       batch = protocol_df_chunk_to_pyarrow(chunk, allow_copy)
           allow_copy = True
           batches    = []
           chunk      = <pandas.core.interchange.dataframe.PandasDataFrameXchg 
object at 0xda103a10>
           df         = <pandas.core.interchange.dataframe.PandasDataFrameXchg 
object at 0xda103a10>
   
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:182:
 in protocol_df_chunk_to_pyarrow
       columns[name] = column_to_array(col, allow_copy)
           allow_copy = True
           col        = <pandas.core.interchange.column.PandasColumn object at 
0xda1033d0>
           columns    = {}
           df         = <pandas.core.interchange.dataframe.PandasDataFrameXchg 
object at 0xda103a10>
           dtype      = <DtypeKind.STRING: 21>
           name       = 'a_large'
   
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:214:
 in column_to_array
       data = buffers_to_array(buffers, data_type,
           allow_copy = True
           buffers    = {'data': (PandasBuffer({'bufsize': 2, 'ptr': 
3879522800, 'device': 'CPU'}),
             (<DtypeKind.STRING: 21>, 8, 'u', '=')),
    'offsets': (PandasBuffer({'bufsize': 32, 'ptr': 1480303312, 'device': 
'CPU'}),
                (<DtypeKind.INT: 0>, 64, 'l', '=')),
    'validity': (PandasBuffer({'bufsize': 3, 'ptr': 1478277616, 'device': 
'CPU'}),
                 (<DtypeKind.BOOL: 20>, 8, 'b', '='))}
           col        = <pandas.core.interchange.column.PandasColumn object at 
0xda1033d0>
           data_type  = (<DtypeKind.STRING: 21>, 8, 'u', '=')
   
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:396:
 in buffers_to_array
       data_pa_buffer = pa.foreign_buffer(data_buff.ptr, data_buff.bufsize,
           _          = (<DtypeKind.STRING: 21>, 8, 'u', '=')
           allow_copy = True
           buffers    = {'data': (PandasBuffer({'bufsize': 2, 'ptr': 
3879522800, 'device': 'CPU'}),
             (<DtypeKind.STRING: 21>, 8, 'u', '=')),
    'offsets': (PandasBuffer({'bufsize': 32, 'ptr': 1480303312, 'device': 
'CPU'}),
                (<DtypeKind.INT: 0>, 64, 'l', '=')),
    'validity': (PandasBuffer({'bufsize': 3, 'ptr': 1478277616, 'device': 
'CPU'}),
                 (<DtypeKind.BOOL: 20>, 8, 'b', '='))}
           data_buff  = PandasBuffer({'bufsize': 2, 'ptr': 3879522800, 
'device': 'CPU'})
           data_type  = (<DtypeKind.STRING: 21>, 8, 'u', '=')
           describe_null = (<ColumnNullType.USE_BYTEMASK: 4>, 0)
           length     = 3
           offset     = 0
           offset_buff = PandasBuffer({'bufsize': 32, 'ptr': 1480303312, 
'device': 'CPU'})
           offset_dtype = (<DtypeKind.INT: 0>, 64, 'l', '=')
           validity_buff = PandasBuffer({'bufsize': 3, 'ptr': 1478277616, 
'device': 'CPU'})
           validity_dtype = (<DtypeKind.BOOL: 20>, 8, 'b', '=')
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
   
   >   ???
   E   OverflowError: Python int too large to convert to C ssize_t
   
   
   pyarrow/io.pxi:1990: OverflowError
   _________________________________ test_pandas_roundtrip_string_with_missing _________________________________
   
       @pytest.mark.pandas
       def test_pandas_roundtrip_string_with_missing():
           # See https://github.com/pandas-dev/pandas/issues/50554
           if Version(pd.__version__) < Version("1.6"):
               pytest.skip("Column.size() bug in pandas")
       
           arr = ["a", "", "c", None]
           table = pa.table({"a": pa.array(arr),
                             "a_large": pa.array(arr, type=pa.large_string())})
       
           from pandas.api.interchange import (
               from_dataframe as pandas_from_dataframe
           )
       
           if Version(pd.__version__) >= Version("2.0.2"):
               pandas_df = pandas_from_dataframe(table)
   >           result = pi.from_dataframe(pandas_df)
   
   arr        = ['a', '', 'c', None]
   pandas_df  =      a a_large
   0    a       a
   1             
   2    c       c
   3  NaN     NaN
   pandas_from_dataframe = <function from_dataframe at 0xebbaa398>
   table      = pyarrow.Table
   a: string
   a_large: large_string
   ----
   a: [["a","","c",null]]
   a_large: [["a","","c",null]]
   
   
   ../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/interchange/test_conversion.py:227:
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
   
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:113:
 in from_dataframe
       return _from_dataframe(df.__dataframe__(allow_copy=allow_copy),
           allow_copy = True
           df         =      a a_large
   0    a       a
   1             
   2    c       c
   3  NaN     NaN
   
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:136:
 in _from_dataframe
       batch = protocol_df_chunk_to_pyarrow(chunk, allow_copy)
           allow_copy = True
           batches    = []
           chunk      = <pandas.core.interchange.dataframe.PandasDataFrameXchg 
object at 0xda15b850>
           df         = <pandas.core.interchange.dataframe.PandasDataFrameXchg 
object at 0xda15b850>
   
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:182:
 in protocol_df_chunk_to_pyarrow
       columns[name] = column_to_array(col, allow_copy)
           allow_copy = True
           col        = <pandas.core.interchange.column.PandasColumn object at 
0xda103210>
           columns    = {}
           df         = <pandas.core.interchange.dataframe.PandasDataFrameXchg 
object at 0xda15b850>
           dtype      = <DtypeKind.STRING: 21>
           name       = 'a'
   
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:214:
 in column_to_array
       data = buffers_to_array(buffers, data_type,
           allow_copy = True
           buffers    = {'data': (PandasBuffer({'bufsize': 2, 'ptr': 
3879523744, 'device': 'CPU'}),
             (<DtypeKind.STRING: 21>, 8, 'u', '=')),
    'offsets': (PandasBuffer({'bufsize': 40, 'ptr': 1469510752, 'device': 
'CPU'}),
                (<DtypeKind.INT: 0>, 64, 'l', '=')),
    'validity': (PandasBuffer({'bufsize': 4, 'ptr': 1475420176, 'device': 
'CPU'}),
                 (<DtypeKind.BOOL: 20>, 8, 'b', '='))}
           col        = <pandas.core.interchange.column.PandasColumn object at 
0xda103210>
           data_type  = (<DtypeKind.STRING: 21>, 8, 'u', '=')
   
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:396:
 in buffers_to_array
       data_pa_buffer = pa.foreign_buffer(data_buff.ptr, data_buff.bufsize,
           _          = (<DtypeKind.STRING: 21>, 8, 'u', '=')
           allow_copy = True
           buffers    = {'data': (PandasBuffer({'bufsize': 2, 'ptr': 
3879523744, 'device': 'CPU'}),
             (<DtypeKind.STRING: 21>, 8, 'u', '=')),
    'offsets': (PandasBuffer({'bufsize': 40, 'ptr': 1469510752, 'device': 
'CPU'}),
                (<DtypeKind.INT: 0>, 64, 'l', '=')),
    'validity': (PandasBuffer({'bufsize': 4, 'ptr': 1475420176, 'device': 
'CPU'}),
                 (<DtypeKind.BOOL: 20>, 8, 'b', '='))}
           data_buff  = PandasBuffer({'bufsize': 2, 'ptr': 3879523744, 
'device': 'CPU'})
           data_type  = (<DtypeKind.STRING: 21>, 8, 'u', '=')
           describe_null = (<ColumnNullType.USE_BYTEMASK: 4>, 0)
           length     = 4
           offset     = 0
           offset_buff = PandasBuffer({'bufsize': 40, 'ptr': 1469510752, 
'device': 'CPU'})
           offset_dtype = (<DtypeKind.INT: 0>, 64, 'l', '=')
           validity_buff = PandasBuffer({'bufsize': 4, 'ptr': 1475420176, 
'device': 'CPU'})
           validity_dtype = (<DtypeKind.BOOL: 20>, 8, 'b', '=')
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
   
   >   ???
   E   OverflowError: Python int too large to convert to C ssize_t
   
   
   pyarrow/io.pxi:1990: OverflowError
   ____________________________________ test_pandas_roundtrip_categorical ____________________________________
   
       @pytest.mark.pandas
       def test_pandas_roundtrip_categorical():
           if Version(pd.__version__) < Version("2.0.2"):
               pytest.skip("Bitmasks not supported in pandas interchange 
implementation")
       
           arr = ["Mon", "Tue", "Mon", "Wed", "Mon", "Thu", "Fri", "Sat", None]
           table = pa.table(
               {"weekday": pa.array(arr).dictionary_encode()}
           )
       
           from pandas.api.interchange import (
               from_dataframe as pandas_from_dataframe
           )
           pandas_df = pandas_from_dataframe(table)
   >       result = pi.from_dataframe(pandas_df)
   
   arr        = ['Mon', 'Tue', 'Mon', 'Wed', 'Mon', 'Thu', 'Fri', 'Sat', None]
   pandas_df  =   weekday
   0     Mon
   1     Tue
   2     Mon
   3     Wed
   4     Mon
   5     Thu
   6     Fri
   7     Sat
   8     NaN
   pandas_from_dataframe = <function from_dataframe at 0xebbaa398>
   table      = pyarrow.Table
   weekday: dictionary<values=string, indices=int32, ordered=0>
   ----
   weekday: [  -- dictionary:
   ["Mon","Tue","Wed","Thu","Fri","Sat"]  -- indices:
   [0,1,0,2,0,3,4,5,null]]
   
   
   ../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/interchange/test_conversion.py:257:
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
   
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:113:
 in from_dataframe
       return _from_dataframe(df.__dataframe__(allow_copy=allow_copy),
           allow_copy = True
           df         =   weekday
   0     Mon
   1     Tue
   2     Mon
   3     Wed
   4     Mon
   5     Thu
   6     Fri
   7     Sat
   8     NaN
   
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:136:
 in _from_dataframe
       batch = protocol_df_chunk_to_pyarrow(chunk, allow_copy)
           allow_copy = True
           batches    = []
           chunk      = <pandas.core.interchange.dataframe.PandasDataFrameXchg 
object at 0xd9e217f0>
           df         = <pandas.core.interchange.dataframe.PandasDataFrameXchg 
object at 0xd9e217f0>
   
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:186: in protocol_df_chunk_to_pyarrow
    columns[name] = categorical_column_to_dictionary(col, allow_copy)
        allow_copy = True
        col        = <pandas.core.interchange.column.PandasColumn object at 0xda180550>
        columns    = {}
        df         = <pandas.core.interchange.dataframe.PandasDataFrameXchg object at 0xd9e217f0>
        dtype      = <DtypeKind.CATEGORICAL: 23>
        name       = 'weekday'

../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:293: in categorical_column_to_dictionary
    dictionary = column_to_array(cat_column)
        allow_copy = True
        cat_column = <pandas.core.interchange.column.PandasColumn object at 0xda1801d0>
        categorical = {'categories': <pandas.core.interchange.column.PandasColumn object at 0xda1801d0>,
 'is_dictionary': True,
 'is_ordered': False}
        col        = <pandas.core.interchange.column.PandasColumn object at 0xda180550>
   
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:214: in column_to_array
    data = buffers_to_array(buffers, data_type,
        allow_copy = True
        buffers    = {'data': (PandasBuffer({'bufsize': 18, 'ptr': 3659006432, 'device': 'CPU'}),
          (<DtypeKind.STRING: 21>, 8, 'u', '=')),
 'offsets': (PandasBuffer({'bufsize': 56, 'ptr': 1466456352, 'device': 'CPU'}),
             (<DtypeKind.INT: 0>, 64, 'l', '=')),
 'validity': (PandasBuffer({'bufsize': 6, 'ptr': 1477427216, 'device': 'CPU'}),
              (<DtypeKind.BOOL: 20>, 8, 'b', '='))}
        col        = <pandas.core.interchange.column.PandasColumn object at 0xda1801d0>
        data_type  = (<DtypeKind.STRING: 21>, 8, 'u', '=')

../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:396: in buffers_to_array
    data_pa_buffer = pa.foreign_buffer(data_buff.ptr, data_buff.bufsize,
        _          = (<DtypeKind.STRING: 21>, 8, 'u', '=')
        allow_copy = True
        buffers    = {'data': (PandasBuffer({'bufsize': 18, 'ptr': 3659006432, 'device': 'CPU'}),
          (<DtypeKind.STRING: 21>, 8, 'u', '=')),
 'offsets': (PandasBuffer({'bufsize': 56, 'ptr': 1466456352, 'device': 'CPU'}),
             (<DtypeKind.INT: 0>, 64, 'l', '=')),
 'validity': (PandasBuffer({'bufsize': 6, 'ptr': 1477427216, 'device': 'CPU'}),
              (<DtypeKind.BOOL: 20>, 8, 'b', '='))}
        data_buff  = PandasBuffer({'bufsize': 18, 'ptr': 3659006432, 'device': 'CPU'})
        data_type  = (<DtypeKind.STRING: 21>, 8, 'u', '=')
        describe_null = (<ColumnNullType.USE_BYTEMASK: 4>, 0)
        length     = 6
        offset     = 0
        offset_buff = PandasBuffer({'bufsize': 56, 'ptr': 1466456352, 'device': 'CPU'})
        validity_buff = PandasBuffer({'bufsize': 6, 'ptr': 1477427216, 'device': 'CPU'})
        validity_dtype = (<DtypeKind.BOOL: 20>, 8, 'b', '=')
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   OverflowError: Python int too large to convert to C ssize_t


pyarrow/io.pxi:1990: OverflowError
________________________________________________________ test_empty_dataframe _________________________________________________________

    def test_empty_dataframe():
        schema = pa.schema([('col1', pa.int8())])
        df = pa.table([[]], schema=schema)
        dfi = df.__dataframe__()
>       assert pi.from_dataframe(dfi) == df

df         = pyarrow.Table
col1: int8
----
col1: [[]]
dfi        = <pyarrow.interchange.dataframe._PyArrowDataFrame object at 0xd98381d0>
schema     = col1: int8

../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/tests/interchange/test_conversion.py:522:

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
   
../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:113: in from_dataframe
    return _from_dataframe(df.__dataframe__(allow_copy=allow_copy),
        allow_copy = True
        df         = <pyarrow.interchange.dataframe._PyArrowDataFrame object at 0xd98381d0>

../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:140: in _from_dataframe
    batch = protocol_df_chunk_to_pyarrow(df)
        allow_copy = True
        batches    = []
        df         = <pyarrow.interchange.dataframe._PyArrowDataFrame object at 0xd96e41b0>

../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:182: in protocol_df_chunk_to_pyarrow
    columns[name] = column_to_array(col, allow_copy)
        allow_copy = True
        col        = <pyarrow.interchange.column._PyArrowColumn object at 0xd96a6650>
        columns    = {}
        df         = <pyarrow.interchange.dataframe._PyArrowDataFrame object at 0xd96e41b0>
        dtype      = <DtypeKind.INT: 0>
        name       = 'col1'

../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:214: in column_to_array
    data = buffers_to_array(buffers, data_type,
        allow_copy = True
        buffers    = {'data': (PyArrowBuffer({'bufsize': 0, 'ptr': 4122363392, 'device': 'CPU'}),
          (<DtypeKind.INT: 0>, 8, 'c', '=')),
 'offsets': None,
 'validity': None}
        col        = <pyarrow.interchange.column._PyArrowColumn object at 0xd96a6650>
        data_type  = (<DtypeKind.INT: 0>, 8, 'c', '=')

../work/apache-arrow-15.0.0/python-python3_11/install/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:396: in buffers_to_array
    data_pa_buffer = pa.foreign_buffer(data_buff.ptr, data_buff.bufsize,
        _          = (<DtypeKind.INT: 0>, 8, 'c', '=')
        allow_copy = True
        buffers    = {'data': (PyArrowBuffer({'bufsize': 0, 'ptr': 4122363392, 'device': 'CPU'}),
          (<DtypeKind.INT: 0>, 8, 'c', '=')),
 'offsets': None,
 'validity': None}
        data_buff  = PyArrowBuffer({'bufsize': 0, 'ptr': 4122363392, 'device': 'CPU'})
        data_type  = (<DtypeKind.INT: 0>, 8, 'c', '=')
        describe_null = (<ColumnNullType.NON_NULLABLE: 0>, None)
        length     = 0
        offset     = 0
        offset_buff = None
        validity_buff = None
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   OverflowError: Python int too large to convert to C ssize_t


pyarrow/io.pxi:1990: OverflowError
   ```
   </details>
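
All of the `OverflowError` tracebacks end in `pa.foreign_buffer(data_buff.ptr, data_buff.bufsize, ...)` at `pyarrow/io.pxi:1990`, and the error message says the Python int can't be converted to a C `ssize_t`. A minimal sketch of what that suggests, under the assumption that the buffer address is passed through a signed `ssize_t` (32 bits on x86): the `ptr` value from the traceback above sits in the upper half of the 32-bit address space, so it can't be represented as a signed 32-bit value.

```python
# Sketch of the suspected overflow: the buffer address from the
# traceback above does not fit a signed 32-bit C ssize_t.
ptr = 3659006432          # 'ptr' of the PandasBuffer in the traceback
SSIZE_MAX_32 = 2**31 - 1  # largest value a 32-bit ssize_t can hold

print(ptr > SSIZE_MAX_32)  # → True: converting it to ssize_t must overflow
```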
   
   Full build & test log (2.5M): 
[pyarrow.txt](https://github.com/apache/arrow/files/14343715/pyarrow.txt)
   
   This is Arrow 15.0.0 on Gentoo, running in an x86 systemd-nspawn container. I used 
`-O2 -march=pentium-m -mfpmath=sse -pipe` as compiler flags to rule out 
i387-specific issues (`-mfpmath=sse` forces SSE floating-point math instead of the x87 FPU).
   
   ```
   >>> pyarrow.show_info()
   pyarrow version info
   --------------------
   Package kind              : not indicated
   Arrow C++ library version : 15.0.0  
   Arrow C++ compiler        : GNU 13.2.1
   Arrow C++ compiler flags  : -O2 -march=pentium-m -mfpmath=sse -pipe
   Arrow C++ git revision    :         
   Arrow C++ git description :         
   Arrow C++ build type      : relwithdebinfo
   
   Platform:
     OS / Arch           : Linux x86_64
     SIMD Level          : avx2    
     Detected SIMD Level : avx2    
   
   Memory:
     Default backend     : system  
     Bytes allocated     : 0 bytes 
     Max memory          : 0 bytes 
     Supported Backends  : system  
   
   Optional modules:
     csv                 : Enabled 
     cuda                : -       
     dataset             : Enabled 
     feather             : Enabled 
     flight              : -       
     fs                  : Enabled 
     gandiva             : -       
     json                : Enabled 
     orc                 : -       
     parquet             : Enabled 
   
   Filesystems:
     GcsFileSystem       : -       
     HadoopFileSystem    : Enabled 
     S3FileSystem        : -       
   
   Compression Codecs:
     brotli              : Enabled 
     bz2                 : Enabled 
     gzip                : Enabled 
     lz4_frame           : Enabled 
     lz4                 : Enabled 
     snappy              : Enabled 
     zstd                : Enabled 
   ```
   
   Some of these might be problems inside pandas. I'm going to file a bug about 
the test failures there in a minute, and link it here afterwards.
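
   Independently of pandas, the two large-seek failures look consistent with a 64-bit file offset being truncated to 32 bits somewhere on the way to the I/O layer. This is a guess about the cause, not a confirmed diagnosis, but the arithmetic matches the failed assertion `assert 5 == ((2 ** 32) + 5)` exactly:

```python
# If a 64-bit seek offset is stored in a 32-bit integer, the high bits
# are dropped and 2**32 + 5 collapses to 5 -- matching the failed
# assertion in test_python_file_large_seeks.
offset = 2**32 + 5
truncated = offset & 0xFFFFFFFF  # what a 32-bit offset field would retain
print(truncated)  # → 5
```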
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
