edponce commented on issue #11559:
URL: https://github.com/apache/arrow/issues/11559#issuecomment-954659954
Recall that Arrow's format is column-oriented, so accessing rows is not an
efficient operation. I do not expect a meaningful difference between the
Python and Cython APIs for iterating across rows (you would need to measure).
Here are some code snippets using both APIs. Also, keep in mind that Cython's
API is smaller, and thus more limited, than Python's.
_Note_: The following examples assume there are no NULL values. You would
need to add those checks yourself, so treat the examples as templates.
**VERSION 1: Cython/Python**
**Cython code**
```python
from pyarrow.lib cimport *

# Helper function to extract a Scalar object from a column (CChunkedArray)
cdef shared_ptr[CScalar] get_scalar_from_chunked_array(
        shared_ptr[CChunkedArray] c_chunked_array, int index):
    cdef:
        shared_ptr[CArray] c_array
        CResult[shared_ptr[CScalar]] result
    # Iterate through chunks/rows until finding the corresponding index
    chunked_array = c_chunked_array.get()
    for ichunk in range(chunked_array.num_chunks()):
        c_array = chunked_array.chunk(ichunk)
        array = c_array.get()
        if index < array.length():
            result = array.GetScalar(index)
            # NOTE: GetResultValue is exposed to Cython directly from here
            # https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/common.h#L63
            return GetResultValue(result)
        # Update index relative to next chunk
        index = index - array.length()

def iterate_table(obj):
    cdef:
        shared_ptr[CTable] c_table = pyarrow_unwrap_table(obj)
        shared_ptr[CChunkedArray] c_chunked_array
        shared_ptr[CScalar] c_scalar
    table = c_table.get()
    if table == NULL:
        raise TypeError("not a table")
    # Arrow format is column-oriented, so iterate first on rows then columns
    for irow in range(table.num_rows()):
        for icol in range(table.num_columns()):
            c_chunked_array = table.column(icol)
            c_scalar = get_scalar_from_chunked_array(c_chunked_array, irow)
            yield pyarrow_wrap_scalar(c_scalar)
```
**Python code**
```python
import pyarrow as pa
import pandas as pd
import example
df = pd.DataFrame({
    'date': pd.date_range(start='2020-01-01 00:00:00', periods=3,
                          freq='1min'),
    'name': ['jack', 'tim', 'frank'],
    'age': [32, 25, 65],
    'weight': [66.46, 84.11, 71.52]
})
table = pa.Table.from_pandas(df)
# Print entire Table
print(table)
print()
# Print Table row-by-row
for i, scalar in enumerate(example.iterate_table(table)):
    if i % table.num_columns == 0:
        print()
    print(scalar, end=', ')
print()
```
**VERSION 2: Python API**
```python
import pyarrow as pa
import pandas as pd
df = pd.DataFrame({
    'date': pd.date_range(start='2020-01-01 00:00:00', periods=3,
                          freq='1min'),
    'name': ['jack', 'tim', 'frank'],
    'age': [32, 25, 65],
    'weight': [66.46, 84.11, 71.52]
})
table = pa.Table.from_pandas(df)
# Print entire Table
print(table)
print()
# Print Table row-by-row
for irow in range(table.num_rows):
    for chunked_array in table.itercolumns():
        scalar = chunked_array[irow]
        print(scalar, end=', ')
    print()
```