edponce commented on issue #11559:
URL: https://github.com/apache/arrow/issues/11559#issuecomment-954659954


   Recall that Arrow's format is column-oriented, so accessing data row by row
is not an efficient operation. I do not expect a meaningful difference between
the Python and Cython APIs when iterating across rows (you would need to
measure to be sure). Here are some code snippets using both APIs. Also, keep in
mind that the Cython API is smaller and thus more limited than the Python one.
   
   _Note_: The following examples assume there are no NULL values. You would 
need to add those checks, so consider them as templates.
   
   **VERSION 1: Cython/Python**
   **Cython code**
   ```python
   from pyarrow.lib cimport *
   
   # Helper function to extract a Scalar object from a column (CChunkedArray)
   cdef shared_ptr[CScalar] get_scalar_from_chunked_array(
           shared_ptr[CChunkedArray] c_chunked_array, int index):
       cdef:
            shared_ptr[CArray] c_array
            CResult[shared_ptr[CScalar]] result
   
       # Iterate through chunks/rows until finding the corresponding index
       chunked_array = c_chunked_array.get()
       for ichunk in range(chunked_array.num_chunks()):
           c_array = chunked_array.chunk(ichunk)
           array = c_array.get()
   
           if index < array.length():
               result = array.GetScalar(index)
               # NOTE: GetResultValue is exposed to Cython directly from here
               # https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/common.h#L63
               return GetResultValue(result)
   
           # Update index relative to next chunk
           index = index - array.length()
   
   
   def iterate_table(obj):
       cdef:
            shared_ptr[CTable] c_table = pyarrow_unwrap_table(obj)
            shared_ptr[CChunkedArray] c_chunked_array
            shared_ptr[CScalar] c_scalar
   
       table = c_table.get()
       if table == NULL:
           raise TypeError("not a table")
   
       # Arrow is column-oriented, but to yield values row by row the outer
       # loop runs over rows and the inner loop over columns
       for irow in range(table.num_rows()):
           for icol in range(table.num_columns()):
               c_chunked_array = table.column(icol)
               c_scalar = get_scalar_from_chunked_array(c_chunked_array, irow)
               yield pyarrow_wrap_scalar(c_scalar)
   ```
   **Python code**
   ```python
   import pyarrow as pa
   import pandas as pd
   import example
   
   df = pd.DataFrame({
       'date': pd.date_range(start='2020-01-01 00:00:00', periods=3,
                             freq='1min'),
       'name': ['jack', 'tim', 'frank'],
       'age': [32, 25, 65],
       'weight': [66.46, 84.11, 71.52]
   })
   
   table = pa.Table.from_pandas(df)
   
   # Print entire Table
   print(table)
   print()
   
   # Print Table row-by-row
   for i, scalar in enumerate(example.iterate_table(table)):
       if i % table.num_columns == 0:
           print()
       print(scalar, end=', ')
   print()
   ```
   
   **VERSION 2: Python API**
   ```python
   import pyarrow as pa
   import pandas as pd
   
   df = pd.DataFrame({
       'date': pd.date_range(start='2020-01-01 00:00:00', periods=3,
                             freq='1min'),
       'name': ['jack', 'tim', 'frank'],
       'age': [32, 25, 65],
       'weight': [66.46, 84.11, 71.52]
   })
   
   table = pa.Table.from_pandas(df)
   
   # Print entire Table
   print(table)
   print()
   
   # Print Table row-by-row
   for irow in range(table.num_rows):
       for chunked_array in table.itercolumns():
           scalar = chunked_array[irow]
           print(scalar, end=', ')
       print()
   ```

