Re: [PR] feat(python): Add column-wise buffer builder [arrow-nanoarrow]

via GitHub Thu, 16 May 2024 08:30:18 -0700


paleolimbot commented on code in PR #464:
URL: https://github.com/apache/arrow-nanoarrow/pull/464#discussion_r1603601869



##########
python/src/nanoarrow/visitor.py:
##########
@@ -15,68 +15,186 @@
 # specific language governing permissions and limitations
 # under the License.
 
-from typing import Any, List, Sequence, Tuple, Union
+from typing import Any, Callable, List, Sequence, Tuple, Union
 
-from nanoarrow._lib import CArrayView
+from nanoarrow._lib import CArrayView, CArrowType, CBuffer, CBufferBuilder
 from nanoarrow.c_array_stream import c_array_stream
+from nanoarrow.c_schema import c_schema_view
 from nanoarrow.iterator import ArrayViewBaseIterator, PyIterator
 from nanoarrow.schema import Type
 
 
-def to_pylist(obj, schema=None) -> List:
-    """Convert ``obj`` to a ``list()` of Python objects
+class ArrayViewVisitable:
+    """Mixin class providing conversion methods based on visitors
+
+    Can be used with classes that implement ``__arrow_c_stream__()``
+    or ``__arrow_c_array__()``.
+    """
+
+    def to_pylist(self) -> List:
+        """Convert to a ``list()`` of Python objects
+
+        Computes an identical value to ``list(iter_py())`` but can be much
+        faster.
+
+        Examples
+        --------
+
+        >>> import nanoarrow as na
+        >>> from nanoarrow import visitor
+        >>> array = na.Array([1, 2, 3], na.int32())
+        >>> array.to_pylist()
+        [1, 2, 3]
+        """
+        return ListBuilder.visit(self)
+
+    def to_column_list(self, handle_nulls=None) -> Tuple[List[str], List[Sequence]]:
+        """Convert to a ``list()` of contiguous sequences
+
+        Converts a stream of struct arrays into its column-wise representation
+        according to :meth:`to_column`.
+
+        Paramters
+        ---------
+        handle_nulls : callable
+            A function returning a sequence based on a validity bytemap and a
+            contiguous buffer of values (e.g., the callable returned by
+            :meth:`nulls_as_sentinel`).
+
+        Examples
+        --------
+
+        >>> import nanoarrow as na
+        >>> import pyarrow as pa
+        >>> batch = pa.record_batch({"col1": [1, 2, 3], "col2": ["a", "b", "c"]})
+        >>> names, columns = na.Array(batch).to_column_list()
+        >>> names
+        ['col1', 'col2']
+        >>> columns
+        [nanoarrow.c_lib.CBuffer(int64[24 b] 1 2 3), ['a', 'b', 'c']]
+        """
+        return ColumnsBuilder.visit(self, handle_nulls=handle_nulls)
+
+    def to_column(self, handle_nulls=None) -> Sequence:
+        """Convert to a contiguous sequence
+
+        Converts a stream of arrays into a columnar representation
+        such that each column is either a contiguous buffer or a ``list()``.
+        Integer, float, and interval arrays are currently converted to their

Review Comment:
   > What is the reason interval arrays are returned as buffer? (just because there is no obvious python object?)
   
   The technical answer is that `schema_view.buffer_format` returns `"iiq"` (i.e., not `None`). I never really looked at what pyarrow does here, but I see now that it returns a named tuple if converted to a list. The last time I tried `to_numpy()` on a pyarrow interval I got a crash (https://github.com/apache/arrow/issues/41326). I'm pretty happy to do anything here (or make a breaking change later to handle it properly).
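For readers unfamiliar with the `"iiq"` format string, here is a minimal stdlib sketch of what such a buffer holds. It assumes the format describes month/day/nanosecond intervals (two 32-bit integers followed by one 64-bit integer per element) and little-endian byte order; `unpack_intervals` is a hypothetical helper for illustration, not part of nanoarrow.

```python
import struct

# Assumption: "iiq" describes one month/day/nanosecond interval element:
# int32 months, int32 days, int64 nanoseconds (little-endian, no padding).
INTERVAL_FORMAT = "<iiq"


def unpack_intervals(raw: bytes):
    """Hypothetical helper: decode a contiguous interval buffer into tuples."""
    return list(struct.iter_unpack(INTERVAL_FORMAT, raw))


# Build a two-element buffer by hand to stand in for the values buffer.
raw = struct.pack(INTERVAL_FORMAT, 1, 15, 0) + struct.pack(INTERVAL_FORMAT, 0, 2, 500)

intervals = unpack_intervals(raw)
print(struct.calcsize(INTERVAL_FORMAT))  # 16 bytes per element
print(intervals)  # [(1, 15, 0), (0, 2, 500)]
```

Because every element has a fixed 16-byte layout, the values can be exposed as a buffer without building Python objects, which is consistent with the behaviour described above.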
   
   > And why are other primitive fixed-width types like timestamp not returned as buffers?
   
   If just the storage were returned it would be lossy (e.g., `pd.Series()` would do the wrong thing because the unit and timezone are not part of the buffer). There's no way to communicate what to do here without invoking numpy- or pandas-specific logic, so a list of Python objects is maybe a safer default.
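A small stdlib sketch of why bare storage is lossy for timestamps: the same int64 only means something once a unit (and timezone) is supplied. The integer is the storage value shown in the example output of this comment; `decode_timestamp` is a hypothetical helper, and treating the value as UTC is an assumption made purely for illustration.

```python
from datetime import datetime, timedelta, timezone

# The raw int64 storage value from the example output in this comment.
raw = 1715862463542238

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)  # UTC is an assumption here
PER_UNIT = {
    "s": timedelta(seconds=1),
    "ms": timedelta(milliseconds=1),
    "us": timedelta(microseconds=1),
}


def decode_timestamp(value: int, unit: str) -> datetime:
    """Hypothetical helper: interpret an int64 as time elapsed since the epoch."""
    return EPOCH + value * PER_UNIT[unit]


# With the right unit (microseconds) the value is a sensible point in time.
print(decode_timestamp(raw, "us"))  # 2024-05-16 12:27:43.542238+00:00

# With the wrong unit the same integer is not even a representable datetime.
try:
    decode_timestamp(raw, "ms")
    wrong_unit_failed = False
except OverflowError:
    wrong_unit_failed = True
print(wrong_unit_failed)  # True
```

A consumer that receives only the integers has no way to pick the right branch, which is the motivation for defaulting to Python objects.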
   
   > Should we make this user configurable?
   
   Totally (and also make Python object conversion configurable), but I'm not sure exactly how to do that yet. If somebody wanted to do this *today*, I'd suggest subclassing the visitor:
   
   ```python
   import nanoarrow as na
   import pyarrow as pa
   from datetime import datetime
   from nanoarrow.visitor import ColumnsBuilder, NullableColumnBuilder
   
   
   class CustomColumnsBuilder(ColumnsBuilder):
       # Return timestamp columns as their int64 storage buffer instead of
       # a list of Python objects; defer to the default for everything else.
       def _resolve_child_visitor(self, child_schema, child_array_view, handle_nulls):
           if na.Schema(child_schema).type == na.Type.TIMESTAMP:
               return NullableColumnBuilder(na.int64(), handle_nulls=handle_nulls)
           else:
               return super()._resolve_child_visitor(
                   child_schema, child_array_view, handle_nulls
               )
   
   batch = pa.record_batch({"ts": [datetime.now()]})
   CustomColumnsBuilder.visit(batch)
   #> (['ts'], [nanoarrow.c_lib.CBuffer(int64[8 b] 1715862463542238)])
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
