Re: [PR] GH-39010: [Python] Introduce `maps_as_pydicts` parameter for `to_pylist`, `to_pydict`, `as_py` [arrow]

via GitHub Mon, 17 Feb 2025 04:54:11 -0800


pitrou commented on code in PR #45471:
URL: https://github.com/apache/arrow/pull/45471#discussion_r1958168024



##########
python/pyarrow/array.pxi:
##########
@@ -2286,12 +2302,28 @@ cdef class MonthDayNanoIntervalArray(Array):
     Concrete class for Arrow arrays of interval[MonthDayNano] type.
     """
 
-    def to_pylist(self):
+    def to_pylist(self, *, maps_as_pydicts=None):
         """
         Convert to a list of native Python objects.
 
         pyarrow.MonthDayNano is used as the native representation.
 
+        Parameters
+        ----------
+        maps_as_pydicts : str, optional, default `None`
+            Valid values are `None`, 'lossy', or 'strict'.

Review Comment:
   This is the docstring for `MonthDayNanoIntervalArray`, so `maps_as_pydicts` 
would simply be ignored here. How about just stating that?



##########
python/pyarrow/table.pxi:
##########
@@ -1349,10 +1349,26 @@ cdef class ChunkedArray(_PandasConvertible):
         for i in range(self.num_chunks):
             yield self.chunk(i)
 
-    def to_pylist(self):
+    def to_pylist(self, *, maps_as_pydicts=None):
         """
         Convert to a list of native Python objects.
 
+        Parameters
+        ----------
+        maps_as_pydicts : str, optional, default `None`
+            Valid values are `None`, 'lossy', or 'strict'.
+            The default behavior (`None`), is to convert Arrow Map arrays to
+            Python association lists (list-of-tuples) in the same order as the
+            Arrow Map, as in [(key1, value1), (key2, value2), ...].
+
+            If 'lossy' or 'strict', convert Arrow Map arrays to native Python 
dicts.
+            This can change the ordering of (key, value) pairs, and will

Review Comment:
   Same comment re: ordering



##########
python/pyarrow/scalar.pxi:
##########
@@ -880,12 +1348,49 @@ cdef class MapScalar(ListScalar):
         for k, v in zip(arr.field(self.type.key_field.name), 
arr.field(self.type.item_field.name)):
             yield (k.as_py(), v.as_py())
 
-    def as_py(self):
+    def as_py(self, *, maps_as_pydicts=None):
         """
-        Return this value as a Python list.
+        Return this value as a Python list or dict, depending on 
'maps_as_pydicts'.
+
+        Parameters
+        ----------
+        maps_as_pydicts : str, optional, default `None`
+            Valid values are `None`, 'lossy', or 'strict'.
+            The default behavior (`None`), is to convert Arrow Map arrays to
+            Python association lists (list-of-tuples) in the same order as the
+            Arrow Map, as in [(key1, value1), (key2, value2), ...].
+
+            If 'lossy' or 'strict', convert Arrow Map arrays to native Python 
dicts.
+            This can change the ordering of (key, value) pairs, and will
+            deduplicate multiple keys, resulting in a possible loss of data.
+
+            If 'lossy', this key deduplication results in a warning printed
+            when detected. If 'strict', this instead results in an exception
+            being raised when detected.
         """
-        cdef CStructScalar* sp = <CStructScalar*> self.wrapped.get()
-        return list(self) if sp.is_valid else None
+        if maps_as_pydicts not in (None, "lossy", "strict"):
+            raise ValueError(
+                "Invalid value for 'maps_as_pydicts': "
+                + "valid values are 'lossy', 'strict' or `None` (default). "
+                + f"Received '{maps_as_pydicts}'."
+            )
+        if not self.is_valid:
+            return None
+        if not maps_as_pydicts:
+            return list(self)
+        result_dict = {}
+        for key, value in self:
+            if key in result_dict:
+                if maps_as_pydicts == "strict":
+                    raise ValueError(

Review Comment:
   I would make this a `KeyError`. Also, the message should perhaps contain the 
duplicate key?



##########
python/pyarrow/tests/test_scalars.py:
##########
@@ -786,6 +786,21 @@ def test_map(pickle_module):
     restored = pickle_module.loads(pickle_module.dumps(s))
     assert restored.equals(s)
 
+    assert s.as_py(maps_as_pydicts="strict") == {'a': 1, 'b': 2}
+
+
+def test_map_duplicate_fields():
+    ty = pa.map_(pa.string(), pa.int8())
+    v = [('a', 1), ('a', 2)]
+    s = pa.scalar(v, type=ty)
+
+    assert s.as_py(maps_as_pydicts=None) == v
+
+    with pytest.raises(ValueError):
+        assert s.as_py(maps_as_pydicts="strict")
+
+    assert s.as_py(maps_as_pydicts="lossy") == {'a': 2}

Review Comment:
   Can we check that a warning is actually emitted? See 
[`pytest.warns`](https://docs.pytest.org/en/stable/how-to/capture-warnings.html#warns)



##########
python/pyarrow/scalar.pxi:
##########
@@ -148,7 +149,26 @@ cdef class Scalar(_Weakrefable):
     def __reduce__(self):
         return scalar, (self.as_py(), self.type)
 
-    def as_py(self):
+    def as_py(self, *, maps_as_pydicts=None):
+        """
+        Return this value as a Python representation.
+
+        Parameters
+        ----------
+        maps_as_pydicts : str, optional, default `None`
+            Valid values are `None`, 'lossy', or 'strict'.

Review Comment:
   Like the `MonthDayNanoIntervalArray` comment, we don't need to explain 
`maps_as_pydicts` for non-nested Scalar subclasses. We can just mention it's 
ignored here.



##########
python/pyarrow/array.pxi:
##########
@@ -1651,16 +1651,32 @@ cdef class Array(_PandasConvertible):
             array = array.copy()
         return array
 
-    def to_pylist(self):
+    def to_pylist(self, *, maps_as_pydicts=None):
         """
         Convert to a list of native Python objects.
 
+        Parameters
+        ----------
+        maps_as_pydicts : str, optional, default `None`
+            Valid values are `None`, 'lossy', or 'strict'.
+            The default behavior (`None`), is to convert Arrow Map arrays to
+            Python association lists (list-of-tuples) in the same order as the
+            Arrow Map, as in [(key1, value1), (key2, value2), ...].
+
+            If 'lossy' or 'strict', convert Arrow Map arrays to native Python 
dicts.
+            This can change the ordering of (key, value) pairs, and will
+            deduplicate multiple keys, resulting in a possible loss of data.

Review Comment:
   I think the ordering comment is obsolete, as Python dicts are ordered 
nowadays. Unless the underlying implementation does something weird, ordering 
should therefore be preserved.
   



##########
python/pyarrow/scalar.pxi:
##########
@@ -880,12 +1348,49 @@ cdef class MapScalar(ListScalar):
         for k, v in zip(arr.field(self.type.key_field.name), 
arr.field(self.type.item_field.name)):
             yield (k.as_py(), v.as_py())
 
-    def as_py(self):
+    def as_py(self, *, maps_as_pydicts=None):
         """
-        Return this value as a Python list.
+        Return this value as a Python list or dict, depending on 
'maps_as_pydicts'.
+
+        Parameters
+        ----------
+        maps_as_pydicts : str, optional, default `None`
+            Valid values are `None`, 'lossy', or 'strict'.
+            The default behavior (`None`), is to convert Arrow Map arrays to
+            Python association lists (list-of-tuples) in the same order as the
+            Arrow Map, as in [(key1, value1), (key2, value2), ...].
+
+            If 'lossy' or 'strict', convert Arrow Map arrays to native Python 
dicts.
+            This can change the ordering of (key, value) pairs, and will
+            deduplicate multiple keys, resulting in a possible loss of data.
+
+            If 'lossy', this key deduplication results in a warning printed
+            when detected. If 'strict', this instead results in an exception
+            being raised when detected.
         """
-        cdef CStructScalar* sp = <CStructScalar*> self.wrapped.get()
-        return list(self) if sp.is_valid else None
+        if maps_as_pydicts not in (None, "lossy", "strict"):
+            raise ValueError(
+                "Invalid value for 'maps_as_pydicts': "
+                + "valid values are 'lossy', 'strict' or `None` (default). "
+                + f"Received '{maps_as_pydicts}'."

Review Comment:
   Nit: it may be more idiomatic to use the repr here
   ```suggestion
                   + f"Received {maps_as_pydicts!r}."
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Re: [PR] GH-39010: [Python] Introduce `maps_as_pydicts` parameter for `to_pylist`, `to_pydict`, `as_py` [arrow]

Reply via email to