AlenkaF commented on code in PR #14804:
URL: https://github.com/apache/arrow/pull/14804#discussion_r1049767410


##########
python/pyarrow/interchange/dataframe.py:
##########
@@ -0,0 +1,190 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from __future__ import annotations
+from typing import (
+    Any,
+    Iterable,
+    Optional,
+    Sequence,
+)
+
+import pyarrow as pa
+
+from pyarrow.interchange.column import _PyArrowColumn
+
+
+class _PyArrowDataFrame:
+    """
+    A data frame class, with only the methods required by the interchange
+    protocol defined.
+    A "data frame" represents an ordered collection of named columns.
+    A column's "name" must be a unique string.
+    Columns may be accessed by name or by position.
+    This could be a public data frame class, or an object with the methods and
+    attributes defined on this DataFrame class could be returned from the
+    ``__dataframe__`` method of a public data frame class in a library adhering
+    to the dataframe interchange protocol specification.
+    """
+
+    def __init__(
+        self, df: pa.Table, nan_as_null: bool = False, allow_copy: bool = True
+    ) -> None:
+        """
+        Constructor - an instance of this (private) class is returned from
+        `pa.Table.__dataframe__`.
+        """
+        self._df = df
+        # ``nan_as_null`` is a keyword intended for the consumer to tell the
+        # producer to overwrite null values in the data with ``NaN`` (or
+        # ``NaT``). This currently has no effect; once support for nullable
+        # extension dtypes is added, this value should be propagated to
+        # columns.
+        self._nan_as_null = nan_as_null
+        self._allow_copy = allow_copy
+
+    def __dataframe__(
+        self, nan_as_null: bool = False, allow_copy: bool = True
+    ) -> _PyArrowDataFrame:
+        """
+        Construct a new exchange object, potentially changing the parameters.
+        ``nan_as_null`` is a keyword intended for the consumer to tell the
+        producer to overwrite null values in the data with ``NaN``.
+        It is intended for cases where the consumer does not support the bit
+        mask or byte mask that is the producer's native representation.
+        ``allow_copy`` is a keyword that defines whether or not the library is
+        allowed to make a copy of the data. For example, copying data would be
+        necessary if a library supports strided buffers, given that this
+        protocol specifies contiguous buffers.
+        """
+        return _PyArrowDataFrame(self._df, nan_as_null, allow_copy)
+
+    @property
+    def metadata(self) -> dict[str, Any]:
+        """
+        The metadata for the data frame, as a dictionary with string keys. The
+        contents of ``metadata`` may be anything; they are meant for a library
+        to store information that it needs to, e.g., roundtrip losslessly or
+        for two implementations to share data that is not (yet) part of the
+        interchange protocol specification. To avoid collisions with other
+        entries, please name the keys with the name of the library
+        followed by a period and the desired name, e.g., ``pandas.indexcol``.
+        """
+        # The metadata for the data frame, as a dictionary with string keys.
+        # Add schema metadata here (pandas metadata or custom metadata)
+        if self._df.schema.metadata:
+            schema_metadata = {"pyarrow." + k.decode('utf8'): v.decode('utf8')
+                               for k, v in self._df.schema.metadata.items()}
+            return schema_metadata
+        else:
+            return {}
+
+    def num_columns(self) -> int:
+        """
+        Return the number of columns in the DataFrame.
+        """
+        return self._df.num_columns
+
+    def num_rows(self) -> int:
+        """
+        Return the number of rows in the DataFrame, if available.
+        """
+        return self._df.num_rows
+
+    def num_chunks(self) -> int:
+        """
+        Return the number of chunks the DataFrame consists of.
+        """
+        return self._df.column(0).num_chunks
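
A quick sketch of how the `metadata` property above behaves (assuming this PR's branch; the key `my_key` is purely illustrative): schema metadata is decoded and prefixed with `pyarrow.`.

```python
>>> import pyarrow as pa
>>> table = pa.table({"a": [1, 2, 3]})
>>> # hypothetical key/value pair stored as schema metadata
>>> table = table.replace_schema_metadata({"my_key": "my_value"})
>>> table.__dataframe__().metadata
{'pyarrow.my_key': 'my_value'}
```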

Review Comment:
   > With the current implementation, this is always 1 ?
   
   No, that shouldn't be the case. Within the dataframe interchange protocol class, a `Column` always gets combined into one chunk. But here we are calling the PyArrow `ChunkedArray` directly, which is not part of the dataframe interchange protocol implementation, so we will always get the correct number of chunks.
   
   ```python
   >>> table = pa.table([pa.chunked_array([[2, 2, 4], [4, 5, 100]])],
   ...                  names=["Chunked"])
   >>> table
   pyarrow.Table
   Chunked: int64
   ----
   Chunked: [[2,2,4],[4,5,100]]
   >>> table_interchange = table.__dataframe__()
   >>> table.column(0).num_chunks
   2
   >>> table_interchange.get_column(0).num_chunks()
   1
   ```
   
   I will test that when adding the `PyArrow <-> PyArrow` roundtrips.
   
   > (it's also not super clear what the "number of chunks" for a DataFrame actually is, if each column can have a different number of chunks ..)
   
   Oh, I hadn't thought of that. Looking at the design concept, I think the protocol implies that a dataframe and its columns are chunked in the same way, see
   https://data-apis.org/dataframe-protocol/latest/design_requirements.html?highlight=validity#conceptual-model-of-a-dataframe
   
   So I took the number of chunks of the first column as the number of chunks for all the other columns.
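
   A small sketch of that behavior (building on the snippet above; the column names `a` and `b` are purely illustrative): the DataFrame-level `num_chunks()` reports the first column's chunk count even when the columns are chunked differently.
   
   ```python
   >>> # hypothetical table whose columns have different chunk layouts
   >>> table = pa.table(
   ...     [pa.chunked_array([[1], [2, 3]]), pa.chunked_array([[4, 5, 6]])],
   ...     names=["a", "b"])
   >>> table.column(0).num_chunks
   2
   >>> table.column(1).num_chunks
   1
   >>> # num_chunks() uses the chunk count of the first column
   >>> table.__dataframe__().num_chunks()
   2
   ```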


