[GitHub] [arrow] AlenkaF opened a new pull request, #14804: ARROW-18152: [Python] DataFrame Interchange Protocol for pyarrow Table

GitBox Thu, 01 Dec 2022 07:00:54 -0800


AlenkaF opened a new pull request, #14804:
URL: https://github.com/apache/arrow/pull/14804


   This PR is a continuation of https://github.com/apache/arrow/pull/14613,
which I will close due to issues with building PyArrow locally. This PR
includes:
   
   ## Producing a `__dataframe__` object
   See the description and examples in the linked PR.
   
   ## Consuming a `__dataframe__` object
   
   - [ ] Define column conversion methods for different dtypes (int, float, string, boolean, datetime, dictionary)
   - [ ] Implement the `from_dataframe` method
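
A consumer with one conversion method per dtype needs some way to choose the method for each column. The sketch below shows one possible dispatch on the column's dtype kind; it is not the code in this PR. The `DtypeKind` values are those of the interchange protocol's enum, and the converter names are taken from the example below. The `CONVERTERS` table, the `pick_converter` helper, and the assignment of kinds to names are assumptions made for illustration.

```python
from enum import IntEnum


class DtypeKind(IntEnum):
    """Dtype kinds as enumerated by the dataframe interchange protocol."""
    INT = 0
    UINT = 1
    FLOAT = 2
    BOOL = 20
    STRING = 21
    DATETIME = 22
    CATEGORICAL = 23


# Hypothetical dispatch table: dtype kind -> name of the converter to call.
# The names come from the example in this PR; the mapping itself is an
# assumption, not the actual pyarrow code.
CONVERTERS = {
    DtypeKind.INT: "column_to_array",
    DtypeKind.UINT: "column_to_array",
    DtypeKind.FLOAT: "column_to_array",
    DtypeKind.STRING: "column_to_array",
    DtypeKind.BOOL: "bool_8_column_to_array",
    DtypeKind.DATETIME: "datetime_column_to_array",
    DtypeKind.CATEGORICAL: "categorical_column_to_dictionary",
}


def pick_converter(dtype):
    """Return the converter name for a protocol dtype tuple.

    `dtype` is (kind, bit_width, format_string, endianness).
    """
    kind = DtypeKind(dtype[0])  # raises ValueError for an unknown kind
    return CONVERTERS[kind]


print(pick_converter((DtypeKind.BOOL, 8, "b", "|")))
print(pick_converter((DtypeKind.INT, 64, "l", "=")))
```

A table keeps the per-dtype logic in separate functions, so support for a new kind can be added without touching the existing converters.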
   
   Example:
   
   <details>
   
   ```python
   import pandas as pd
   import numpy as np
   import math
   from datetime import datetime as dt
   
   arr = ["Mon", "Tue", "Mon", "Wed", "Mon", "Thu", "Fri", "Sat", None]
   df = pd.DataFrame(
       {"weekday": arr}
   )
   
   df = df.astype("category")
   
   dfX = df.__dataframe__()
   colX = dfX.get_column(0)
   
   from pyarrow.interchange.from_dataframe import categorical_column_to_dictionary
   categorical_column_to_dictionary(colX)
   #(<pyarrow.lib.DictionaryArray object at 0x11f6c69e0>
   #
   # -- dictionary:
   #   [
   #     "Fri",
   #     "Mon",
   #     "Sat",
   #     "Thu",
   #     "Tue",
   #     "Wed"
   #   ]
   # -- indices:
   #   [
   #     1,
   #     4,
   #     1,
   #     5,
   #     1,
   #     3,
   #     0,
   #     2,
   #     null
   #   ], {'data': (PandasBuffer({'bufsize': 9, 'ptr': 4981046848, 'device': 'CPU'}),
   #       (<DtypeKind.INT: 0>, 8, 'c', '|')), 'validity': None, 'offsets': None})
   
   df = pd.DataFrame(
       {
           "x": [True, True, False],
           "y": [1, 2, 0],
           "z": [9.2, 10.5, math.nan],
           "Y": ["a", "b", None],
           "dt": [None, dt(2007, 7, 14), dt(2007, 7, 15)],
       }
   )
   dfX = df.__dataframe__()
   
   colX = dfX.get_column(0)
   from pyarrow.interchange.from_dataframe import bool_8_column_to_array
   bool_8_column_to_array(colX)
   # (<pyarrow.lib.BooleanArray object at 0x11f656fa0>
   # [
   #   true,
   #   true,
   #   false
   # ], {'data': (PandasBuffer({'bufsize': 3, 'ptr': 5518890672, 'device': 'CPU'}),
   #     (<DtypeKind.BOOL: 20>, 8, 'b', '|')), 'validity': None, 'offsets': None})
   
   colX = dfX.get_column(1)
   from pyarrow.interchange.from_dataframe import column_to_array
   column_to_array(colX)
   # (<pyarrow.lib.Int64Array object at 0x11f683640>
   # [
   #   1,
   #   2,
   #   0
   # ], {'data': (PandasBuffer({'bufsize': 24, 'ptr': 5519025920, 'device': 'CPU'}),
   #     (<DtypeKind.INT: 0>, 64, 'l', '=')), 'validity': None, 'offsets': None})
   
   colX = dfX.get_column(2)
   from pyarrow.interchange.from_dataframe import column_to_array
   column_to_array(colX)
   # (<pyarrow.lib.DoubleArray object at 0x11f656d00>
   # [
   #   9.2,
   #   10.5,
   #   nan
   # ], {'data': (PandasBuffer({'bufsize': 24, 'ptr': 5249481744, 'device': 'CPU'}),
   #     (<DtypeKind.FLOAT: 2>, 64, 'g', '=')), 'validity': None, 'offsets': None})
   
   colX = dfX.get_column(3)
   from pyarrow.interchange.from_dataframe import column_to_array
   column_to_array(colX)
   # (<pyarrow.lib.StringArray object at 0x11f6dc580>
   # [
   #   "a",
   #   "b",
   #   null
   # ], {'data': (PandasBuffer({'bufsize': 2, 'ptr': 4792270768, 'device': 'CPU'}),
   #     (<DtypeKind.STRING: 21>, 8, 'u', '=')),
   #     'validity': (PandasBuffer({'bufsize': 3, 'ptr': 5518765136, 'device': 'CPU'}),
   #     (<DtypeKind.BOOL: 20>, 8, 'b', '=')),
   #     'offsets': (PandasBuffer({'bufsize': 32, 'ptr': 5518864576, 'device': 'CPU'}),
   #     (<DtypeKind.INT: 0>, 64, 'l', '='))})
   
   colX = dfX.get_column(4)
   from pyarrow.interchange.from_dataframe import datetime_column_to_array
   datetime_column_to_array(colX)
   # <pyarrow.lib.TimestampArray object at 0x11f656fa0>
   # [
   #   null,
   #   2007-07-14 00:00:00.000000,
   #   2007-07-15 00:00:00.000000
   # ]
   ```
   
   </details>
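
The dtype tuples printed in the output above, such as `(<DtypeKind.INT: 0>, 64, 'l', '=')`, follow the interchange protocol's layout: kind, bit width, Arrow C Data Interface format string, and endianness. The sketch below decodes such a tuple into readable text. `describe_dtype` is a hypothetical helper written for this illustration, and the format table covers only the characters that appear in the example.

```python
# Arrow C Data Interface format characters that appear in the example
# output. This is a partial subset, chosen for illustration only.
ARROW_FORMATS = {
    "b": "boolean",
    "c": "int8",
    "l": "int64",
    "g": "float64",
    "u": "utf8 string",
}

# Endianness markers used by the interchange protocol.
ENDIANNESS = {
    "=": "native",
    "<": "little-endian",
    ">": "big-endian",
    "|": "not applicable",
}


def describe_dtype(dtype):
    """Render a protocol dtype tuple as readable text (hypothetical helper)."""
    _kind, bit_width, fmt, endianness = dtype
    name = ARROW_FORMATS.get(fmt, f"unknown format {fmt!r}")
    order = ENDIANNESS.get(endianness, f"unknown byte order {endianness!r}")
    return f"{name}, {bit_width} bits, byte order: {order}"


# Kind codes as printed in the example output: 0 = INT, 2 = FLOAT.
print(describe_dtype((0, 64, "l", "=")))
print(describe_dtype((2, 64, "g", "=")))
```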

