AlenkaF opened a new pull request, #14804: URL: https://github.com/apache/arrow/pull/14804
This PR is a continuation of https://github.com/apache/arrow/pull/14613, which I will close due to issues with building PyArrow locally.

This PR includes:

## Producing a `__dataframe__` object

See the description and examples in the linked PR.

## Consuming a `__dataframe__` object

- [ ] Define column convert methods for different dtypes (int, float, string, boolean, datetime, dictionary)
- [ ] Implement `from_dataframe` method

Example:

<details>

```python
import pandas as pd
import numpy as np
import math
from datetime import datetime as dt

arr = ["Mon", "Tue", "Mon", "Wed", "Mon", "Thu", "Fri", "Sat", None]
df = pd.DataFrame(
    {"weekday": arr}
)
df = df.astype("category")
dfX = df.__dataframe__()
colX = dfX.get_column(0)

from pyarrow.interchange.from_dataframe import categorical_column_to_dictionary
categorical_column_to_dictionary(colX)
# (<pyarrow.lib.DictionaryArray object at 0x11f6c69e0>
#
# -- dictionary:
#   [
#     "Fri",
#     "Mon",
#     "Sat",
#     "Thu",
#     "Tue",
#     "Wed"
#   ]
# -- indices:
#   [
#     1,
#     4,
#     1,
#     5,
#     1,
#     3,
#     0,
#     2,
#     null
#   ], {'data': (PandasBuffer({'bufsize': 9, 'ptr': 4981046848, 'device': 'CPU'}),
#               (<DtypeKind.INT: 0>, 8, 'c', '|')),
#       'validity': None, 'offsets': None})

df = pd.DataFrame(
    {"x": [True, True, False],
     "y": [1, 2, 0],
     "z": [9.2, 10.5, math.nan],
     "Y": ["a", "b", None],
     "dt": [None, dt(2007, 7, 14), dt(2007, 7, 15)]}
)
dfX = df.__dataframe__()

colX = dfX.get_column(0)
from pyarrow.interchange.from_dataframe import bool_8_column_to_array
bool_8_column_to_array(colX)
# (<pyarrow.lib.BooleanArray object at 0x11f656fa0>
# [
#   true,
#   true,
#   false
# ], {'data': (PandasBuffer({'bufsize': 3, 'ptr': 5518890672, 'device': 'CPU'}),
#             (<DtypeKind.BOOL: 20>, 8, 'b', '|')),
#     'validity': None, 'offsets': None})

colX = dfX.get_column(1)
from pyarrow.interchange.from_dataframe import column_to_array
column_to_array(colX)
# (<pyarrow.lib.Int64Array object at 0x11f683640>
# [
#   1,
#   2,
#   0
# ], {'data': (PandasBuffer({'bufsize': 24, 'ptr': 5519025920, 'device': 'CPU'}),
#             (<DtypeKind.INT: 0>, 64, 'l', '=')),
#     'validity': None, 'offsets': None})

colX = dfX.get_column(2)
from pyarrow.interchange.from_dataframe import column_to_array
column_to_array(colX)
# (<pyarrow.lib.DoubleArray object at 0x11f656d00>
# [
#   9.2,
#   10.5,
#   nan
# ], {'data': (PandasBuffer({'bufsize': 24, 'ptr': 5249481744, 'device': 'CPU'}),
#             (<DtypeKind.FLOAT: 2>, 64, 'g', '=')),
#     'validity': None, 'offsets': None})

colX = dfX.get_column(3)
from pyarrow.interchange.from_dataframe import column_to_array
column_to_array(colX)
# (<pyarrow.lib.StringArray object at 0x11f6dc580>
# [
#   "a",
#   "b",
#   null
# ], {'data': (PandasBuffer({'bufsize': 2, 'ptr': 4792270768, 'device': 'CPU'}),
#             (<DtypeKind.STRING: 21>, 8, 'u', '=')),
#     'validity': (PandasBuffer({'bufsize': 3, 'ptr': 5518765136, 'device': 'CPU'}),
#                  (<DtypeKind.BOOL: 20>, 8, 'b', '=')),
#     'offsets': (PandasBuffer({'bufsize': 32, 'ptr': 5518864576, 'device': 'CPU'}),
#                 (<DtypeKind.INT: 0>, 64, 'l', '='))})

colX = dfX.get_column(4)
from pyarrow.interchange.from_dataframe import datetime_column_to_array
datetime_column_to_array(colX)
# <pyarrow.lib.TimestampArray object at 0x11f656fa0>
# [
#   null,
#   2007-07-14 00:00:00.000000,
#   2007-07-15 00:00:00.000000
# ]
```

</details>
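
For reviewers less familiar with dictionary encoding, the `dictionary` / `indices` pair printed for the `weekday` column above can be reproduced in plain Python. This is only an illustrative sketch: `dictionary_encode` and `dictionary_decode` are hypothetical helpers and are not part of PyArrow or of this PR, and the lexicographically sorted category order is an assumption chosen to mirror the output shown in the example.

```python
from typing import List, Optional, Sequence, Tuple


def dictionary_encode(
    values: Sequence[Optional[str]],
) -> Tuple[List[str], List[Optional[int]]]:
    """Hypothetical helper: split values into sorted categories and integer codes.

    Missing values (None) get no category and are encoded as a None index.
    """
    # Assumption: categories are the unique non-null values in sorted order.
    categories = sorted({v for v in values if v is not None})
    position = {category: i for i, category in enumerate(categories)}
    indices = [None if v is None else position[v] for v in values]
    return categories, indices


def dictionary_decode(
    categories: Sequence[str], indices: Sequence[Optional[int]]
) -> List[Optional[str]]:
    """Hypothetical helper: map each index back to its category, keeping nulls."""
    return [None if i is None else categories[i] for i in indices]


arr = ["Mon", "Tue", "Mon", "Wed", "Mon", "Thu", "Fri", "Sat", None]
dictionary, indices = dictionary_encode(arr)
print(dictionary)
print(indices)
```

Decoding with `dictionary_decode(dictionary, indices)` returns the original list, which is the round trip a categorical column conversion has to preserve.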
