[jira] [Commented] (ARROW-1784) [Python] Read and write pandas.DataFrame in pyarrow.serialize by decomposing the BlockManager rather than coercing to Arrow format

ASF GitHub Bot (JIRA) Tue, 05 Dec 2017 03:13:21 -0800

    [ https://issues.apache.org/jira/browse/ARROW-1784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278410#comment-16278410 ]


ASF GitHub Bot commented on ARROW-1784:
---------------------------------------

jreback commented on a change in pull request #1390: ARROW-1784: [Python] 
Enable zero-copy serialization, deserialization of pandas.DataFrame via 
components
URL: https://github.com/apache/arrow/pull/1390#discussion_r154838798
 
 

 ##########
 File path: python/pyarrow/pandas_compat.py
 ##########
 @@ -348,25 +349,85 @@ def get_datetimetz_type(values, dtype, type_):
 
     return values, type_
 
+# ----------------------------------------------------------------------
+# Converting pandas.DataFrame to a dict containing only NumPy arrays or other
+# objects friendly to pyarrow.serialize
 
-def make_datetimetz(tz):
+
+def dataframe_to_serialized_dict(frame):
+    block_manager = frame._data
+
+    blocks = []
+    axes = [ax for ax in block_manager.axes]
+
+    for block in block_manager.blocks:
+        values = block.values
+        block_data = {}
+
+        if isinstance(block, _int.DatetimeTZBlock):
+            block_data['timezone'] = values.tz.zone
+            values = values.values
+        elif isinstance(block, _int.CategoricalBlock):
+            block_data.update(dictionary=values.categories,
+                              ordered=values.ordered)
+            values = values.codes
+
+        block_data.update(
+            placement=block.mgr_locs.as_array,
+            block=values
+        )
+        blocks.append(block_data)
+
+    return {
+        'blocks': blocks,
+        'axes': axes
+    }
+
+
+def serialized_dict_to_dataframe(data):
+    reconstructed_blocks = [_reconstruct_block(block)
+                            for block in data['blocks']]
+
+    block_mgr = _int.BlockManager(reconstructed_blocks, data['axes'])
+    return pd.DataFrame(block_mgr)
+
+
+def _reconstruct_block(item):
+    # Construct the individual blocks converting dictionary types to pandas
+    # categorical types and Timestamps-with-timezones types to the proper
+    # pandas Blocks
+
+    block_arr = item['block']
+    placement = item['placement']
+    if 'dictionary' in item:
+        cat = pd.Categorical(block_arr,
 
 Review comment:
   This should use ``.from_codes``, as the ``fastpath=`` argument is going to be deprecated soon.
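
   For illustration, a minimal sketch of rebuilding the categorical through the public ``pd.Categorical.from_codes`` constructor. It assumes the serialized item carries the ``block``, ``dictionary`` and ``ordered`` keys written by ``dataframe_to_serialized_dict`` in the diff above; ``rebuild_categorical`` is a hypothetical helper name, not a function in the pull request.

```python
import numpy as np
import pandas as pd

# Sample categorical with a missing value (stored as code -1).
original = pd.Categorical(["a", "b", "a", None, "c"],
                          categories=["a", "b", "c"], ordered=True)

# Decompose into plain components, mirroring the CategoricalBlock branch
# of the diff: integer codes plus the categories and the ordered flag.
item = {
    "block": np.asarray(original.codes),
    "dictionary": original.categories,
    "ordered": original.ordered,
}


def rebuild_categorical(item):
    # Hypothetical helper: reconstructs the categorical from its codes
    # via the public constructor instead of the fastpath= argument.
    return pd.Categorical.from_codes(item["block"],
                                     categories=item["dictionary"],
                                     ordered=item["ordered"])


rebuilt = rebuild_categorical(item)
```

   ``from_codes`` validates the codes against the categories, so it may do slightly more work than a fast path would, but it avoids depending on a constructor argument that is slated for deprecation.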

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> [Python] Read and write pandas.DataFrame in pyarrow.serialize by decomposing 
> the BlockManager rather than coercing to Arrow format
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-1784
>                 URL: https://issues.apache.org/jira/browse/ARROW-1784
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
>              Labels: pull-request-available
>             Fix For: 0.8.0
>
>
> See discussion in https://github.com/dask/distributed/pull/931
> This will permit zero-copy reads for DataFrames not containing Python 
> objects. When a DataFrame does contain an {{ObjectBlock}}, those arrays will 
> be no worse than pickle to reconstruct on the receiving side.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
