[ https://issues.apache.org/jira/browse/ARROW-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-1382:
--------------------------------
    Fix Version/s:     (was: 0.10.0)
                   0.11.0

> [Python] Deduplicate non-scalar Python objects when using pyarrow.serialize
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-1382
>                 URL: https://issues.apache.org/jira/browse/ARROW-1382
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Robert Nishihara
>            Priority: Major
>             Fix For: 0.11.0
>
>
> If a Python object appears multiple times within a list/tuple/dictionary,
> then when pyarrow serializes the object, it will duplicate the object many
> times. This leads to a potentially huge expansion in the size of the object
> (e.g., the serialized version of {{100 * [np.zeros(10 ** 6)]}} will be 100
> times bigger than it needs to be).
> {code}
> import pyarrow as pa
> l = [0]
> original_object = [l, l]
> # Serialize and deserialize the object.
> buf = pa.serialize(original_object).to_buffer()
> new_object = pa.deserialize(buf)
> # This works.
> assert original_object[0] is original_object[1]
> # This fails.
> assert new_object[0] is new_object[1]
> {code}
> One potential way to address this is to use the Arrow dictionary encoding.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
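
The dictionary-encoding idea mentioned at the end of the issue can be illustrated with a small pure-Python sketch. This is not pyarrow's implementation and does not call any pyarrow API; `dedup_encode` and `dedup_decode` are hypothetical helper names introduced only for illustration. The sketch assumes a flat list and deduplicates by object identity: each distinct object is stored once, and the list itself becomes a sequence of integer indices into that table.

```python
def dedup_encode(items):
    """Split a list into (unique_objects, indices), keyed by object identity.

    Hypothetical helper: objects are compared with id(), not ==, so two
    equal but distinct lists are stored separately.
    """
    unique = []        # each distinct object, stored exactly once
    index_by_id = {}   # id(obj) -> position in `unique`
    indices = []       # one entry per element of `items`
    for obj in items:
        key = id(obj)
        if key not in index_by_id:
            index_by_id[key] = len(unique)
            unique.append(obj)
        indices.append(index_by_id[key])
    return unique, indices


def dedup_decode(unique, indices):
    """Rebuild the list; repeated indices yield the very same object."""
    return [unique[i] for i in indices]


l = [0]
original_object = [l, l, [0]]   # the third element is equal to l but distinct

unique, indices = dedup_encode(original_object)
new_object = dedup_decode(unique, indices)

# Shared references survive the round trip, and only two objects are stored.
assert indices == [0, 0, 1]
assert len(unique) == 2
assert new_object[0] is new_object[1]
assert new_object[0] is not new_object[2]
```

A real serializer would also have to recurse into nested containers and write the `unique` table into the output buffer; the sketch only shows why an index table avoids the 100x blow-up described in the issue.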