Robert Nishihara created ARROW-1382:
---------------------------------------

Summary: Python objects containing multiple copies of the same object are serialized incorrectly
Key: ARROW-1382
URL: https://issues.apache.org/jira/browse/ARROW-1382
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Robert Nishihara

If a Python object appears multiple times within a list/tuple/dictionary, then when pyarrow serializes the container, it writes a separate copy of the object for each occurrence. This leads to a potentially huge expansion in the size of the serialized data (e.g., the serialized version of {{100 * [np.zeros(10 ** 6)]}} will be 100 times bigger than it needs to be). It also means that the deserialized copies are no longer the same object.

{code}
import pyarrow as pa

l = [0]
original_object = [l, l]

# Serialize and deserialize the object.
buf = pa.serialize(original_object).to_buffer()
new_object = pa.deserialize(buf)

# This works: both elements are the same object.
assert original_object[0] is original_object[1]

# This fails: the two elements were deserialized as separate copies.
assert new_object[0] is new_object[1]
{code}

One potential way to address this is to use the Arrow dictionary encoding.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
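
For illustration, here is a minimal pure-Python sketch of the general idea behind the suggested fix: store each distinct object once and refer to it by index, which is similar in spirit to dictionary encoding. It does not use pyarrow and is not how pyarrow implements serialization. The helper names {{encode}} and {{decode}} are hypothetical, and the sketch only handles nested lists.

```python
def encode(root):
    """Flatten nested lists into a table holding each distinct list once."""
    table = []  # one entry per distinct list object
    seen = {}   # id(list) -> index of that list in `table`

    def visit(obj):
        if not isinstance(obj, list):
            return ("value", obj)
        index = seen.get(id(obj))
        if index is None:
            index = len(table)
            seen[id(obj)] = index
            table.append(None)  # reserve the slot before recursing
            table[index] = [visit(item) for item in obj]
        return ("ref", index)

    return visit(root), table


def decode(root, table):
    """Rebuild the lists so that repeated references share one object."""
    rebuilt = [[] for _ in table]  # one fresh list per table entry

    def resolve(entry):
        kind, payload = entry
        return rebuilt[payload] if kind == "ref" else payload

    for index, items in enumerate(table):
        rebuilt[index].extend(resolve(item) for item in items)
    return resolve(root)


l = [0]
original_object = [l, l]

root, table = encode(original_object)
new_object = decode(root, table)

# The shared list is stored once (plus one entry for the outer list),
# and identity is preserved after the round trip.
assert len(table) == 2
assert new_object == original_object
assert new_object[0] is new_object[1]
```

Keying the lookup on {{id(obj)}} rather than on value means that only references to the very same object are merged; equal but distinct objects stay distinct after the round trip.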