Robert Nishihara created ARROW-1382:
---------------------------------------

             Summary: Python objects containing multiple copies of the same 
object are serialized incorrectly
                 Key: ARROW-1382
                 URL: https://issues.apache.org/jira/browse/ARROW-1382
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
            Reporter: Robert Nishihara


If the same Python object appears multiple times within a list, tuple, or 
dictionary, pyarrow serializes a separate copy of it for each occurrence. 
This can greatly inflate the size of the serialized data (e.g., the 
serialized version of {{100 * [np.zeros(10 ** 6)]}} will be 100 times bigger 
than it needs to be), and the deserialized object no longer preserves the 
shared references.

{code}
import pyarrow as pa

l = [0]
original_object = [l, l]

# Serialize and deserialize the object.
buf = pa.serialize(original_object).to_buffer()
new_object = pa.deserialize(buf)

# This passes: both elements refer to the same list.
assert original_object[0] is original_object[1]

# This fails: deserialization produced two separate copies.
assert new_object[0] is new_object[1]
{code}
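
For comparison, here is a minimal sketch of the expected behavior using the standard library's pickle module, which records each container it has already written and emits a back-reference on later occurrences. This only illustrates the desired semantics and is not a statement about how pyarrow should implement them; the variable names are illustrative.

```python
import pickle

l = [0]
original_object = [l, l]

# Round-trip through pickle; the second occurrence of `l` is written
# as a back-reference to the first one rather than as a second copy.
new_object = pickle.loads(pickle.dumps(original_object))
shared_identity_preserved = new_object[0] is new_object[1]

# Size comparison: 100 references to one list vs. 100 independent copies.
big = list(range(1000))
shared = 100 * [big]
copies = [list(big) for _ in range(100)]
shared_size = len(pickle.dumps(shared))
copies_size = len(pickle.dumps(copies))
```

Here the shared version serializes to far fewer bytes than the version with independent copies, and the round-tripped elements are the same object.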

One potential way to address this is to use the Arrow dictionary encoding.
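
The idea behind dictionary encoding can be sketched in plain Python: store each distinct object once in a table and replace every occurrence with an integer index into that table. The helpers {{encode}} and {{decode}} below are hypothetical names used only for illustration; they are not part of pyarrow, they handle a single flat list, and they detect repeats by object identity via {{id()}}.

```python
def encode(container):
    """Split a list into (table of distinct objects, list of indices)."""
    table = []      # each distinct object is stored exactly once
    index_of = {}   # id(obj) -> position of obj in `table`
    indices = []
    for obj in container:
        key = id(obj)  # stable here because `container` keeps obj alive
        if key not in index_of:
            index_of[key] = len(table)
            table.append(obj)
        indices.append(index_of[key])
    return table, indices


def decode(table, indices):
    """Rebuild the list; repeated indices yield the same object."""
    return [table[i] for i in indices]


l = [0]
original_object = [l, l]

table, indices = encode(original_object)

# Stand-in for a serialization round trip: each distinct value is
# copied once, so the restored table holds fresh objects.
restored_table = [list(item) for item in table]
new_object = decode(restored_table, indices)
```

With this layout the payload {{[0]}} is stored once, the indices are {{[0, 0]}}, and both elements of {{new_object}} are the same restored list.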



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
