[
https://issues.apache.org/jira/browse/ARROW-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16134303#comment-16134303
]
Robert Nishihara commented on ARROW-1382:
-----------------------------------------
A related example is an object that recursively contains itself (this example
is a bit contrived, but you could imagine a graph data structure with cyclic
references).
{code}
import pyarrow as pa
l = []
l.append(l)  # list.append returns None, so keep a reference to the list itself
original_object = l
# Serialize the object. This fails.
pa.serialize(original_object)
{code}
The {{pa.serialize}} call fails with
{code}
ArrowException: Unknown error: 'NoneType' object is not iterable
{code}
The error really should be
{code}
ArrowNotImplementedError: This object exceeds the maximum recursion depth. It
may contain itself recursively.
{code}
That's the error you get when you run the following
{code}
import pyarrow as pa
l1 = []
l2 = []
l1.append(l2)
l2.append(l1)
# This fails.
pa.serialize(l1)
{code}
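Both cases above come down to a container reaching itself during traversal. As a rough sketch (not pyarrow's implementation), a cycle could be detected before serialization by tracking the ids of the containers on the current traversal path. The helper name {{contains_cycle}} is hypothetical, and the sketch assumes only lists, tuples, and dict values need to be walked.

```python
def contains_cycle(obj, _path=None):
    """Return True if a nested list/tuple/dict can reach itself.

    Hypothetical helper for illustration; not part of pyarrow.
    """
    if _path is None:
        _path = set()  # ids of the containers on the current traversal path
    if not isinstance(obj, (list, tuple, dict)):
        return False  # scalars cannot take part in a cycle
    if id(obj) in _path:
        return True  # we reached a container we are already inside
    _path.add(id(obj))
    children = obj.values() if isinstance(obj, dict) else obj
    try:
        return any(contains_cycle(child, _path) for child in children)
    finally:
        # Leaving this container: shared (non-cyclic) references stay legal.
        _path.discard(id(obj))


# A list that contains itself.
l = []
l.append(l)

# Two lists that contain each other.
l1 = []
l2 = []
l1.append(l2)
l2.append(l1)

# A shared but acyclic reference, as in the original issue.
shared = [0]
acyclic = [shared, shared]
```

Because ids are removed from the path on the way back up, an object that merely appears twice is not reported as a cycle; only true self-reachability is.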
> Python objects containing multiple copies of the same object are serialized
> incorrectly
> ---------------------------------------------------------------------------------------
>
> Key: ARROW-1382
> URL: https://issues.apache.org/jira/browse/ARROW-1382
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Robert Nishihara
>
> If a Python object appears multiple times within a list/tuple/dictionary,
> then when pyarrow serializes the object, it will duplicate the object many
> times. This leads to a potentially huge expansion in the size of the object
> (e.g., the serialized version of {{100 * [np.zeros(10 ** 6)]}} will be 100
> times bigger than it needs to be).
> {code}
> import pyarrow as pa
> l = [0]
> original_object = [l, l]
> # Serialize and deserialize the object.
> buf = pa.serialize(original_object).to_buffer()
> new_object = pa.deserialize(buf)
> # This works.
> assert original_object[0] is original_object[1]
> # This fails.
> assert new_object[0] is new_object[1]
> {code}
> One potential way to address this is to use the Arrow dictionary encoding.
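To illustrate the general idea of storing each distinct object once and referring to it by index, here is a minimal pure-Python sketch. It is not the Arrow dictionary encoding and does not use any pyarrow API; it is a memo table keyed by {{id()}}, and the names {{encode}} and {{decode}} are hypothetical. It assumes only nested lists with scalar leaves.

```python
def encode(root):
    """Flatten nested lists into a table with one entry per distinct list."""
    table = []  # table[i] holds the encoded children of the i-th distinct list
    memo = {}   # id(list) -> index in table

    def visit(obj):
        if not isinstance(obj, list):
            return ("value", obj)
        key = id(obj)
        if key not in memo:
            memo[key] = len(table)
            table.append(None)  # reserve the slot first so cycles resolve
            table[memo[key]] = [visit(child) for child in obj]
        return ("ref", memo[key])

    return visit(root), table


def decode(root_ref, table):
    """Rebuild the object graph, preserving shared and cyclic references."""
    lists = [[] for _ in table]  # allocate every list before filling any

    def resolve(item):
        kind, payload = item
        return lists[payload] if kind == "ref" else payload

    for target, entry in zip(lists, table):
        target.extend(resolve(item) for item in entry)
    return resolve(root_ref)


l = [0]
original_object = [l, l]
root_ref, table = encode(original_object)
new_object = decode(root_ref, table)
```

In this sketch the shared list {{l}} is stored once in {{table}}, so the encoded size does not grow with the number of occurrences, and {{new_object[0] is new_object[1]}} holds after decoding.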