Zachary Jablons created ARROW-5140:
--------------------------------------

             Summary: [Bug?][Parquet] Can write a jagged array column of 
strings to disk, but hit `ArrowNotImplementedError` on read
                 Key: ARROW-5140
                 URL: https://issues.apache.org/jira/browse/ARROW-5140
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.12.0
         Environment: Debian 8
            Reporter: Zachary Jablons


h1. Description

I encountered an issue on a proprietary dataset where we have a schema that 
looks roughly like:

{noformat}
 |-- ids: array (nullable = true)
 |    |-- element: string (containsNull = true)
{noformat}

I was able to write this dataset to Parquet without a problem (using 
{{pq.write_table}}), but upon reading it (using {{pq.read_table}}) I 
encountered the following error: {{ArrowNotImplementedError: Nested data 
conversions not implemented for chunked array outputs}} (a full stack trace is 
in the gist linked below).
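
To make the shape of the failure concrete, a minimal sketch of the round trip looks roughly like the following (this is not the exact code from the gist; the file name and the row/element counts are illustrative, taken from the observations below):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# A jagged (variable-length) list<string> column, analogous to the "ids"
# column in the schema above.
ids = pa.array(
    [["x" * 20] * (i % 15) for i in range(130000)],
    type=pa.list_(pa.string()),
)
table = pa.Table.from_arrays([ids], names=["ids"])

pq.write_table(table, "repro.parquet")  # writing succeeds
pq.read_table("repro.parquet")          # reading fails with ArrowNotImplementedError
                                        # at this scale on 0.12.0 (per the report above)
{code}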

I believe this is pretty confusing, because I was able to serialize the table 
but not deserialize it. I also found that it does not happen at all sizes of 
the dataset - a smaller sample did not hit the issue! So I built a small 
reproduction harness and investigated where this could happen:
h2. Further investigation
 * If I cap the maximum number of elements per row of {{ids}}, I find that 
reducing the cap lets me serialize/deserialize more rows (a rough sketch of 
this kind of sweep is included after this list)
 * With the cap at 15 elements per row and each element at most 20 
characters, the failure shows up at roughly 1.3e5 rows
 * At the limit of my willingness to spend time building giant dataframes to 
investigate this, I haven't been able to reproduce the issue with e.g. longs 
instead of strings
 * Another column in this dataset consists of strings much longer than this 
column's (even with each row's strings concatenated); its total character 
count is roughly 3x that of the trouble column, yet I have no issue 
serializing/deserializing that column.
 * The fact that each array has a different length doesn't seem to matter - 
if I force every row to have ~14 elements, it fails with the same error even 
at 1e5 rows.
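
For reference, the kind of parameter sweep described above can be sketched roughly like this (again, not the gist code; the {{round_trip}} helper, file name, and exact values are just illustrative):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

def round_trip(n_rows, max_elems, make_value):
    # Build a jagged list column whose row lengths cycle between 1 and
    # max_elems, write it to Parquet, and try to read it back.
    values = [[make_value(i, j) for j in range(i % max_elems + 1)]
              for i in range(n_rows)]
    table = pa.Table.from_arrays([pa.array(values)], names=["ids"])
    pq.write_table(table, "repro.parquet")
    try:
        pq.read_table("repro.parquet")
        return "ok"
    except pa.ArrowNotImplementedError as exc:
        return "failed: {}".format(exc)

# Per the observations above: ~20-character strings with a cap of 15 elements
# per row hit the error at around 1.3e5 rows, while the same shape with
# int64 values does not seem to.
print(round_trip(130000, 15, lambda i, j: "x" * 20))
print(round_trip(130000, 15, lambda i, j: i * j))
{code}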

h1. Reproduction code

This [gist|https://gist.github.com/zmjjmz/1bf738966d2df147a4fae7268ee3d812] 
should have both a stacktrace and reproduction code.
h2. Version info

{noformat}
pyarrow==0.12.0
parquet==1.2
{noformat}
h1. Mea culpa

I copy-pasted this from Github on request 
([https://github.com/apache/arrow/issues/4115]), and Jira formatting is a 
nightmare compared to markdown, so I apologize.


