Wes McKinney created ARROW-1681:
-----------------------------------
Summary: [Python] Error writing with nulls in lists
Key: ARROW-1681
URL: https://issues.apache.org/jira/browse/ARROW-1681
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.7.1
Reporter: Wes McKinney
Fix For: 0.8.0
Created from https://github.com/apache/arrow/issues/1208
Hi,
Not sure if this is related to, or the same as, ARROW-1584, but I can't seem to
find a way to handle list arrays that occasionally contain only empty lists.
To reproduce:
{code}
import pyarrow as pa

na = []  # None, [""]
arrays = {
    'c1': pa.array([["test"], na, na], type=pa.list_(pa.string())),
    'c2': pa.array([na, na, na], type=pa.list_(pa.string())),
}
rb = pa.RecordBatch.from_arrays(list(arrays.values()), list(arrays.keys()))
df = rb.to_pandas()

# Serializing the round-tripped DataFrame fails:
pa.serialize_pandas(df)
# > ArrowNotImplementedError: Unable to convert type: null

# Converting back to a Table and writing it fails the same way:
tbl = pa.Table.from_pandas(df)
sink = pa.BufferOutputStream()
writer = pa.RecordBatchFileWriter(sink, tbl.schema)
writer.write_table(tbl)
# > ArrowNotImplementedError: Unable to convert type: null
{code}
In my use case I'm processing data in batches where individual fields contain
lists of strings. Some batches, however, may contain only empty lists, and
Arrow doesn't currently seem to have a representation that handles this case.
Also, since I'm serializing the batches into a single file/stream, their
schemas need to be consistent, which is why I tried explicitly specifying the
type of the array as list_(string). The only workaround I've found is to
replace empty lists with [""], but that implies lots of unnecessary glue code
on the client side. Is there a better workaround until this is fixed in an
official conda release?
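For illustration, the [""] workaround mentioned above might look roughly like the sketch below. It is pure Python and assumes each column is a plain list of Python lists of strings. The names PLACEHOLDER, pad_empty_lists and unpad_empty_lists are hypothetical and not part of pyarrow.

```python
# Hypothetical glue code for the [""] workaround; not a pyarrow API.
PLACEHOLDER = [""]  # stands in for an empty list


def pad_empty_lists(column):
    """Replace every empty list with [""] so no column holds only empty lists."""
    return [list(PLACEHOLDER) if len(item) == 0 else item for item in column]


def unpad_empty_lists(column):
    """Undo pad_empty_lists after reading the data back."""
    return [[] if item == PLACEHOLDER else item for item in column]


# Same shape as the reproduction above: 'c2' holds only empty lists.
batch = {"c1": [["test"], [], []], "c2": [[], [], []]}

padded = {name: pad_empty_lists(col) for name, col in batch.items()}
restored = {name: unpad_empty_lists(col) for name, col in padded.items()}
```

One limitation of this approach: a genuine [""] value cannot be told apart from a padded empty list once the data has been written, so it only suits data where a single empty string never occurs on its own.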