Here is valid pyarrow code that works right now:
import pyarrow as pa
rb = pa.RecordBatch.from_arrays([
    pa.from_pylist([1, 2, 3]),
    pa.from_pylist(['foo', 'bar', 'baz'])
], names=['one', 'two'])
batches = [rb, rb.slice(0, 0)]  # the second batch has zero rows
stream = pa.InMemoryOutputStream()
writer = pa.StreamWriter(stream, rb.schema)
for batch in batches:
    writer.write_batch(batch)
writer.close()
reader = pa.StreamReader(stream.get_result())
results = [reader.get_next_batch(), reader.get_next_batch()]
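Reading back, both batches come through with the original schema; a quick
sanity check (a sketch, assuming the round trip above succeeds):

assert results[0].num_rows == 3
assert results[1].num_rows == 0
assert results[1].schema.equals(rb.schema)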
With the proposal to disallow length-0 batches, where should this break?
Probably StreamWriter.write_batch should raise ValueError, but now we have
to write:
for batch in batches:
    if len(batch) > 0:
        writer.write_batch(batch)
That seems worse, because now the user has to think about batch sizes. When
we write:
pa.Table.from_batches(results).to_pandas()
the zero-length batches get skipped over anyhow:
   one  two
0    1  foo
1    2  bar
2    3  baz
If you pass only zero-length batches, you get
   one  two
There are plenty of reasons why things would end up zero-length, like:
* Result of a predicate evaluation that filtered out all the data (see the
sketch after this list)
* Files (e.g. Parquet files) with 0 rows
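To make the first case concrete, here is a sketch of how an all-filtering
predicate yields a zero-length batch (using pandas for the predicate, and
assuming RecordBatch.from_pandas; exact names may vary by version):

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'one': [1, 2, 3], 'two': ['foo', 'bar', 'baz']})
filtered = df[df['one'] > 100]  # predicate matches no rows
empty_batch = pa.RecordBatch.from_pandas(filtered)
assert empty_batch.num_rows == 0  # zero rows, column types preserved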
My concern is being able to faithfully represent the in-memory results of
operations in an RPC/IPC setting. Having to deal with a null / no message
is much worse for us, because we will in many cases still have to construct
a length-0 RecordBatch in C++; the question is whether we're letting the
IPC loader do it or having to construct a "dummy" object based on the
schema.
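For illustration, constructing that "dummy" object by hand in Python looks
something like this (a sketch, assuming pa.schema / pa.array are
available; the real cost is doing the equivalent in C++ at every call
site):

import pyarrow as pa

schema = pa.schema([pa.field('one', pa.int64()),
                    pa.field('two', pa.string())])
# one empty array per field, typed from the schema
empty_arrays = [pa.array([], type=field.type) for field in schema]
dummy = pa.RecordBatch.from_arrays(empty_arrays, names=schema.names)
assert dummy.num_rows == 0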
On Fri, Apr 14, 2017 at 5:55 PM, Jacques Nadeau <[email protected]> wrote:
> Hey All,
>
> I had a quick comment on ARROW-783 that Wes responded to and I wanted to
> elevate the conversation here for a moment.
>
> My suggestion there was that we should disallow zero-length batches.
>
> Wes thought that should be an application level concern. I wanted to see
> what others thought.
>
> My general perspective is that zero-length batches are meaningless and
> better to disallow than make every application have special handling for
> them. In the JIRA, Wes noted that they have to deal with zero-length
> dataframes. Despite that, I don't think there is a requirement that there
> should be a 1:1 mapping between arrow record batches and dataframes. If
> someone wants to communicate empty things, no need for them to use Arrow.
>
> What do others think?
>