Hi folks,

For some time now I have been uncertain about the utility provided by
the arrow::Column C++ class. Fundamentally, it is a container for two
things:

* An arrow::Field object (name and data type)
* An arrow::ChunkedArray object for the data

It was added to the C++ library in ARROW-23 in March 2016 as the basis
for the arrow::Table class, which represents a collection of
ChunkedArray objects, usually coming from multiple RecordBatches. A
Table will sometimes have mostly single-chunk columns alongside a few
columns with many chunks.
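
As a rough sketch of how mixed chunk layouts arise, here is a plain
Python model with no Arrow dependency. Everything in it is an
assumption made for illustration: a record batch is modeled as a dict
of lists, a chunked column as a list of lists, and chunked_columns is
a made-up helper, not an Arrow or pyarrow function.

```python
from typing import Dict, List

# Hypothetical model: a record batch is a dict mapping column name to one
# contiguous list of values; a chunked column is a list of such lists.
RecordBatch = Dict[str, list]


def chunked_columns(batches: List[RecordBatch]) -> Dict[str, List[list]]:
    """Collect the per-batch arrays of each column into a list of chunks."""
    columns: Dict[str, List[list]] = {}
    for batch in batches:
        for name, values in batch.items():
            columns.setdefault(name, []).append(values)
    return columns


batches = [
    {"id": [1, 2], "score": [0.5, 0.25]},
    {"id": [3], "score": [1.0]},
    {"id": [4, 5, 6], "score": [0.0, 0.75, 0.125]},
]
table = chunked_columns(batches)

# A column added afterwards from one contiguous array has a single chunk,
# so the table ends up with mixed chunk layouts.
table["flag"] = [[True, False, True, True, False, True]]
```

Here "id" and "score" each carry one chunk per batch, while "flag"
has a single chunk covering the same number of rows.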

I'm concerned about continuing to maintain the Column class as it's
spilling complexity into computational libraries and bindings alike.

The Python Column class, for example, mostly forwards method calls to
the underlying ChunkedArray:

https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L355

If the developer wants to construct a Table or insert a new "column",
Column objects must generally be constructed, leading to boilerplate
without clear benefit.
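
To illustrate the boilerplate, here is a minimal sketch in plain
Python. The Field, ChunkedArray and Column classes are hypothetical
stand-ins, and table_from_columns and table_from_arrays are made-up
helpers; none of this is the actual Arrow or pyarrow API. It only
contrasts wrapping each array in a Column with passing the schema and
the arrays directly.

```python
from typing import List, NamedTuple


class Field(NamedTuple):
    # Hypothetical stand-in for arrow::Field: a name plus a data type.
    name: str
    type: str


class ChunkedArray(NamedTuple):
    # Hypothetical stand-in for arrow::ChunkedArray; chunks are plain lists.
    chunks: List[list]


class Column(NamedTuple):
    # Hypothetical stand-in for arrow::Column: a Field plus a ChunkedArray.
    field: Field
    data: ChunkedArray

    def num_chunks(self) -> int:
        # Pure forwarding to the wrapped ChunkedArray.
        return len(self.data.chunks)


def table_from_columns(columns: List[Column]) -> dict:
    """With Column: each array must first be wrapped together with its Field."""
    return {
        "schema": [c.field for c in columns],
        "data": [c.data for c in columns],
    }


def table_from_arrays(schema: List[Field], arrays: List[ChunkedArray]) -> dict:
    """Without Column: the schema and the chunked arrays are passed directly."""
    if len(schema) != len(arrays):
        raise ValueError("schema and arrays must have the same length")
    return {"schema": list(schema), "data": list(arrays)}


fields = [Field("a", "int64"), Field("b", "utf8")]
arrays = [ChunkedArray([[1, 2], [3]]), ChunkedArray([["x", "y", "z"]])]

# The extra wrapping step is the boilerplate in question.
with_column = table_from_columns([Column(f, a) for f, a in zip(fields, arrays)])
without_column = table_from_arrays(fields, arrays)
```

In this toy model both paths produce the same table, so the Column
wrapper adds a construction step without adding information.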

Since we're discussing building a more significant higher-level
DataFrame interface per past mailing list discussions, my preference
would be to consider removing the Column class to make the user- and
developer-facing data structures simpler. I hate to propose breaking
API changes, so it may not be practical at this point, but I wanted to
at least bring up the issue to see if others have opinions after
working with the library for a few years.

Thanks,
Wes
