hi Josh,

On Sat, May 19, 2018 at 1:53 PM, Joshua Storck <joshuasto...@gmail.com> wrote:
> I've updated the document with an extra section to include code snippets
> for alternative designs to address concerns raised here on the mailing list.

Thanks, I will look in more detail and reply more later

>
> MSVC 19 2017 supports covariant return types, and the code presented above
> successfully compiled on gcc.godbolt.org using that compiler.
>
> I would agree that the covariant_{shared, derived}_ptr classes themselves
> are a bit complicated, and I wasn't particularly happy with the fact that
> there's an extra pointer floating around. I've presented two alternatives
> in the document that would eliminate the need for those classes. The first
> solution makes the Array::type() function pure virtual and has sub-classes
> implement without covariant return types. Instead, the sub-classes provide
> a function with a different name that returns the concrete data type
> sub-class (e.g. - ListArray::list_type()). The second solution uses C++
> name hiding. In that design, the Array::type() function is not virtual and
> calls through to a pure virtual function called get_type(). The subclass
> would implement get_type() and simply return a shared pointer to a DataType
> object, but would also name hide Array::type() with its own implementation
> that returns a shared pointer the sub-class data type. While the name
> hiding keeps the API cleaner (i.e. - you always call type()), I find that
> it's one of the less used and more obscure features of C++ that tends to
> confuse more novice to intermediate C++ programmers.

I might be missing something, but what would be wrong with having a new method

virtual const DataType& concrete_type() const = 0;

and

const T& concrete_type() const;

in subclasses? Then you don't have to write down a static cast yourself

>
> The design does not suggest removing the ArrayData class, but instead
> removing the type_ member. It sounds like for the ArrayData class you are
> trying to eschew the C++ type hierarchy and move to a more C oriented
> design with type punning. I think the compromise here would be to have

The classes in array.h are strongly-typed accessor interfaces rather
than primary data structures, at least the way the code is written
right now.

The objective of internal::ArrayData is to have a canonical POD data
structure that is the same for all arrays and for each concrete Array
subclass to have a standard constructor. It additionally allows
functions to be written or generated that accept and return ArrayData.
This lets us avoid dynamic dispatch already in a number of places.

The need for ArrayData as it is now arose pretty much right away when
we started writing the Cast kernels in
https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
This makes it possible for functions to interact uniformly with an
Array's internal data without necessarily having to specialize on the
subtype. As an example, we have functions like
https://github.com/apache/arrow/blob/bed07e93b3b068ed5750946212ba2e0e6feaf61a/cpp/src/arrow/compute/kernels/util-internal.h#L99
which are agnostic to the concrete type, which is handled elsewhere.

My guess is removing the complete type object from ArrayData would
require significant refactoring of arrow/compute. I'm not sure where
the child metadata would be tracked, then (right now it's in
ArrayData::child_data).

As an aside, I had considered instead of ArrayData having its fields
be on the base Array type. The problem with this is that in internal
library, if you need to manipulate the data in an array (like the
type, null count, or buffers), you have some problems

* You must work in the domain of concrete subtypes, since constructing
the base Array should not be allowed
* You may end up defining an auxiliary data structure just like
ArrayData, and writing a function to copy the fields of each concrete
Array type into that auxiliary data structure

I admit that the need for ArrayData in its current form is not obvious
at first glance. In libraries like TensorFlow, things are vastly
simpler because the multidimensional array memory model is so much
simpler -- you can write functions that take "const Tensor&" for any
value type, whereas in Arrow the data structure depends on the runtime
type.

To summarize my view at the moment (and I will look in more detail to
understand the problems and solutions better), I would like to see
significantly more development progress in arrow/compute and see what
other kinds of problems we run into in that code, which will feature a
great deal of dynamic dispatch / kernel selection based on various
runtime type and hardware information.

> ArrayData store the type id but not the shared pointer to the DataType. You
> can take that approach one step further, and make the ArrayData a C struct
> with its own API that would successfully compile with a C compiler. You
> could then add a C++ wrapper that deals with that class and is used by the
> various C++ classes in the Array/DataType type hierarchy. The benefit of
> taking that approach is that you can easily inter-operate between C++ and
> C. In other words, a C only arrow API would be using the richer C++ API
> underneath, but the user of the API would only be able to obtain or pass
> ArrayData C structs from/to the API.

This would be pretty tricky right now because of memory management
(i.e. buffer subclasses). One could effectively implement a virtual
buffer interface in C but then we're reinventing wheels that C++
provides us.

>
> This proposal was in no way trying to address the static / compile-time
> problem. It was specifically targeted at making it easier to obtain
> concrete data types in the dynamic schema setting when dealing with
> arbitrary nesting of types. I do like the idea of a compile-time hierarchy
> (e.g. - StaticListArray<StaticStructArray<StaticInt32Array,
> StaticListArray<StaticInt32Array> >), but that will require a lot of
> template meta-programming. If the covariant_{shared,derived}_ptr classes
> above seem too complicated, I'm pretty sure that a compile-time hierarchy
> will seem immensely more complicated. Bottom line: I would want to see a
> concrete use case before implementing a compile-time type hierarchy with
> templates.
>

Got it, I agree that the compile-time hierarchy is a separate matter.

- Wes

> On Sat, May 19, 2018 at 9:06 AM, Wes McKinney <wesmck...@gmail.com> wrote:
>
>> hi Josh,
>>
>> Thank you for putting together this document. These changes are
>> interesting and worth giving some consideration. Some initial
>> reactions / questions to get a discussion going:
>>
>> * How is MSVC support for covariant return types? It doesn't look like
>> it's supported, which (I'm sorry to say) would be a deal breaker
>> https://msdn.microsoft.com/en-us/library/8z9feath.aspx
>> * I don't think creating a situation where arrow::DataType objects
>> have to be copied is a good idea. Schemas and types will be reused,
>> particular in a streaming data setting
>> * Adding a covariant shared_ptr seems very complicated; it seems the
>> primary benefit will be in code where we would currently have to
>> introduce a dynamic visitor pattern. Could the same benefit be reaped
>> by adding a _new_ virtual base method that returns a pointer or
>> const-ref to the concrete type? This is a less invasive change
>> * I think it is beneficial to have a simple internal data structure
>> (i.e. ArrayData) that describes any Arrow array object
>>
>> From reading this document, I am wondering if it would not be worth
>> thinking about creating a from-scratch set of array and type data
>> structures intended for use in a static / compile-time setting. Since
>> Arrow features "dynamic" schemas (in the sense that Avro is a more
>> dynamic version of Protocol Buffers or Thrift), we inevitably will be
>> dealing with the problem of dynamic dispatch from type information
>> known only at runtime. I agree that it would be useful to eliminate
>> usages of dynamic visitor pattern in the codebase where possible.
>>
>> - Wes
>>
>> On Sat, May 19, 2018 at 12:48 AM, joshuasto...@gmail.com
>> <joshuasto...@gmail.com> wrote:
>> > I've put together a proposal for using covariant return types in
>> Array::type() in the C++ library. I wanted to get some feedback before
>> putting together a PR in case it's too controversial or would require to
>> much re-factoring of the code:
>> >
>> > https://docs.google.com/document/d/14mLO9uNIcrb-yTj_
>> byBxwJeYbGy88JS9fQ6aOa46XRc/edit?usp=sharing
>> >
>> > It would be nice to use Google Doc's comment feature, but I would
>> imagine it may be safer to memorialize the discussion here on the mailing
>> list.
>>

Reply via email to