In regards to your question about returning DataType&, one of the goals of
the proposal is to pass shared pointers to concrete data type sub-classes
in array constructors to increase type safety. If you return a reference in
the type() function, you would need to make a copy of the data type to get
a shared pointer to pass to the array sub-class constructor. That would be
violating one of the design principles you stated earlier, that creating a
situation where arrow::DataType objects have to be copied is undesirable.

On Sat, May 19, 2018 at 5:37 PM, Wes McKinney <wesmck...@gmail.com> wrote:

> hi Josh,
>
> On Sat, May 19, 2018 at 1:53 PM, Joshua Storck <joshuasto...@gmail.com>
> wrote:
> > I've updated the document with an extra section to include code snippets
> > for alternative designs to address concerns raised here on the mailing
> list.
>
> Thanks, I will look in more detail and reply more later
>
> >
> > MSVC 19 2017 supports covariant return types, and the code presented
> above
> > successfully compiled on gcc.godbolt.org using that compiler.
> >
> > I would agree that the covariant_{shared, derived}_ptr classes themselves
> > are a bit complicated, and I wasn't particularly happy with the fact that
> > there's an extra pointer floating around. I've presented two alternatives
> > in the document that would eliminate the need for those classes. The
> first
> > solution makes the Array::type() function pure virtual and has
> sub-classes
> > implement without covariant return types. Instead, the sub-classes
> provide
> > a function with a different name that returns the concrete data type
> > sub-class (e.g. - ListArray::list_type()). The second solution uses C++
> > name hiding. In that design, the Array::type() function is not virtual
> and
> > calls through to a pure virtual function called get_type(). The subclass
> > would implement get_type() and simply return a shared pointer to a
> DataType
> > object, but would also name hide Array::type() with its own
> implementation
> > that returns a shared pointer the sub-class data type. While the name
> > hiding keeps the API cleaner (i.e. - you always call type()), I find that
> > it's one of the less used and more obscure features of C++ that tends to
> > confuse more novice to intermediate C++ programmers.
>
> I might be missing something, but what would be wrong with having a new
> method
>
> virtual const DataType& concrete_type() const = 0;
>
> and
>
> const T& concrete_type() const;
>
> in subclasses? Then you don't have to write down a static cast yourself
>
> >
> > The design does not suggest removing the ArrayData class, but instead
> > removing the type_ member. It sounds like for the ArrayData class you are
> > trying to eschew the C++ type hierarchy and move to a more C oriented
> > design with type punning. I think the compromise here would be to have
>
> The classes in array.h are strongly-typed accessor interfaces rather
> than primary data structures, at least the way the code is written
> right now.
>
> The objective of internal::ArrayData is to have a canonical POD data
> structure that is the same for all arrays and for each concrete Array
> subclass to have a standard constructor. It additionally allows
> functions to be written or generated that accept and return ArrayData.
> This lets us avoid dynamic dispatch already in a number of places.
>
> The need for ArrayData as it is now arose pretty much right away when
> we started writing the Cast kernels in
> https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
> This makes it possible for functions to interact uniformly with an
> Array's internal data without necessarily having to specialize on the
> subtype. As an example, we have functions like
> https://github.com/apache/arrow/blob/bed07e93b3b068ed5750946212ba2e
> 0e6feaf61a/cpp/src/arrow/compute/kernels/util-internal.h#L99
> which are agnostic to the concrete type, which is handled elsewhere.
>
> My guess is removing the complete type object from ArrayData would
> require significant refactoring of arrow/compute. I'm not sure where
> the child metadata would be tracked, then (right now it's in
> ArrayData::child_data).
>
> As an aside, I had considered instead of ArrayData having its fields
> be on the base Array type. The problem with this is that in internal
> library, if you need to manipulate the data in an array (like the
> type, null count, or buffers), you have some problems
>
> * You must work in the domain of concrete subtypes, since constructing
> the base Array should not be allowed
> * You may end up defining an auxiliary data structure just like
> ArrayData, and writing a function to copy the fields of each concrete
> Array type into that auxiliary data structure
>
> I admit that the need for ArrayData in its current form is not obvious
> at first glance. In libraries like TensorFlow, things are vastly
> simpler because the multidimensional array memory model is so much
> simpler -- you can write functions that take "const Tensor&" for any
> value type, whereas in Arrow the data structure depends on the runtime
> type.
>
> To summarize my view at the moment (and I will look in more detail to
> understand the problems and solutions better), I would like to see
> significantly more development progress in arrow/compute and see what
> other kinds of problems we run into in that code, which will feature a
> great deal of dynamic dispatch / kernel selection based on various
> runtime type and hardware information.
>
> > ArrayData store the type id but not the shared pointer to the DataType.
> You
> > can take that approach one step further, and make the ArrayData a C
> struct
> > with its own API that would successfully compile with a C compiler. You
> > could then add a C++ wrapper that deals with that class and is used by
> the
> > various C++ classes in the Array/DataType type hierarchy. The benefit of
> > taking that approach is that you can easily inter-operate between C++ and
> > C. In other words, a C only arrow API would be using the richer C++ API
> > underneath, but the user of the API would only be able to obtain or pass
> > ArrayData C structs from/to the API.
>
> This would be pretty tricky right now because of memory management
> (i.e. buffer subclasses). One could effectively implement a virtual
> buffer interface in C but then we're reinventing wheels that C++
> provides us.
>
> >
> > This proposal was in no way trying to address the static / compile-time
> > problem. It was specifically targeted at making it easier to obtain
> > concrete data types in the dynamic schema setting when dealing with
> > arbitrary nesting of types. I do like the idea of a compile-time
> hierarchy
> > (e.g. - StaticListArray<StaticStructArray<StaticInt32Array,
> > StaticListArray<StaticInt32Array> >), but that will require a lot of
> > template meta-programming. If the covariant_{shared,derived}_ptr classes
> > above seem too complicated, I'm pretty sure that a compile-time hierarchy
> > will seem immensely more complicated. Bottom line: I would want to see a
> > concrete use case before implementing a compile-time type hierarchy with
> > templates.
> >
>
> Got it, I agree that the compile-time hierarchy is a separate matter.
>
> - Wes
>
> > On Sat, May 19, 2018 at 9:06 AM, Wes McKinney <wesmck...@gmail.com>
> wrote:
> >
> >> hi Josh,
> >>
> >> Thank you for putting together this document. These changes are
> >> interesting and worth giving some consideration. Some initial
> >> reactions / questions to get a discussion going:
> >>
> >> * How is MSVC support for covariant return types? It doesn't look like
> >> it's supported, which (I'm sorry to say) would be a deal breaker
> >> https://msdn.microsoft.com/en-us/library/8z9feath.aspx
> >> * I don't think creating a situation where arrow::DataType objects
> >> have to be copied is a good idea. Schemas and types will be reused,
> >> particular in a streaming data setting
> >> * Adding a covariant shared_ptr seems very complicated; it seems the
> >> primary benefit will be in code where we would currently have to
> >> introduce a dynamic visitor pattern. Could the same benefit be reaped
> >> by adding a _new_ virtual base method that returns a pointer or
> >> const-ref to the concrete type? This is a less invasive change
> >> * I think it is beneficial to have a simple internal data structure
> >> (i.e. ArrayData) that describes any Arrow array object
> >>
> >> From reading this document, I am wondering if it would not be worth
> >> thinking about creating a from-scratch set of array and type data
> >> structures intended for use in a static / compile-time setting. Since
> >> Arrow features "dynamic" schemas (in the sense that Avro is a more
> >> dynamic version of Protocol Buffers or Thrift), we inevitably will be
> >> dealing with the problem of dynamic dispatch from type information
> >> known only at runtime. I agree that it would be useful to eliminate
> >> usages of dynamic visitor pattern in the codebase where possible.
> >>
> >> - Wes
> >>
> >> On Sat, May 19, 2018 at 12:48 AM, joshuasto...@gmail.com
> >> <joshuasto...@gmail.com> wrote:
> >> > I've put together a proposal for using covariant return types in
> >> Array::type() in the C++ library. I wanted to get some feedback before
> >> putting together a PR in case it's too controversial or would require to
> >> much re-factoring of the code:
> >> >
> >> > https://docs.google.com/document/d/14mLO9uNIcrb-yTj_
> >> byBxwJeYbGy88JS9fQ6aOa46XRc/edit?usp=sharing
> >> >
> >> > It would be nice to use Google Doc's comment feature, but I would
> >> imagine it may be safer to memorialize the discussion here on the
> mailing
> >> list.
> >>
>

Reply via email to