> This will break a lot of stuff, e.g. a lot of the IPC read path 
> (https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/reader.cc) 
> would have to be rewritten.

Just to highlight a point from the IPC part of the project -- the
entire read path for record batches does not touch the Array
subclasses. Only instances of ArrayData are emitted. A lot of effort
has gone into achieving the current level of tidiness in this part of
the codebase. I'm all for making the Array accessor classes more
usable if we can do so without needlessly breaking other things that
are working well right now.

On Sat, May 19, 2018 at 11:49 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> On Sat, May 19, 2018 at 10:28 PM, Joshua Storck <joshuasto...@gmail.com> 
> wrote:
>> In regards to the CopyData function that you mentioned in util-internal.h,
>> I don't think you need an ArrayData class to achieve that functionality. If
>> that data was in the Array base class, that function could have been
>> defined with either of these signatures:
>>
>> template <typename I, typename O>
>> void CopyData(const I& in, const O* out);
>>
>> void CopyData(const Array& in, const Array* out);
>>
>
> The Array types are all immutable; I'm not in favor of adding public
> APIs for mutating the internals (unless you grab the
> internal::ArrayData and enter the "no seatbelts zone").
>
> We have the very challenging task of
>
> * Building interfaces to facilitate library development by Arrow development
> * Provide usable interfaces for "end users" (other projects who are
> consuming the Arrow public API)
>
> I don't think any approach will be without tradeoffs.
>
>> In regards to child data, there's already a specialization for Lists, so
>> I'm not sure how ArrayData helps.
>
> If you take the data type out of the ArrayData, where do you keep
> track of the child metadata? In the document you say "ownership of the
> shared pointer to the DataType must be moved from the ArrayData class
> to the Array sub-classes". This will break a lot of stuff, e.g. a lot
> of the IPC read path
> (https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/reader.cc)
> would have to be rewritten.
>
>>
>> It might be best to go over some code in person as we might be talking
>> across each other.
>
> Let me look at the document more and also look at your parquet-cpp PR
> to get more context. There are some statements in the document that I
> disagree with (like "The MakeArray function is currently serving as a
> copy constructor." This isn't quite the case -- MakeArray constructs a
> concrete accessor type, like StringArray or ListArray, from generic
> ArrayData) -- I can add more detailed comments in the Google doc
>
> It would be good for some others to give some feedback. My general
> reaction is that the collateral damage / refactoring required for this
> change would be pretty high.
>
> - Wes
>
>>
>> On Sat, May 19, 2018 at 5:37 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>>
>>> hi Josh,
>>>
>>> On Sat, May 19, 2018 at 1:53 PM, Joshua Storck <joshuasto...@gmail.com>
>>> wrote:
>>> > I've updated the document with an extra section to include code snippets
>>> > for alternative designs to address concerns raised here on the mailing
>>> list.
>>>
>>> Thanks, I will look in more detail and reply more later
>>>
>>> >
>>> > MSVC 19 2017 supports covariant return types, and the code presented
>>> above
>>> > successfully compiled on gcc.godbolt.org using that compiler.
>>> >
>>> > I would agree that the covariant_{shared, derived}_ptr classes themselves
>>> > are a bit complicated, and I wasn't particularly happy with the fact that
>>> > there's an extra pointer floating around. I've presented two alternatives
>>> > in the document that would eliminate the need for those classes. The
>>> first
>>> > solution makes the Array::type() function pure virtual and has
>>> sub-classes
>>> > implement without covariant return types. Instead, the sub-classes
>>> provide
>>> > a function with a different name that returns the concrete data type
>>> > sub-class (e.g. - ListArray::list_type()). The second solution uses C++
>>> > name hiding. In that design, the Array::type() function is not virtual
>>> and
>>> > calls through to a pure virtual function called get_type(). The subclass
>>> > would implement get_type() and simply return a shared pointer to a
>>> DataType
>>> > object, but would also name hide Array::type() with its own
>>> implementation
>>> > that returns a shared pointer the sub-class data type. While the name
>>> > hiding keeps the API cleaner (i.e. - you always call type()), I find that
>>> > it's one of the less used and more obscure features of C++ that tends to
>>> > confuse more novice to intermediate C++ programmers.
>>>
>>> I might be missing something, but what would be wrong with having a new
>>> method
>>>
>>> virtual const DataType& concrete_type() const = 0;
>>>
>>> and
>>>
>>> const T& concrete_type() const;
>>>
>>> in subclasses? Then you don't have to write down a static cast yourself
>>>
>>> >
>>> > The design does not suggest removing the ArrayData class, but instead
>>> > removing the type_ member. It sounds like for the ArrayData class you are
>>> > trying to eschew the C++ type hierarchy and move to a more C oriented
>>> > design with type punning. I think the compromise here would be to have
>>>
>>> The classes in array.h are strongly-typed accessor interfaces rather
>>> than primary data structures, at least the way the code is written
>>> right now.
>>>
>>> The objective of internal::ArrayData is to have a canonical POD data
>>> structure that is the same for all arrays and for each concrete Array
>>> subclass to have a standard constructor. It additionally allows
>>> functions to be written or generated that accept and return ArrayData.
>>> This lets us avoid dynamic dispatch already in a number of places.
>>>
>>> The need for ArrayData as it is now arose pretty much right away when
>>> we started writing the Cast kernels in
>>> https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
>>> This makes it possible for functions to interact uniformly with an
>>> Array's internal data without necessarily having to specialize on the
>>> subtype. As an example, we have functions like
>>> https://github.com/apache/arrow/blob/bed07e93b3b068ed5750946212ba2e
>>> 0e6feaf61a/cpp/src/arrow/compute/kernels/util-internal.h#L99
>>> which are agnostic to the concrete type, which is handled elsewhere.
>>>
>>> My guess is removing the complete type object from ArrayData would
>>> require significant refactoring of arrow/compute. I'm not sure where
>>> the child metadata would be tracked, then (right now it's in
>>> ArrayData::child_data).
>>>
>>> As an aside, I had considered instead of ArrayData having its fields
>>> be on the base Array type. The problem with this is that in internal
>>> library, if you need to manipulate the data in an array (like the
>>> type, null count, or buffers), you have some problems
>>>
>>> * You must work in the domain of concrete subtypes, since constructing
>>> the base Array should not be allowed
>>> * You may end up defining an auxiliary data structure just like
>>> ArrayData, and writing a function to copy the fields of each concrete
>>> Array type into that auxiliary data structure
>>>
>>> I admit that the need for ArrayData in its current form is not obvious
>>> at first glance. In libraries like TensorFlow, things are vastly
>>> simpler because the multidimensional array memory model is so much
>>> simpler -- you can write functions that take "const Tensor&" for any
>>> value type, whereas in Arrow the data structure depends on the runtime
>>> type.
>>>
>>> To summarize my view at the moment (and I will look in more detail to
>>> understand the problems and solutions better), I would like to see
>>> significantly more development progress in arrow/compute and see what
>>> other kinds of problems we run into in that code, which will feature a
>>> great deal of dynamic dispatch / kernel selection based on various
>>> runtime type and hardware information.
>>>
>>> > ArrayData store the type id but not the shared pointer to the DataType.
>>> You
>>> > can take that approach one step further, and make the ArrayData a C
>>> struct
>>> > with its own API that would successfully compile with a C compiler. You
>>> > could then add a C++ wrapper that deals with that class and is used by
>>> the
>>> > various C++ classes in the Array/DataType type hierarchy. The benefit of
>>> > taking that approach is that you can easily inter-operate between C++ and
>>> > C. In other words, a C only arrow API would be using the richer C++ API
>>> > underneath, but the user of the API would only be able to obtain or pass
>>> > ArrayData C structs from/to the API.
>>>
>>> This would be pretty tricky right now because of memory management
>>> (i.e. buffer subclasses). One could effectively implement a virtual
>>> buffer interface in C but then we're reinventing wheels that C++
>>> provides us.
>>>
>>> >
>>> > This proposal was in no way trying to address the static / compile-time
>>> > problem. It was specifically targeted at making it easier to obtain
>>> > concrete data types in the dynamic schema setting when dealing with
>>> > arbitrary nesting of types. I do like the idea of a compile-time
>>> hierarchy
>>> > (e.g. - StaticListArray<StaticStructArray<StaticInt32Array,
>>> > StaticListArray<StaticInt32Array> >), but that will require a lot of
>>> > template meta-programming. If the covariant_{shared,derived}_ptr classes
>>> > above seem too complicated, I'm pretty sure that a compile-time hierarchy
>>> > will seem immensely more complicated. Bottom line: I would want to see a
>>> > concrete use case before implementing a compile-time type hierarchy with
>>> > templates.
>>> >
>>>
>>> Got it, I agree that the compile-time hierarchy is a separate matter.
>>>
>>> - Wes
>>>
>>> > On Sat, May 19, 2018 at 9:06 AM, Wes McKinney <wesmck...@gmail.com>
>>> wrote:
>>> >
>>> >> hi Josh,
>>> >>
>>> >> Thank you for putting together this document. These changes are
>>> >> interesting and worth giving some consideration. Some initial
>>> >> reactions / questions to get a discussion going:
>>> >>
>>> >> * How is MSVC support for covariant return types? It doesn't look like
>>> >> it's supported, which (I'm sorry to say) would be a deal breaker
>>> >> https://msdn.microsoft.com/en-us/library/8z9feath.aspx
>>> >> * I don't think creating a situation where arrow::DataType objects
>>> >> have to be copied is a good idea. Schemas and types will be reused,
>>> >> particular in a streaming data setting
>>> >> * Adding a covariant shared_ptr seems very complicated; it seems the
>>> >> primary benefit will be in code where we would currently have to
>>> >> introduce a dynamic visitor pattern. Could the same benefit be reaped
>>> >> by adding a _new_ virtual base method that returns a pointer or
>>> >> const-ref to the concrete type? This is a less invasive change
>>> >> * I think it is beneficial to have a simple internal data structure
>>> >> (i.e. ArrayData) that describes any Arrow array object
>>> >>
>>> >> From reading this document, I am wondering if it would not be worth
>>> >> thinking about creating a from-scratch set of array and type data
>>> >> structures intended for use in a static / compile-time setting. Since
>>> >> Arrow features "dynamic" schemas (in the sense that Avro is a more
>>> >> dynamic version of Protocol Buffers or Thrift), we inevitably will be
>>> >> dealing with the problem of dynamic dispatch from type information
>>> >> known only at runtime. I agree that it would be useful to eliminate
>>> >> usages of dynamic visitor pattern in the codebase where possible.
>>> >>
>>> >> - Wes
>>> >>
>>> >> On Sat, May 19, 2018 at 12:48 AM, joshuasto...@gmail.com
>>> >> <joshuasto...@gmail.com> wrote:
>>> >> > I've put together a proposal for using covariant return types in
>>> >> Array::type() in the C++ library. I wanted to get some feedback before
>>> >> putting together a PR in case it's too controversial or would require to
>>> >> much re-factoring of the code:
>>> >> >
>>> >> > https://docs.google.com/document/d/14mLO9uNIcrb-yTj_
>>> >> byBxwJeYbGy88JS9fQ6aOa46XRc/edit?usp=sharing
>>> >> >
>>> >> > It would be nice to use Google Doc's comment feature, but I would
>>> >> imagine it may be safer to memorialize the discussion here on the
>>> mailing
>>> >> list.
>>> >>
>>>

Reply via email to