hi Ryan, comments inline
On Wed, Jul 27, 2016 at 1:36 PM, Ryan Lewis <[email protected]> wrote: > Hi, > > I am interested in contributing to parquet-cpp. I would like to introduce > myself to the developer community, and as per suggestion from a private > conversation with Wes McKinney, briefly discuss where I think I can add > some value. What I have noticed is that the library is not leveraging the > C++ standard library, and instead is recreating standard behavior. Two > concrete examples are the InputStream class and its children, and the > Buffer objects. These are perhaps modeled by the input iterator concept: > http://en.cppreference.com/w/cpp/concept/InputIterator (or ideally some > refinement of it) and the allocator concept > http://en.cppreference.com/w/cpp/concept/Allocator. Components of the > standard library which implements these concepts are listed in the previous > links. > Can you describe or give some concrete examples of how places where the current parquet-cpp MemoryAllocator concept could be altered to use a stateful allocator implementation conforming to std::allocator / std::allocator_traits? It is not clear to me that we are recreating standard behavior; we have a working approach to user-defined memory allocation and tracking that is distinct from the STL approach, even if there is semantic functional overlap. We have discussed creating a conforming STL allocator containing a pointer to a particular parquet::MemoryAllocator in its state, and I think this is a good idea independently (we are using the default allocator in a number of places, which is bad because these allocations may not be known to the user's custom allocator). The intent of these classes is to provide users a way to define custom implementations / subclasses of concepts that are not known to the library at compile time (and instead using dynamic dispatch / virtual functions to invoke the correct symbols). The risk I see is that if you consistently use the STL template-based approach to defining custom allocators, then this will cascade throughout much of the library (since any current class which utilizes parquet::MemoryAllocator will now need to know the type of the user allocator at compile time). If parquet-cpp were a self-contained system (rather than intended as a pluggable component), it would be a different story. Maybe I'm thinking about it wrong, this is not my area of expertise, having only browsed the STL documentation on custom allocators. Here is an example of a more advanced memory management subsystem in Apache Kudu (see the abstract interface in BufferAllocator): https://github.com/apache/incubator-kudu/blob/master/src/kudu/util/memory/memory.h See also the STL-compliant allocator that hooks into Kudu's memory arenas here: https://github.com/apache/incubator-kudu/blob/master/src/kudu/util/memory/arena.h Usage: https://github.com/apache/incubator-kudu/blob/master/src/kudu/tablet/tablet_metrics.cc#L275 (note that Kudu is a self-contained system so not apples-to-apples) The parquet::Buffer seems slightly orthogonal to memory allocators -- this was intended to make the semantics of memory ownership and deallocation more flexible where needed. > There is a decent amount of code in parquet-cpp and so its a bit > challenging for me to provide a constructive thought on how to perform > refactors (which I would happily do), to have the library use the standard > library more, however, I also don't want to do something if it is not a > welcome change. Therefore feedback from the developer community would be > beneficial. I would like to understand the argument a bit better and see some code specifics. Would be great to get other C++ developers in the ASF community to weigh in once we've identified the central pros / cons. Thanks Wes > > Best, > -rhl
