Re: parquet-cpp development

Wes McKinney Thu, 28 Jul 2016 09:54:55 -0700

On Thu, Jul 28, 2016 at 6:47 AM, Ryan Lewis <[email protected]> wrote:
> Hi Wes,
>
> I can absolutely do that sure. Let me come up with some examples and write
> back. You might be interested to read this, which is still a WIP, but,
> most C++ developers follow these guidelines (even before there was a core
> guidelines)
>
> https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md
>
> I guess everything I am saying is just a guideline:
> https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md#Rsl-lib
>


I agree that wheels should not be reinvented, but it needs to be
demonstrated whether this is the case. The linked guidelines document
states "Prefer the standard library to other libraries". What would
that prescribe here?

The requirements for memory allocation in parquet-cpp have been that:

1. Users have a clear way to define custom allocators (perhaps these
would be better named "memory pools"). Here is an example of an
allocator that hooks into an (also possibly user-defined) abstract
Arrow memory pool
https://github.com/apache/arrow/blob/master/cpp/src/arrow/parquet/io.h#L41

2. The user memory allocator / pool type may not be known at compile
time by the code in parquet-cpp that uses it

It's possible that requirement #2 may not be necessary or desirable,
but this could potentially result in a significant inversion of the
code base into templates / header files (i.e. if the user allocator
type must cascade to each place in the code where memory is allocated)
from the current structure of primarily compiled object code utilizing
abstract interfaces.

The parquet::MemoryAllocator is more about providing for memory
accounting than it is about replicating any functionality in the STL.
However, I recall this commit:

https://github.com/apache/parquet-cpp/commit/225fba78b#diff-7e6be4c0cc4230e23b71a0bf6f42b1a9R147

Since C++11 allows for stateful allocators it would be preferable to
implement an STL-conforming allocator that requests memory from an
abstract parquet::MemoryAllocator (like the Kudu example I cited) so
that STL templates like std::vector can be used without hiding
allocations from the user accounting (in some applications peak memory
use or memory limits may be important -- e.g. Impala and Vertica track
memory use carefully).

- Wes

>
> On Wed, Jul 27, 2016 at 11:25 PM Wes McKinney <[email protected]> wrote:
>
>> hi Ryan,
>>
>> comments inline
>>
>> On Wed, Jul 27, 2016 at 1:36 PM, Ryan Lewis <[email protected]> wrote:
>> > Hi,
>> >
>> > I am interested in contributing to parquet-cpp. I would like to introduce
>> > myself to the developer community, and as per suggestion from a private
>> > conversation with Wes McKinney, briefly discuss where I think I can add
>> > some value. What I have noticed is that the library is not leveraging the
>> > C++ standard library, and instead is recreating standard behavior. Two
>> > concrete examples are the InputStream class and its children, and the
>> > Buffer objects. These are perhaps modeled by the input iterator concept:
>> > http://en.cppreference.com/w/cpp/concept/InputIterator (or ideally some
>> > refinement of it) and the allocator concept
>> > http://en.cppreference.com/w/cpp/concept/Allocator. Components of the
>> > standard library which implements these concepts are listed in the
>> previous
>> > links.
>> >
>>
>> Can you describe or give some concrete examples of how places where
>> the current parquet-cpp MemoryAllocator concept could be altered to
>> use a stateful allocator implementation conforming to std::allocator /
>> std::allocator_traits? It is not clear to me that we are recreating
>> standard behavior; we have a working approach to user-defined memory
>> allocation and tracking that is distinct from the STL approach, even
>> if there is semantic functional overlap. We have discussed creating a
>> conforming STL allocator containing a pointer to a particular
>> parquet::MemoryAllocator in its state, and I think this is a good idea
>> independently (we are using the default allocator in a number of
>> places, which is bad because these allocations may not be known to the
>> user's custom allocator).
>>
>> The intent of these classes is to provide users a way to define custom
>> implementations / subclasses of concepts that are not known to the
>> library at compile time (and instead using dynamic dispatch / virtual
>> functions to invoke the correct symbols). The risk I see is that if
>> you consistently use the STL template-based approach to defining
>> custom allocators, then this will cascade throughout much of the
>> library (since any current class which utilizes
>> parquet::MemoryAllocator will now need to know the type of the user
>> allocator at compile time). If parquet-cpp were a self-contained
>> system (rather than intended as a pluggable component), it would be a
>> different story. Maybe I'm thinking about it wrong, this is not my
>> area of expertise, having only browsed the STL documentation on custom
>> allocators.
>>
>> Here is an example of a more advanced memory management subsystem in
>> Apache Kudu (see the abstract interface in BufferAllocator):
>>
>>
>> https://github.com/apache/incubator-kudu/blob/master/src/kudu/util/memory/memory.h
>>
>> See also the STL-compliant allocator that hooks into Kudu's memory arenas
>> here:
>>
>>
>> https://github.com/apache/incubator-kudu/blob/master/src/kudu/util/memory/arena.h
>> Usage:
>> https://github.com/apache/incubator-kudu/blob/master/src/kudu/tablet/tablet_metrics.cc#L275
>>
>> (note that Kudu is a self-contained system so not apples-to-apples)
>>
>> The parquet::Buffer seems slightly orthogonal to memory allocators --
>> this was intended to make the semantics of memory ownership and
>> deallocation more flexible where needed.
>>
>> > There is a decent amount of code in parquet-cpp and so its a bit
>> > challenging for me to provide a constructive thought on how to perform
>> > refactors (which I would happily do), to have the library use the
>> standard
>> > library more, however, I also don't want to do something if it is not a
>> > welcome change. Therefore feedback from the developer community would be
>> > beneficial.
>>
>> I would like to understand the argument a bit better and see some code
>> specifics. Would be great to get other C++ developers in the ASF
>> community to weigh in once we've identified the central pros / cons.
>>
>> Thanks
>> Wes
>>
>> >
>> > Best,
>> > -rhl
>>

Re: parquet-cpp development

Reply via email to