I dislike the current build system complications as well. However, in my opinion, combining the code bases would severely impact the progress of the parquet-cpp project and, by extension, the progress of the entire Parquet project. Combining would make much more sense if parquet-cpp were a mature project and codebase. But parquet-cpp (and the entire Parquet project) is still evolving continuously, with new features being added, including bloom filters, column encryption, and indexes.
If the two code bases were merged, it would be much more difficult to contribute to the parquet-cpp project, since the Arrow bindings would have to be supported as well. Please correct me if I am wrong here. Of the two evils, I think handling the build system and packaging duplication is much more manageable, since both are quite stable at this point.

Regarding "API changes cause awkward release coordination issues between Arrow and Parquet": could we make minor releases of parquet-cpp (with the needed API changes) as and when Arrow is released?

Regarding "we maintain Arrow conversion code in parquet-cpp for converting between the Arrow columnar memory format and Parquet": could this code be moved to the Arrow project, with parquet-cpp exposing its more stable low-level APIs? (A sketch of the layer I mean is below.)

I am also curious whether the Arrow and Parquet Java implementations have similar API compatibility issues.
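To be concrete, here is a minimal sketch of the conversion layer in question as a user sees it: the parquet::arrow reader turning a Parquet file on disk into an arrow::Table, sitting on top of Arrow's "platform" pieces (the arrow::io file interfaces and memory pool) and parquet-cpp's low-level column readers. The file name is just a placeholder, and exact signatures vary between releases, so treat the details as approximate:

    // Rough sketch: read a Parquet file into Arrow columnar memory via the
    // parquet/arrow conversion layer that currently lives in parquet-cpp.
    #include <arrow/api.h>
    #include <arrow/io/file.h>
    #include <parquet/arrow/reader.h>

    #include <iostream>
    #include <memory>

    int main() {
      // Arrow "platform" dependency: the file I/O abstraction
      std::shared_ptr<arrow::io::ReadableFile> infile;
      auto status = arrow::io::ReadableFile::Open("example.parquet", &infile);
      if (!status.ok()) {
        std::cerr << status.ToString() << std::endl;
        return 1;
      }

      // The Arrow <-> Parquet conversion layer under discussion
      std::unique_ptr<parquet::arrow::FileReader> reader;
      status = parquet::arrow::OpenFile(infile, arrow::default_memory_pool(),
                                        &reader);
      if (!status.ok()) {
        std::cerr << status.ToString() << std::endl;
        return 1;
      }

      // Decode all row groups into an Arrow table
      std::shared_ptr<arrow::Table> table;
      status = reader->ReadTable(&table);
      if (!status.ok()) {
        std::cerr << status.ToString() << std::endl;
        return 1;
      }

      std::cout << "Read " << table->num_rows() << " rows" << std::endl;
      return 0;
    }

If this layer lived in Arrow, parquet-cpp would only need to keep its lower-level reader/writer APIs stable.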
On Sat, Jul 28, 2018 at 7:45 PM Wes McKinney <wesmck...@gmail.com> wrote:

> hi folks,
>
> We've been struggling for quite some time with the development
> workflow between the Arrow and Parquet C++ (and Python) codebases.
>
> To explain the root issues:
>
> * parquet-cpp depends on "platform code" in Apache Arrow; this
> includes file interfaces, memory management, miscellaneous algorithms
> (e.g. dictionary encoding), etc. Note that before this "platform"
> dependency was introduced, there was significant duplicated code
> between these codebases and incompatible abstract interfaces for
> things like files
>
> * we maintain Arrow conversion code in parquet-cpp for converting
> between the Arrow columnar memory format and Parquet
>
> * we maintain Python bindings for parquet-cpp + Arrow interop in
> Apache Arrow. This introduces a circular dependency into our CI.
>
> * Substantial portions of our CMake build system and related tooling
> are duplicated between the Arrow and Parquet repos
>
> * API changes cause awkward release coordination issues between Arrow
> and Parquet
>
> I believe the best way to remedy the situation is to adopt a
> "Community over Code" approach and find a way for the Parquet and
> Arrow C++ development communities to operate out of the same code
> repository, i.e. the apache/arrow git repository.
>
> This would bring major benefits:
>
> * Shared CMake build infrastructure, developer tools, and CI
> infrastructure (Parquet is already being built as a dependency in
> Arrow's CI systems)
>
> * Shared packaging and release management infrastructure
>
> * Reduced / eliminated problems due to API changes (where we currently
> introduce breakage into our CI workflow when there is a breaking /
> incompatible change)
>
> * Arrow releases would include a coordinated snapshot of the Parquet
> implementation as it stands
>
> Continuing with the status quo has become unsatisfactory to me, and as
> a result I've become less motivated to work on the parquet-cpp
> codebase.
>
> The only Parquet C++ committer who is not an Arrow committer is Deepak
> Majeti. I think the issue of commit privileges could be resolved
> without too much difficulty or time.
>
> I also think that, if it is deemed truly necessary, the Apache Parquet
> community could create release scripts to cut a minimal versioned
> Apache Parquet C++ release.
>
> I know that some people are wary of monorepos and megaprojects, but as
> an example, TensorFlow is at least 10 times as large a project in
> terms of LOC and number of different platform components, and it seems
> to be getting along just fine. I think we should be able to work
> together as a community and function just as well.
>
> Interested in the opinions of others, and any other ideas for
> practical solutions to the above problems.
>
> Thanks,
> Wes

--
regards,
Deepak Majeti